
IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO) 2015

Enhancing the Performance of Automated Grading of Descriptive Answers through Stacking

Sunil Kumar C
Research and Development Center
Bharathiar University
Coimbatore, India
sunil_sixsigma@yahoo.com

R.J.RamaSree
Department of Computer Science
Rashtriya Sanskrit Vidyapeetha
Tirupati, India
rjramasree@yahoo.com
Abstract: In this paper, the Stacking ensembling technique is implemented on ten examination answers data sets using the Weka workbench. Neural Network, Decision Tree and Support Vector Machine are the machine learning algorithms chosen to perform stacking. As stacking requires base classifiers and meta classifiers to be defined by the experimenter, multiple experiments are conducted to evaluate the various combinations and permutations possible with the chosen algorithms. It is found that defining Neural Network, Decision Tree and Support Vector Machine as base classifiers and Support Vector Machine as meta classifier yielded the best classification, measured through 10-fold cross-validation across the Accuracy, Kappa, Mean Absolute Error and F Score metrics. It is observed that the stacking approach yielded better performance across the metrics when compared to the performances obtained from using each of the machine learning algorithms individually. The above base classifier and meta classifier combination is thus confirmed as the combination that yielded the best values across the performance metrics. With this combination, the best and worst classification accuracies obtained on the data sets are 89% and 59% respectively, with an average accuracy of 71% across the data sets.
Index Terms: Text classification, ensembling, stacking, neural network, decision tree, support vector machine.

I. INTRODUCTION
Automated grading of descriptive answers can be viewed as a text classification problem. Supervised learning methods can be applied to classify the answers into the appropriate ordinal rubric based on the likelihood suggested by training samples. The supervised learning process requires extracting various text features from the documents meant as the training set and then training a model using a sophisticated machine learning algorithm.
Previous research inferred that Decision Tree, Neural Network and Support Vector Machine are individually the best performing machine learning algorithms [1] for the problem of automated grading of descriptive answers. For the research covered in this paper, an exploratory idea is carried out to verify whether uniting the power of all three machine learning algorithms and collectively classifying the answers would be better than classification done by the individual algorithms.
Uniting the power of multiple machine learning algorithms is possible via stacking ensemble learning, otherwise called stacking ensembling. In stacking, multiple classifiers that belong to entirely different classes of machine learning methods use all the training data and predict the classes. For classification, a voting method is then applied to determine the correct class [2].
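To make the two-level structure concrete, the following minimal R sketch illustrates stacking on the built-in iris data, mirroring the three algorithm families used in this paper. It is an illustration only: the experiments reported here use Weka's Stacking implementation (Section 2.6), and the rpart, nnet and e1071 packages are this sketch's own choices, not named in the paper.

    library(rpart)   # decision tree
    library(nnet)    # feed-forward neural network
    library(e1071)   # SVM

    train <- iris

    # Level 0: each base classifier is trained on all the training data.
    base_tree <- rpart(Species ~ ., data = train)
    base_nn   <- nnet(Species ~ ., data = train, size = 4, trace = FALSE)
    base_svm  <- svm(Species ~ ., data = train, kernel = "linear")

    # The base classifiers' class predictions become the meta-level features.
    # (For simplicity they predict on their own training data here; Weka's
    # Stacking instead derives these features via internal cross-validation.)
    meta_data <- data.frame(
      p_tree  = predict(base_tree, train, type = "class"),
      p_nn    = factor(predict(base_nn, train, type = "class"),
                       levels = levels(train$Species)),
      p_svm   = predict(base_svm, train),
      Species = train$Species
    )

    # Level 1: the meta classifier (an SVM here, mirroring the best
    # configuration reported in this paper) combines the base predictions.
    meta_model <- svm(Species ~ ., data = meta_data, kernel = "linear")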
In [3], the researcher stated that by combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced. The research output covered in this paper aims to substantiate this statement and confirm that it holds for the automated grading of descriptive answers domain too.
The rest of this paper is organized as follows. Section 2 discusses the data used, the experimental setup, and the preliminaries of the tools and techniques used. Section 3 describes the models built and the measurements made during the experiments. Analysis of the experimental results and conclusions are dealt with in Section 4.
II. EXPERIMENTAL SETUP
The setup in which the experiments are conducted for this paper is specified in this section.
2.1. Data collection
In February 2012, The William and Flora Hewlett Foundation (Hewlett) sponsored the Automated Student Assessment Prize (ASAP) [4], challenging machine learning specialists and data scientists to develop an automated scoring algorithm for student-written essays. As part of this competition, the competitors were provided with hand-scored essays under 10 different prompts, i.e., questions to which answers were obtained from students. These answers form the datasets. All 10 essay prompts are used for the purpose of this research.
2.2. Data characteristics
All the graded essays from ASAP conform to specific data characteristics. All responses were written by students of Grade 10. On average, each essay is approximately 50 words in length. Some are more dependent upon source materials than others [4]. Each document is in ASCII text followed by a human score; a resolved final score was given in cases where there was a variance between the scores provided by the two human scorers [5]. For the purpose of evaluating the performance of a model, the score predicted by the model needs to comply with the resolved human score in the training example.
The data used for training and validation of the models are answers written by students for 10 different questions. The data for a question is considered one unique dataset, so we have a total of 10 datasets. The questions that students were asked to respond to are from diverse subjects ranging from Chemistry to English Language Arts to Biology. Table 1 and Table 2 detail the number of samples and the distribution of scores across the ordinal rubrics possible for each dataset.
2.3. Hardware and the software used for the research
All experiments were executed on a Dell Latitude E5430 laptop. The laptop is configured with an Intel Core i5-3350M CPU @ 2.70 GHz and 4 GB RAM; however, the Weka workbench is configured to use a maximum of 1 GB. The laptop runs the Windows 7 64-bit operating system.
For the purpose of designing and evaluating the experiments, the R statistical language and the Weka machine learning workbench are used. R is used for all pre-processing tasks including dimensionality reduction, whereas Weka is used for building the models using the pre-processed data and for obtaining the model performance statistics.
The R statistical language is a GNU project that offers a language and environment for statistical computing and graphics. R provides a wide variety of statistical and graphical techniques such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc. R is highly extensible as well [6].
A machine learning workbench called Weka is used for the experiments. Weka stands for Waikato Environment for Knowledge Analysis and is a free offering from the University of Waikato, New Zealand. This workbench has a user-friendly interface and incorporates numerous options to develop and evaluate machine learning models [7] [8]. These models can be utilized for a variety of purposes, including automated essay scoring.
2.4. Data pre-processing
To be able to build the models required for the research, the raw datasets that consist of the answers and scores need to be transformed into a format that can be used by Weka. The steps involved in transforming the raw data into a usable form are termed pre-processing. All required pre-processing for experimentation is done using an R language text mining library called tm [9].
The first step is to read the raw data file into an R vector. Post reading, the vector is used to create a corpus, which refers to a collection of text documents. Each text document in this case refers to an answer in the raw data file. The corpus is then passed through a series of R functions, again part of the tm package, to change the text to lower case, strip white spaces, and remove numbers, stop words and punctuation, as sketched below.
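The following R sketch shows these steps using the tm package; the input file name and its one-answer-per-line format are assumptions made for illustration.

    library(tm)

    answers <- readLines("dataset1_answers.txt")  # raw data file into an R vector
    corpus  <- VCorpus(VectorSource(answers))     # one text document per answer

    corpus <- tm_map(corpus, content_transformer(tolower))      # lower case
    corpus <- tm_map(corpus, stripWhitespace)                   # strip white spaces
    corpus <- tm_map(corpus, removeNumbers)                     # remove numbers
    corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove stop words
    corpus <- tm_map(corpus, removePunctuation)                 # remove punctuation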
From the corpus, a sparse matrix is created using the DocumentTermMatrix function, in which the rows of the matrix indicate documents and the columns indicate features, i.e., words. Only unigrams are extracted from the text to form the matrix. Each cell in the matrix contains a value that represents the number of times the word in the column heading appears in the document specified in the row. The last step is to convert the sparse matrix cell values from numeric type into categorical values. This is required because the number of times a word appeared in a sentence does not relate to the score obtained; rather, it is the presence of the word in the sentence that is more relevant in scoring. A custom function is written to convert the numeric values into "Yes" or "No" categorical values. A cell with a value equal to one or more is replaced with "Yes"; otherwise it is replaced with "No". The custom function is called for each cell in the sparse matrix. Post execution of the function on all cells in the sparse matrix, all the cells only have factor levels representing "Yes" or "No". The matrices thus obtained are ready for building models.
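A sketch of this step, continuing from the corpus built above; the function and variable names are this sketch's own:

    dtm <- DocumentTermMatrix(corpus)  # rows = documents (answers), columns = unigrams
    m   <- as.matrix(dtm)

    # Custom function: a count of one or more becomes "Yes", otherwise "No".
    to_categorical <- function(x) ifelse(x >= 1, "Yes", "No")
    m_cat <- as.data.frame(apply(m, 2, to_categorical),  # applied to every cell
                           stringsAsFactors = TRUE)      # factors for model building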
2.5. High dimensionality issue with datasets
One significant problem observed with the matrices obtained from the pre-processing is high dimensionality. It is observed that each dataset had at least 1779 dimensions even when only unigrams are extracted from the text. This leads to a problem called the curse of dimensionality [10]. In order to avoid this problem, for the purpose of this research, the Chi-square attribute evaluation filter [11] is used. Applying this filter effectively decreased the number of features in each of the matrices by at least 95%, which makes it possible to build models in the environment where the experiment is done and to carry out further research on the data sets. Figure 1 shows the comparison between the number of features in each dataset pre and post dimensionality reduction with the Chi-square attribute filter.
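The paper performs this reduction in R but does not name the library; the hedged sketch below uses the FSelector package as one plausible way to apply Chi-square attribute weighting, with the score vector name and the cutoff k being illustrative assumptions.

    library(FSelector)

    # 'scores' is assumed to hold the resolved human scores for this dataset;
    # it is attached as the class column before weighting the features.
    d <- cbind(m_cat, score = factor(scores))
    weights  <- chi.squared(score ~ ., data = d)  # chi-square weight per feature
    selected <- cutoff.k(weights, k = 75)         # k per dataset; 75 is illustrative
    d_reduced <- d[, c(selected, "score")]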
2.6. Model building
The Decision tree, Neural network and Support vector machine with linear kernel machine learning algorithms are individually used to build models and compute measurements across the performance evaluation metrics; these serve as the baselines in this research. All three machine learning algorithms are then used to perform stacking ensembling in order to build models and obtain measurements across the same performance metrics. The measurements recorded from stacking ensembling are compared to the baseline measurements in order to derive conclusions for the research.
Stacking requires defining two different classes of algorithms, namely base classifiers and a meta classifier. Each algorithm defined as a base classifier builds its own independent model from the training samples and predicts classes for the test samples. The algorithm defined as the meta classifier consolidates the predictions made by the base classifiers and arrives at the final class predictions for the test samples. The meta classifier is, however, not mandated to go with the consolidated approach; it can make predictions of its own which can override the predictions made by the base classifiers. With three different algorithms in this research, three stacking models are built for each dataset, one with each algorithm as the meta classifier, in order to arrive at measurements for comparison purposes.
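A Weka command-line invocation of this configuration might look as follows. This is a sketch: the paper does not state which Weka classifier implementations were used, so J48, MultilayerPerceptron and SMO are assumed as Weka's usual representatives of the three algorithm families (SMO's default polynomial kernel of exponent 1 acts as the linear kernel), and the file name is an assumption.

    java -cp weka.jar weka.classifiers.meta.Stacking \
         -B "weka.classifiers.trees.J48" \
         -B "weka.classifiers.functions.MultilayerPerceptron" \
         -B "weka.classifiers.functions.SMO" \
         -M "weka.classifiers.functions.SMO" \
         -t dataset1.arff -x 10

Here each -B option registers a base classifier, -M sets the meta classifier, and -x 10 requests the 10-fold cross-validation evaluation used throughout this paper.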
2.7. Model performance evaluation metrics and measurements
Due to the small number of samples available in each dataset, the 10-fold cross-validation approach is chosen for this research over partitioning the datasets into training and test sets. Four performance metrics are chosen to assess the models' future performance on unseen data. The metrics are Accuracy [12], Cohen's kappa (kappa) [13], F Score [14] and Mean Absolute Error (MAE) [15].
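As a hedged sketch of how these four metrics can be computed in R from cross-validated predictions (the exact definitions used by Weka and by [14] may differ in detail, e.g., in how the F Score is averaged):

    # 'pred' and 'actual' are factors sharing the same levels (the rubric scores).
    evaluate <- function(pred, actual) {
      tab <- table(pred, actual)                     # confusion matrix
      n   <- sum(tab)
      acc <- sum(diag(tab)) / n                      # Accuracy (observed agreement)
      pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
      kap <- (acc - pe) / (1 - pe)                   # Cohen's kappa
      prec <- diag(tab) / pmax(rowSums(tab), 1)      # per-class precision
      rec  <- diag(tab) / pmax(colSums(tab), 1)      # per-class recall
      f1   <- ifelse(prec + rec > 0, 2 * prec * rec / (prec + rec), 0)
      fsc  <- mean(f1)                               # macro-averaged F Score
      mae  <- mean(abs(as.numeric(as.character(pred)) -
                       as.numeric(as.character(actual))))  # MAE on ordinal scores
      c(Accuracy = 100 * acc, Kappa = kap, FScore = fsc, MAE = mae)
    }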
III. MEASUREMENTS
The various measurements obtained with the models built during the experiment are described in this section.
3.1. Baseline measurements
Using the datasets obtained after pre-processing and application of the Chi-square filter, models are built using Decision tree, Neural network, and Support vector machine with linear kernel. From these models and through 10-fold cross-validation, measurements across the Accuracy, MAE, F Score and kappa performance metrics are recorded. These baseline measurements are shown in Table 3.
3.2. Measurements from stacking models
Using the pre-processed and dimensionality-reduced datasets, models are built through stacking, with all three algorithms as base classifiers and with each of the algorithms in turn as the meta classifier. Table 4 shows the performance measurements recorded with the stacking models.
3.3. Rank computation and rank consolidation
The performance measurements obtained need to be compared so as to arrive at meaningful conclusions. For this purpose, ranks need to be assigned to the measurements. Except for the MAE metric, for all other metrics, the higher the metric value recorded, the better the model is in terms of performance. For the MAE metric, the lower the measurement, the better the model is. With these rank computation principles in place, MS Excel's RANK function is used to compare the measurements obtained across each dataset and by metric. The output of the RANK function is a rank assigned to each measurement in comparison with the other measurements within the metric category and by dataset. Next, the ranks obtained are summed by machine learning technique across the datasets. This step gives the rank sum for each machine learning technique used in this research. The closer the rank sum is to 0, the better the performance of the machine learning technique. MS Excel's RANK function is then used again to assign a final rank to each machine learning technique based on its rank sum. Again, the closer the assigned rank is to 0, the better the model in terms of performance. Figure 2 represents the final ranks assigned to each of the machine learning techniques considered in this research.
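The same consolidation can be expressed in R; a minimal sketch, assuming one techniques-by-datasets matrix of measurements per metric (negate the MAE measurements first, since lower MAE is better):

    # 'perf' is assumed: rows = techniques, columns = datasets, for one metric.
    ranks    <- apply(perf, 2, function(x) rank(-x, ties.method = "min"))
    rank_sum <- rowSums(ranks)                       # sum ranks across datasets
    final    <- rank(rank_sum, ties.method = "min")  # lowest rank sum ranks first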

3.4. Confirming the better technique through averages
Another perspective from which the better machine learning technique for this problem can be identified is by taking the average of each performance metric across datasets. Except for the MAE metric, for all other metrics, the highest average value marks the best performer. For MAE, the lowest average value marks the best performer. Table 5 shows the averages obtained across the performance metrics that help identify the best performing technique.
IV. RESULTS DISCUSSION AND CONCLUSIONS
The primary focus and sole objective of this research is to verify whether the stacking ensembling technique would be any better than the individual classifiers for the automated grading of descriptive answers problem.
From the average performance measurements and ranks, it is evident that stacking the Decision tree, Neural network and SVM with linear kernel classifiers and using the SVM with linear kernel as the meta classifier is the best technique among all the techniques compared in this research. It can be seen that the technique confirmed as the best also triumphed in each of the average performance metrics.
When compared to the average performance metrics recorded for the Decision tree, Stacking with the SVM using linear kernel meta classifier is 4.3% better in terms of accuracy, 0.04 better in terms of Kappa, 0.10 better in terms of F Score and 0.20 lower in terms of MAE. Though the stacking approach does not yield a significant improvement in terms of accuracy, it can be observed that the best classifier technique predicts the classes with better confidence while also attempting to minimize the errors. It can also be observed from the MAE reduction that even when a wrong prediction is made by the classifier, the distance between the actual class and the predicted class is minimized to a large extent, therefore making the classifier more reliable.

TABLE 1. SUMMARY OF SCORES FIELD IN THE DATASETS

Dataset             1       2       3       4       5       6       7       8       9       10
Mean                1.492   1.730   0.9947  0.691   0.2864  0.2532  0.7148  1.127   1.105   1.177
Number of samples   1672    1278    1891    1738    1795    1797    1799    1799    1798    1640

TABLE 2. PERCENTAGES OF SCORES DISTRIBUTION IN THE DATASETS

Dataset   Score 0   Score 1   Score 2   Score 3
1         22.72     25.65     31.33     20.27
2         13.14     25.50     36.54     24.80
3         23.84     52.82     23.32     Not Applicable
4         38.49     53.91     7.50      Not Applicable
5         77.49     18.27     2.33      1.89
6         84.30     8.90      3.95      2.83
7         51.80     24.90     23.29     Not Applicable
8         30.51     26.29     43.19     Not Applicable
9         24.13     41.26     34.59     Not Applicable
10        17.68     46.95     35.36     Not Applicable

FIGURE 1. COMPARISON OF ORIGINAL DIMENSIONS AND REDUCED DIMENSIONS BY CHI-SQUARE FILTER


TABLE 3. BASELINE MEASUREMENTS ACROSS PERFORMANCE METRICS

A) Performance metrics with Decision tree models

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    57         0.42    0.57      0.51
Dataset 2    43         0.20    0.41      0.70
Dataset 3    52         0.11    0.39      0.51
Dataset 4    72         0.47    0.60      0.28
Dataset 5    85         0.55    0.58      0.17
Dataset 6    88         0.56    0.59      0.15
Dataset 7    67         0.45    0.63      0.43
Dataset 8    62         0.41    0.60      0.49
Dataset 9    67         0.49    0.67      0.35
Dataset 10   74         0.58    0.70      0.28

B) Performance metrics with Neural network models

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    61         0.48    0.62      0.44
Dataset 2    52         0.32    0.50      0.54
Dataset 3    54         0.08    0.34      0.46
Dataset 4    74         0.53    0.70      0.26
Dataset 5    85         0.54    0.48      0.17
Dataset 6    89         0.58    0.59      0.14
Dataset 7    69         0.49    0.65      0.39
Dataset 8    65         0.45    0.62      0.46
Dataset 9    69         0.52    0.69      0.32
Dataset 10   74         0.58    0.71      0.28

C) Performance metrics with SVM with linear kernel models

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    60         0.46    0.60      0.46
Dataset 2    51         0.31    0.50      0.57
Dataset 3    54         0.07    0.33      0.47
Dataset 4    73         0.51    0.68      0.27
Dataset 5    84         0.54    0.55      0.18
Dataset 6    89         0.56    0.62      0.14
Dataset 7    69         0.49    0.65      0.39
Dataset 8    66         0.48    0.63      0.44
Dataset 9    69         0.53    0.70      0.32
Dataset 10   74         0.58    0.71      0.29

TABLE 4. PERFORMANCE MEASUREMENTS FROM STACKING MODELS

A) Performance metrics with Decision tree as meta classifier

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    56         0.41    0.56      0.26
Dataset 2    49         0.27    0.49      0.32
Dataset 3    53         0.06    0.43      0.40
Dataset 4    74         0.52    0.74      0.26
Dataset 5    84         0.53    0.83      0.12
Dataset 6    87         0.48    0.86      0.08
Dataset 7    67         0.46    0.67      0.29
Dataset 8    66         0.47    0.64      0.31
Dataset 9    68         0.51    0.68      0.29
Dataset 10   74         0.58    0.74      0.25

B) Performance metrics with Neural network as meta classifier

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    61         0.47    0.60      0.26
Dataset 2    44         0.20    0.43      0.32
Dataset 3    53         0.03    0.41      0.40
Dataset 4    71         0.45    0.68      0.27
Dataset 5    84         0.54    0.82      0.12
Dataset 6    88         0.54    0.87      0.08
Dataset 7    69         0.49    0.68      0.29
Dataset 8    66         0.46    0.64      0.31
Dataset 9    69         0.53    0.69      0.30
Dataset 10   74         0.58    0.74      0.25

C) Performance metrics with SVM with linear kernel as meta classifier

Dataset      Accuracy   Kappa   F Score   MAE
Dataset 1    60         0.46    0.60      0.20
Dataset 2    59         0.31    0.51      0.24
Dataset 3    63         0.07    0.43      0.30
Dataset 4    74         0.53    0.74      0.17
Dataset 5    85         0.57    0.84      0.08
Dataset 6    89         0.56    0.88      0.06
Dataset 7    69         0.49    0.69      0.20
Dataset 8    67         0.50    0.67      0.22
Dataset 9    70         0.53    0.70      0.20
Dataset 10   74         0.57    0.74      0.18

TABLE 5. AVERAGES OF PERFORMANCE MEASUREMENTS ACROSS DATASETS

Machine Learning Technique used for Classification      Accuracy   Kappa   F Score   MAE
Decision tree                                           66.70      0.42    0.58      0.39
Neural network                                          69.20      0.46    0.59      0.35
SVM with Linear Kernel                                  68.90      0.45    0.60      0.35
Stacking with Decision tree meta classifier             67.80      0.43    0.66      0.26
Stacking with Neural network meta classifier            67.90      0.43    0.66      0.26
Stacking with SVM using linear kernel meta classifier   71.00      0.46    0.68      0.19

FIGURE 2. FINAL RANKS ASSIGNED TO VARIOUS MACHINE LEARNING TECHNIQUES CONSIDERED IN THIS RESEARCH

REFERENCES
[1] Syed M. Fahad Latifi et al., "Towards Automated Scoring using Open-Source Technologies," Paper presented at the 2013 Annual Meeting of the Canadian Society for the Study of Education, Victoria, British Columbia, 2013, pp. 13-14.
[2] David H. Wolpert, "Stacked generalization," Neural Networks, 5:241-259, 1992.
[3] Raymond J. Mooney, "Machine Learning: Ensembles," University of Texas at Austin, www.cs.utexas.edu/~mooney/cs391L/slides/ensembles.pdf, Accessed: 26 December 2014.
[4] The Hewlett Foundation: Automated Essay Scoring, http://www.kaggle.com/c/asap-aes, Accessed: 17 September 2014.
[5] Code for evaluation metric and benchmarks, https://www.kaggle.com/c/asap-aes/data?Training_Materials.zip, 10 February 2012, Accessed: 17 September 2014.
[6] The R Project for Statistical Computing, http://www.r-project.org/, Accessed: 17 September 2014.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, Volume 11, Issue 1, 2009.
[8] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[9] I. Feinerer, K. Hornik, and D. Meyer, "Text mining infrastructure in R," Journal of Statistical Software, 25(5):1-54, March 2008. ISSN 1548-7660. http://www.jstatsoft.org/v25/i05, Accessed: 17 September 2014.
[10] Bob Durrant, "Mitigating the Curse of Dimensionality in Machine Learning," www.cercia.com/projects/student_projects/Posters_07/BobDurrant.pdf, Accessed: 24 December 2014.
[11] Jasmina Novaković, Perica Štrbac, Dušan Bulatović, "Toward Optimal Feature Selection using Ranking Methods and Classification Algorithms," Yugoslav Journal of Operations Research, 21 (2011), Number 1, 119-135.
[12] "Evaluation of Classifiers' Performance," Machine Learning Corner, http://mlcorner.wordpress.com/2013/04/30/evaluation-of-classifiers-performance/, Accessed: 30 November 2014.
[13] "What is Kappa coefficient (Cohen's Kappa)," http://www.pmean.com/definitions/kappa.htm, Accessed: 30 November 2014.
[14] "Mean F-Score," Kaggle, http://www.kaggle.com/wiki/MeanFScore, Accessed: 30 November 2014.
[15] Cort J. Willmott, Kenji Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Research, Vol. 30: 79-82, 2005.
