R.J.RamaSree
I. INTRODUCTION
Automated grading of descriptive answers can be viewed as
a text classification problem. Supervised learning methods can
be applied to classify the answers into appropriate ordinal
rubric based on the likelihood suggested by training samples.
The supervised learning process requires extracting various
text features from the documents that make up the training set
and then training a sophisticated machine learning algorithm on them.
Previous research inferred that Decision Tree, Neural
Network and Support Vector Machine are individually the best-performing
machine learning algorithms [1] for the problem of
automated grading of descriptive answers. For the research
covered in this paper, an exploratory study is carried out to
verify whether uniting the power of all three machine learning
algorithms and collectively classifying the answers performs
better than classification by the individual algorithms.
Uniting the power of multiple machine learning algorithms
is possible via Stacking ensemble learning otherwise called
stacking ensembling. In Stacking, multiple classifiers that
IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO)2015
need to comply with the resolved human score in the training
examples.
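The stacking setup the paper describes, three base classifiers whose predictions feed a meta classifier, can be sketched with scikit-learn's StackingClassifier. The paper itself performs stacking in Weka, so the classes, hyper-parameters, and synthetic data below are illustrative assumptions, not the paper's configuration.

```python
# Illustrative stacking sketch; the paper uses Weka, so everything
# here (parameters, synthetic data) is an assumption for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Three base classifiers, mirroring the paper's choice of algorithms.
base_classifiers = [
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nn", MLPClassifier(max_iter=1000, random_state=0)),
    ("svm", SVC(kernel="linear", random_state=0)),
]
# The paper tries each of the three algorithms in turn as meta
# classifier; a decision tree is shown here as one such variant.
meta_classifier = DecisionTreeClassifier(random_state=0)

stack = StackingClassifier(estimators=base_classifiers,
                           final_estimator=meta_classifier, cv=5)

# Synthetic stand-in for an answer dataset scored on a 0-3 rubric.
X, y = make_classification(n_samples=200, n_classes=4, n_informative=8,
                           random_state=0)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.2f}")
```

The meta classifier is trained on cross-validated predictions of the base classifiers (`cv=5` here), which is the standard way to avoid leaking the base models' training fit into the meta level.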
The data used for training and validation of the models are
answers written by students for 10 different questions. Data for
a question are considered one unique dataset, so we have a
total of 10 datasets. The questions that students are asked to
respond to come from diversified subjects, ranging from
Chemistry to English Language Arts to Biology. Table 1 and
Table 2 detail the number of samples and the distribution of
scores across the ordinal rubrics possible for each dataset.
2.3. Hardware and software used for the research
All experiments were executed on a Dell Latitude E5430
laptop. The laptop is configured with an Intel Core i5-3350M
CPU @ 2.70 GHz and 4 GB RAM; however, the Weka
workbench is configured to use a maximum of 1 GB. The
laptop runs the Windows 7 64-bit operating system.
For the purpose of designing and evaluating the
experiments, R statistical language and Weka machine learning
workbench are used. R is used for all pre-processing tasks,
including dimensionality reduction, whereas Weka is used
for building the models from the pre-processed data and for
obtaining the model performance statistics.
R statistical language is a GNU project that offers a
language and environment for statistical computing and
graphics. R provides a wide variety of statistical and graphical
techniques such as linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering,
etc. R is highly extensible as well [6].
A machine learning workbench called Weka is used for the
experiments. Weka stands for Waikato Environment for
Knowledge Analysis, and it is a free offering from the University
of Waikato, New Zealand. This workbench has a user-friendly
interface and it incorporates numerous options to develop and
evaluate machine learning models [7] [8]. These models can be
utilized for a variety of purposes, including automated essay
scoring.
2.4. Data pre-processing
To be able to build models required for the research, the
raw datasets that consist of the answers and scores need to be
transformed into a format that can be used by Weka. The steps
involved in transforming the raw data into a usable form are
termed pre-processing. All required pre-processing for
experimentation is done using an R language text mining
library called tm [9].
The first step is to read the raw data file into an R vector.
Post reading, the vector is used to create a corpus, which refers
to a collection of text documents. Each text document in this
case refers to an answer in the raw data file. The corpus was
then passed through a series of R functions, again part of the
tm package, to change the case to lower case, strip white
spaces, remove numbers, and remove stop words and
punctuation.
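The cleaning steps above can be sketched outside of R as well. The Python function below mirrors the tm pipeline (lower-casing, number and punctuation removal, stop-word removal, white-space collapsing); the stop-word list is a tiny, hypothetical subset of tm's much larger built-in English list.

```python
import re
import string

# A tiny, hypothetical stop-word list for illustration only; tm's
# English stop-word list is far larger.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def clean_answer(text: str) -> str:
    """Lower-case, remove numbers and punctuation, drop stop words,
    and collapse white space, mirroring the tm cleaning steps."""
    text = text.lower()                                   # change case to lower
    text = re.sub(r"\d+", "", text)                       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)                               # strip extra white space

print(clean_answer("The 2 molecules bind, forming an enzyme-substrate complex."))
# -> "molecules bind forming enzymesubstrate complex"
```

Note that order matters: removing punctuation before tokenizing joins hyphenated words, just as tm's removePunctuation does by default.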
From the corpus, a sparse matrix is created using the
DocumentTermMatrix function, in which the rows of the matrix
indicate documents and the columns indicate features, i.e.,
words. Only unigrams are extracted from the text to form the
classifier in order to arrive at measurements for comparison
purposes.
2.7. Model performance evaluation metrics and measurements
Due to the small number of samples available in each
dataset, the 10-fold cross-validation approach is chosen for this
research over partitioning the datasets into training and test
sets. Four performance metrics are chosen to assess the
models' future performance on unseen data. The metrics are
Accuracy [12], Cohen's kappa (kappa) [13], F Score [14] and
Mean Absolute Error (MAE) [15].
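All four metrics are available in standard libraries. The paper computes them in Weka; the snippet below is a hedged illustration with scikit-learn, evaluating invented model predictions against invented human scores on a 0-3 rubric. The weighted averaging for the multi-class F score is an assumption for this sketch.

```python
# The paper records these metrics in Weka; here scikit-learn versions
# are shown on invented human scores / predictions for a 0-3 rubric.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_absolute_error)

y_true = [0, 1, 2, 3, 2, 1, 0, 3, 2, 1]  # resolved human scores
y_pred = [0, 1, 2, 2, 2, 1, 1, 3, 2, 0]  # model predictions

print(accuracy_score(y_true, y_pred))                # exact-match rate
print(cohen_kappa_score(y_true, y_pred))             # agreement beyond chance
print(f1_score(y_true, y_pred, average="weighted"))  # multi-class F score
print(mean_absolute_error(y_true, y_pred))           # mean rubric-point error
```

MAE is the one metric here where lower is better, since it measures how many rubric points a prediction is off by on average.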
III. MEASUREMENTS
Various measurements obtained with the models built during
the experiment are described in this section.
3.1. Baseline measurements
Using the datasets obtained after pre-processing and
application of the Chi-square filter, models are built using
Decision tree, Neural network, and Support vector machine
with a linear kernel. From these models, measurements for the
Accuracy, MAE, F Score and kappa performance metrics are
recorded through 10-fold cross-validation. These baseline
measurements are shown in Table 3.
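This baseline procedure can be sketched as follows; the paper runs it in Weka on the answer datasets, so the scikit-learn classes, hyper-parameters, and synthetic data below are illustrative assumptions.

```python
# Sketch of the baseline evaluation: each algorithm is scored with
# 10-fold cross-validation. Synthetic data stands in for one
# pre-processed, dimensionality-reduced answer dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=4, n_informative=8,
                           random_state=0)
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Neural network": MLPClassifier(max_iter=1000, random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear", random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```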
3.2. Measurements from stacking models
Using the pre-processed and dimensionality-reduced
datasets, models are built by stacking all three algorithms
as base classifiers, with each of the algorithms in turn serving
as the meta classifier. Table 4 shows the performance metric
measurements recorded for the stacking models.
3.3. Rank computation and rank consolidation
The performance measurements obtained need to be
compared so as to arrive at meaningful conclusions. For this
purpose, ranks are assigned to the measurements. Except for
the MAE metric, the higher the recorded metric value, the
better the model's performance; for MAE, the lower the
measurement, the better the model. With these rank
computation principles in place, MS Excel's RANK function is
used to compare the measurements obtained across each
dataset and by metric. The output of the RANK function is a
rank assigned to each measurement in comparison with the
other measurements within the same metric category and
dataset. The ranks obtained are then summed by machine
learning technique across the datasets, giving the rank sum for
each technique used in this research. The closer the rank sum
is to 0, the better the performance of the technique. MS Excel's
RANK function is then used again to assign a final rank to
each machine learning technique based on its rank sum; again,
the closer the assigned rank is to 0, the better the model's
performance. Figure 2 presents the final ranks assigned to each
of the machine learning techniques considered in this research.
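The rank-sum procedure above can be sketched in a few lines; the paper performs it with MS Excel's RANK function, so the code, techniques, datasets, and measurement values below are invented for illustration.

```python
# Sketch of the rank computation done with MS Excel's RANK function in
# the paper. Values are invented; ties are broken arbitrarily here,
# whereas Excel's RANK gives tied values equal ranks.
from collections import defaultdict

# measurements[dataset][metric][technique] = recorded value
measurements = {
    "dataset 1": {"Accuracy": {"decision tree": 57, "stacking": 60},
                  "MAE":      {"decision tree": 0.51, "stacking": 0.20}},
    "dataset 2": {"Accuracy": {"decision tree": 43, "stacking": 59},
                  "MAE":      {"decision tree": 0.70, "stacking": 0.24}},
}
LOWER_IS_BETTER = {"MAE"}  # for every other metric, higher is better

rank_sums = defaultdict(int)
for per_metric in measurements.values():
    for metric, scores in per_metric.items():
        best_first = sorted(scores, key=scores.get,
                            reverse=metric not in LOWER_IS_BETTER)
        for rank, technique in enumerate(best_first, start=1):
            rank_sums[technique] += rank  # rank 1 = best within metric/dataset

# The technique with the smallest rank sum receives the best final rank.
final_order = sorted(rank_sums, key=rank_sums.get)
print(dict(rank_sums))
print(final_order)
```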
[TABLE 1. NUMBER OF SAMPLES AND SCORE STATISTICS (MINIMUM, 1ST QUARTILE, MEDIAN, MEAN, 3RD QUARTILE, MAXIMUM) FOR EACH DATASET — the per-dataset values are garbled in the source extraction and are not recoverable.]
[TABLE 2. DISTRIBUTION OF SCORES (%) ACROSS THE POSSIBLE ORDINAL RUBRICS, AND ORIGINAL VS. REDUCED FEATURE DIMENSIONALITY, FOR EACH DATASET — the per-dataset values are garbled in the source extraction and are not recoverable.]
TABLE 3. BASELINE MEASUREMENTS (10-FOLD CROSS-VALIDATION)

(a) Decision tree
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     57         0.42    0.57      0.51
Dataset 2     43         0.20    0.41      0.70
Dataset 3     52         0.11    0.39      0.51
Dataset 4     72         0.47    0.60      0.28
Dataset 5     85         0.55    0.58      0.17
Dataset 6     88         0.56    0.59      0.15
Dataset 7     67         0.45    0.63      0.43
Dataset 8     62         0.41    0.60      0.49
Dataset 9     67         0.49    0.67      0.35
Dataset 10    74         0.58    0.70      0.28

(b) Neural network
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     61         0.48    0.62      0.44
Dataset 2     52         0.32    0.50      0.54
Dataset 3     54         0.08    0.34      0.46
Dataset 4     74         0.53    0.70      0.26
Dataset 5     85         0.54    0.48      0.17
Dataset 6     89         0.58    0.59      0.14
Dataset 7     69         0.49    0.65      0.39
Dataset 8     65         0.45    0.62      0.46
Dataset 9     69         0.52    0.69      0.32
Dataset 10    74         0.58    0.71      0.28

(c) Support vector machine (linear kernel)
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     60         0.46    0.60      0.46
Dataset 2     51         0.31    0.50      0.57
Dataset 3     54         0.07    0.33      0.47
Dataset 4     73         0.51    0.68      0.27
Dataset 5     84         0.54    0.55      0.18
Dataset 6     89         0.56    0.62      0.14
Dataset 7     69         0.49    0.65      0.39
Dataset 8     66         0.48    0.63      0.44
Dataset 9     69         0.53    0.70      0.32
Dataset 10    74         0.58    0.71      0.29

TABLE 4. MEASUREMENTS FROM STACKING MODELS (ALL THREE ALGORITHMS AS BASE CLASSIFIERS; ONE MODEL PER CHOICE OF META CLASSIFIER)

(a) Stacking model 1
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     56         0.41    0.56      0.26
Dataset 2     49         0.27    0.49      0.32
Dataset 3     53         0.06    0.43      0.40
Dataset 4     74         0.52    0.74      0.26
Dataset 5     84         0.53    0.83      0.12
Dataset 6     87         0.48    0.86      0.08
Dataset 7     67         0.46    0.67      0.29
Dataset 8     66         0.47    0.64      0.31
Dataset 9     68         0.51    0.68      0.29
Dataset 10    74         0.58    0.74      0.25

(b) Stacking model 2
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     61         0.47    0.60      0.26
Dataset 2     44         0.20    0.43      0.32
Dataset 3     53         0.03    0.41      0.40
Dataset 4     71         0.45    0.68      0.27
Dataset 5     84         0.54    0.82      0.12
Dataset 6     88         0.54    0.87      0.08
Dataset 7     69         0.49    0.68      0.29
Dataset 8     66         0.46    0.64      0.31
Dataset 9     69         0.53    0.69      0.30
Dataset 10    74         0.58    0.74      0.25

(c) Stacking model 3
Dataset       Accuracy   Kappa   F Score   MAE
Dataset 1     60         0.46    0.60      0.20
Dataset 2     59         0.31    0.51      0.24
Dataset 3     63         0.07    0.43      0.30
Dataset 4     74         0.53    0.74      0.17
Dataset 5     85         0.57    0.84      0.08
Dataset 6     89         0.56    0.88      0.06
Dataset 7     69         0.49    0.69      0.20
Dataset 8     67         0.50    0.67      0.22
Dataset 9     70         0.53    0.70      0.20
Dataset 10    74         0.57    0.74      0.18

MEAN MEASUREMENTS ACROSS THE TEN DATASETS

Technique                 Accuracy   Kappa   F Score   MAE
Decision tree             66.70      0.42    0.58      0.39
Neural network            69.20      0.46    0.59      0.35
Support vector machine    68.90      0.45    0.60      0.35
Stacking model 1          67.80      0.43    0.66      0.26
Stacking model 2          67.90      0.43    0.66      0.26
Stacking model 3          71.00      0.46    0.68      0.19
FIGURE 2. FINAL RANKS ASSIGNED TO VARIOUS MACHINE LEARNING TECHNIQUES CONSIDERED IN THIS RESEARCH
REFERENCES
[1] Syed M. Fahad Latifi et al., Towards Automated Scoring using Open-Source Technologies, Paper presented at the 2013 Annual Meeting of the Canadian Society for the Study of Education, Victoria, British Columbia, 2013, pp. 13-14.
[2] David H. Wolpert (1992). Stacked generalization. Neural Networks, 5: 241-259.
[3] Raymond J. Mooney, Machine Learning: Ensembles, University of Texas at Austin, www.cs.utexas.edu/~mooney/cs391L/slides/ensembles.pdf, Accessed: 26 December 2014.
[4] The Hewlett Foundation: Automated Essay Scoring, http://www.kaggle.com/c/asap-aes, Accessed: 17 September 2014.
[5] Code for evaluation metric and benchmarks, https://www.kaggle.com/c/asap-aes/data?Training_Materials.zip, 10 February 2012, Accessed: 17 September 2014.
[6] The R Project for Statistical Computing, http://www.r-project.org/, Accessed: 17 September 2014.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
[8] Ian H. Witten and Eibe Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
[9] I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5): 1-54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05, Accessed: 17 September 2014.
[10] Bob Durrant, Mitigating the Curse of Dimensionality in Machine Learning, www.cercia.com/projects/student_projects/Posters_07/BobDurrant.pdf, Accessed: 24 December 2014.
[11] Jasmina Novaković, Perica Strbac, Dusan Bulatović, Toward Optimal Feature Selection using Ranking Methods and Classification Algorithms, Yugoslav Journal of Operations Research, 21 (2011), Number 1, 119-135.
[12] Evaluation of Classifiers' Performance, Machine Learning Corner, http://mlcorner.wordpress.com/2013/04/30/evaluation-of-classifiers-performance/, Accessed: 30 November 2014.
[13] What is Kappa coefficient (Cohen's Kappa), http://www.pmean.com/definitions/kappa.htm, Accessed: 30 November 2014.
[14] Mean F-Score, Kaggle, http://www.kaggle.com/wiki/MeanFScore, Accessed: 30 November 2014.
[15] Cort J. Willmott, Kenji Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Climate Research, Vol. 30: 79-82, 2005.