Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 4, July August 2013 ISSN 2278-6856
1 Assistant Professor, Department of Master of Computer Applications, Rao & Naidu Engineering College, Ongole, Andhra Pradesh, India
2 Associate Professor, Department of Computer Science, Sri Padmavati Mahila Viswavidyalayam (Women's University), Tirupati, Andhra Pradesh, India
1. Introduction

Tumor is an abnormal growth of cells that can be either benign or malignant. Benign tumors are non-invasive, while malignant tumors are cancerous and spread to other parts of the body. Early diagnosis and treatment help to prevent the spread of a tumor. Data mining is a convenient way of extracting patterns that represent knowledge implicitly stored in data sets, and it focuses on issues relating to feasibility, usefulness, effectiveness and scalability. Data are preprocessed by data cleaning, data integration, data selection and data transformation. Data mining functionalities include classification, association, correlation analysis, prediction, cluster analysis, etc. Classification is a fundamental task in data mining [1]. Classification groups similar data objects together; it is a supervised learning task, since it assigns class labels to data objects based on the relationship between the data items and a pre-defined class label. Classification algorithms have a wide range of applications such as churn prediction, fraud detection, artificial intelligence, credit card rating, etc. [2]-[4]. Many classification algorithms are available in the literature, but the decision tree is the most commonly used because it is easier to implement and to understand than other classification algorithms. Decision tree classifiers are used extensively for the diagnosis of breast tumors in ultrasonic images, ovarian cancer, heart sound diagnosis, etc. [5]-[10]. A decision tree classification algorithm can be implemented in a serial or parallel fashion, depending on the volume of data, the memory space available and the scalability of the algorithm. Decision trees play a vital role in the field of medical diagnosis. In this paper, the accuracy and time complexity of various decision tree classifiers are compared on tumor data sets. The rest of the paper is organized as follows. Section 2 presents the theory and a review of decision tree induction algorithms, an overview of related work and an introduction to the data sets. Section 3 presents the experimental results and a comparison of the most frequently used decision tree classifiers, and Section 4 concludes the paper.
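The paper's experiments use the Weka tool; purely as an illustration of the supervised classification workflow described above, the same idea can be sketched in Python with scikit-learn (an assumption of this sketch, not the tool used in the paper) on a public benign/malignant tumor data set:

```python
# Illustrative sketch only: a decision tree classifier learning from
# labeled tumor data, then predicting class labels on unseen instances.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each row is described by numeric attributes and carries a
# pre-defined class label (benign / malignant).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0)  # CART-style tree
clf.fit(X_train, y_train)                     # supervised learning step

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The fitted tree assigns a class label to every test instance; accuracy is the fraction of correctly classified instances, the same measure the paper reports.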
2. Background
2.1 Overview of Related Work

Classification is one of the most fundamental and important tasks in data mining and machine learning. Many researchers have performed experiments on medical data sets using decision tree classifiers; a few are summarized here:
TABLE 1 - Decision tree algorithms reported in related work
CLS, ID3, ID3+, C4.5, C5.0, CART, Random Tree, Random Forest, SLIQ
By observing the above table, the three most frequently used decision tree algorithms are ID3, C4.5 and CART. Hence, the experiments are conducted on these three algorithms.

ID3: The ID3 algorithm is a very simple decision tree algorithm developed by Quinlan in 1986 [20]. ID3 uses information gain as its splitting criterion. Growing stops when all instances belong to a single value of the target feature, or when the best information gain is not greater than zero. ID3 does not apply any pruning procedure, nor does it handle numeric attributes or missing values; it accepts only categorical attributes when building the tree. It also does not tolerate noisy data, so a preprocessing technique must be used to remove noise, and discretization must be applied to convert continuous attributes into categorical ones.

C4.5: The C4.5 algorithm, developed by Quinlan in 1993 [21], is an improvement on ID3. It is based on Hunt's algorithm and, like ID3, is serially implemented. Pruning takes place in C4.5 by replacing an internal node with a leaf node, thereby reducing the error rate. Unlike ID3, C4.5 accepts both continuous and categorical attributes when building the decision tree. It has an enhanced tree-pruning method that reduces misclassification errors due to noise or excessive detail in the training data set. Like ID3, the data is sorted at every node of the tree in order to determine the best splitting attribute. C4.5 uses the gain ratio as its attribute selection measure: the attribute with the highest gain ratio becomes the root node. C4.5 applies pessimistic pruning to delete unnecessary branches of the decision tree, which increases accuracy.

CART: CART (Classification And Regression Trees) was introduced by Breiman in 1984 [22]. It builds both classification and regression trees. It is also based on Hunt's model of decision tree construction and can be implemented serially. It uses the Gini index as its splitting measure for selecting the splitting attribute. Pruning in CART is done using a portion of the training data set. CART handles both numeric and categorical attributes when building the decision tree.
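The three attribute-selection measures named above can be sketched directly. The toy sample below (a made-up "tumor size" attribute, not from the paper's data sets) shows information gain (ID3), gain ratio (C4.5) and the Gini index (CART) computed for one candidate split:

```python
# Sketch of the three splitting criteria on an assumed toy sample.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows):
    """Group class labels by the value of the (single) attribute."""
    groups = {}
    for value, label in rows:
        groups.setdefault(value, []).append(label)
    return groups

def information_gain(rows):
    labels = [label for _, label in rows]
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g)
                    for g in partition(rows).values())
    return entropy(labels) - remainder

def gain_ratio(rows):
    n = len(rows)
    split_info = -sum(len(g) / n * log2(len(g) / n)
                      for g in partition(rows).values())
    return information_gain(rows) / split_info if split_info else 0.0

# Toy split on a categorical attribute "tumor size": (value, class).
rows = [("small", "benign"), ("small", "benign"),
        ("large", "malignant"), ("large", "malignant"),
        ("large", "benign")]

print(round(information_gain(rows), 3))              # ID3's criterion
print(round(gain_ratio(rows), 3))                    # C4.5's criterion
print(round(gini([label for _, label in rows]), 3))  # CART's impurity
```

Gain ratio divides information gain by the split information, which penalizes attributes with many distinct values; that is C4.5's main refinement over ID3's raw information gain.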
3. Experimental Results
The Primary-Tumor data set is collected from the UCI Machine Learning Repository [32] and the Colon-Tumor data set from the Bioinformatics Group Seville [33]; both are publicly available. The results were calculated and analyzed with the Weka tool using 10-fold cross validation to test the accuracy and time complexity of the ID3, C4.5 and CART algorithms. The following table shows the characteristics of the selected tumor data sets.

TABLE 2 - Characteristics of the data sets
Data set        No. of attributes   No. of classes   No. of instances   Missing values
Colon-Tumor     2001                2                62                 No
Primary-Tumor   18                  2                339                Yes
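The 10-fold cross-validation protocol used in the experiments can be sketched as follows. This is an illustration in scikit-learn rather than Weka, and the data set here is a publicly available stand-in, not the Primary-Tumor or Colon-Tumor data:

```python
# Sketch of 10-fold cross validation of a decision tree classifier.
# Assumed stand-in data set; the paper's experiments ran in Weka.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split the data into 10 folds; each fold is held out once for testing
# while the tree is trained on the remaining nine.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=cv)

print("mean accuracy over 10 folds: %.2f%%" % (100 * scores.mean()))
```

Averaging accuracy over the ten held-out folds gives a less optimistic estimate than a single train/test split, which is why the paper reports cross-validated accuracy.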
If the selected data contains missing values or empty cell entries, it must be preprocessed: each missing value is replaced with the mean of the respective attribute. These data sets contain both continuous and discrete attributes, but ID3 does not support continuous attributes, so unsupervised discretization is applied to convert the continuous attributes into categorical ones. Table 3 shows the accuracy of the ID3, C4.5 and CART algorithms on the above data sets using 10-fold cross validation:

TABLE 3 - Correctly classified instances
Accuracy (%)
Data set        ID3     C4.5    CART
Primary-Tumor   34.22   40.12   39.82
Colon-Tumor     59.68   82.26   75.81

Figure 3.1: Comparison of Classifiers Accuracy

As the bar diagram shows, the C4.5 algorithm yields better accuracy than CART and ID3. Table 4 shows the time complexity, in seconds, of each classifier to build the model on the training data:

TABLE 4 - Execution time
Execution time (Sec)
Data set        ID3     C4.5    CART
Primary-Tumor   0.03    0.02    0.3
Colon-Tumor     6.84    0.31    1.72
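The two preprocessing steps described in this section, mean imputation of missing cells followed by unsupervised (equal-width) discretization of a continuous attribute, can be sketched with pandas. The column names and values below are made up for illustration:

```python
# Sketch of the preprocessing pipeline: mean imputation, then
# unsupervised equal-width discretization (assumed toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":        [52, 61, np.nan, 45, 70],
    "tumor_size": [1.2, np.nan, 3.4, 2.2, 5.0],  # continuous attribute
    "class":      ["benign", "benign", "malignant",
                   "benign", "malignant"],
})

# 1) Replace missing values with the mean of the respective attribute.
for col in ["age", "tumor_size"]:
    df[col] = df[col].fillna(df[col].mean())

# 2) Unsupervised equal-width discretization into 3 categorical bins,
#    so that an algorithm like ID3 can handle the attribute.
df["tumor_size_bin"] = pd.cut(df["tumor_size"], bins=3,
                              labels=["small", "medium", "large"])
print(df)
```

Equal-width binning is "unsupervised" because the bin edges ignore the class labels; supervised alternatives (e.g. entropy-based discretization) use the labels to place the cut points.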
The time complexity to build a decision tree model using the ID3, C4.5 and CART classifiers on the different tumor data sets is represented in the form of a line graph.
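The build-time comparison amounts to measuring wall-clock time around the model-fitting call. As a hedged sketch, scikit-learn's DecisionTreeClassifier stands in here for the Weka implementations of ID3, C4.5 and CART, with the entropy and Gini criteria standing in for the respective splitting measures:

```python
# Sketch: timing model construction for two splitting criteria.
# Assumed stand-in data and library; the paper measured Weka build times.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for criterion in ("entropy", "gini"):  # entropy ~ ID3/C4.5, gini ~ CART
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)                      # time only the tree construction
    elapsed = time.perf_counter() - start
    print("%-8s %.4f sec" % (criterion, elapsed))
```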
Figure 3.2: Execution time of the data sets
From Table 5 it is observed that the accuracy of C4.5 is the highest for the Primary-Tumor data set, whereas for the Colon-Tumor data set ID3 and C4.5 have equal accuracy, both higher than CART, on the enhanced data sets. The time complexities to build a decision tree model using the ID3, C4.5 and CART classifiers on the enhanced tumor data sets are represented in Table 6.

Table 6 - Enhanced data sets - Execution time
Execution time (Sec)
Data set        ID3     C4.5    CART
Primary-Tumor   0.05    0.2     2.02
Colon-Tumor     0.27    0.47    2.56

4. Conclusion

Table 6 shows that the time complexity of the ID3 algorithm to build a model is the lowest among the three classifiers. As for accuracy, the C4.5 and ID3 algorithms exhibit better accuracy than the CART algorithm, and accuracy is the more important criterion for the classification of tumor data sets. Hence, C4.5 and ID3 are the best algorithms for finding out whether a tumor is benign or malignant when normal-size data sets are used; if the number of instances is doubled, ID3 and C4.5 reveal equal accuracy.

References

[1] Varun Kumar and Nisha Rathee, "Knowledge discovery from database using an integration of clustering and classification," International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 3, pp. 29-33, March 2011.
[2] R. Brachman, T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro and E. Simoudis, "Mining Business Databases," Comm. ACM, Vol. 39, No. 11, pp. 42-48, 1996.
[3] U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, Vol. 17, pp. 37-54, 1996.
[4] U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, Vol. 17, pp. 37-54, 1996.
[5] Richard J. Bolton and David J. Hand, "Statistical Fraud Detection: A Review," Statist. Sci., Vol. 17, No. 3, pp. 235-255, 2002.
[6] Antonia Vlahou, John O. Schorge, Betsy W. Gregory and Robert L. Coleman, "Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data," Journal of Biomedicine and Biotechnology, Vol. 2003, No. 5, pp. 308-314, 2003.
[7] W. J. Kuo, R. F. Chang, D. R. Chen and C. C. Lee, "Data Mining with decision trees for diagnosis of breast tumor in medical ultrasonic images," March 2001.
[8] H. Ren, "Clinical diagnosis of chest pain," Chinese Journal for Clinicians, Vol. 36, 2008.
[9] My Chau Tu, Dongil Shin and Dongkyoo Shin, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms," DASC '09 Proceedings of the 2009
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, pp. 81-106, 1986.
[21] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[22] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth, 1984.