
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 4, July August 2013 ISSN 2278-6856

Evaluation of Decision Tree Classifiers on Tumor Datasets


G. Sujatha¹, Dr. K. Usha Rani²

¹ Assistant Professor, Master of Computer Applications, Rao & Naidu Engineering College, Ongole, Andhra Pradesh, India

² Associate Professor, Department of Computer Science, Sri Padmavati Mahila Viswavidyalayam (Women's University), Tirupati, Andhra Pradesh, India

Abstract: Classification has long played an important role in data mining as well as in machine learning, statistics, neural networks and many expert systems. Different classification algorithms have been successfully implemented in various applications, among them scientific experiments, image processing, fraud detection and medical diagnosis. In recent years medical data classification, especially tumor data classification, has attracted considerable interest among researchers. Decision tree classifiers are used extensively for different types of tumor cases. In this paper, the performance of decision tree induction algorithms on tumor medical data sets is analyzed in terms of accuracy and time complexity.

Keywords: Data mining, Classification, Decision trees, Tumor data sets.

1. Introduction
A tumor is an abnormal cell growth that can be either benign or malignant. Benign tumors are non-invasive, while malignant tumors are cancerous and spread to other parts of the body. Early diagnosis and treatment help to prevent the spread of a tumor. Data mining is a convenient way of extracting patterns that represent knowledge implicitly stored in datasets, and it focuses on issues relating to their feasibility, usefulness, effectiveness and scalability. Data are preprocessed by data cleaning, data integration, data selection and data transformation. Data mining functionalities include classification, association, correlation analysis, prediction and cluster analysis. Classification is a fundamental task in data mining [1]. Classification is done by grouping similar data objects together. It can be defined as a supervised learning algorithm, as it assigns class labels to data objects based on the relationship between the data items and a pre-defined class label. Classification algorithms have a wide range of applications such as churn prediction, fraud detection, artificial intelligence and credit card rating [2], [3], [4]. Many classification algorithms are available in the literature, but the decision tree is the most commonly used because it is easier to implement and to understand than other classification algorithms. Decision tree classifiers are used extensively for diagnosis of breast tumors, ultrasonic images, ovarian cancer, heart sound diagnosis, etc. [5]-[10]. A decision tree classification algorithm can be implemented in a serial or parallel fashion depending on the volume of data, the memory space available on the computer and the scalability of the algorithm. Decision trees play a vital role in medical diagnosis. In this paper, the accuracy and time complexity of commonly used decision tree classifiers are compared on two tumor data sets.

The rest of the paper is organized as follows. Section 2 presents the theory and a review of decision tree induction algorithms, an overview of related work and an introduction to the data sets. Section 3 presents the experimental results and a performance comparison of the most frequently used decision tree classifiers, and Section 4 concludes.

2. Background
2.1 Overview of Related Work
Classification is one of the most fundamental and important tasks in data mining and machine learning. Many researchers have performed experiments on medical datasets using decision tree classifiers. A few are summarized here.

In [11], Aruna Sundaram et al. experimented with selecting predictive genes for effective cancer classification using the Hybrid Statistical Pattern Recognition (Hybrid SPR) algorithm. The data mining algorithms Simple CART, RBF Network, Naive Bayes and J48 were used to classify colon cancer with the marker genes selected by the algorithm, and the gene subset improved the predictive accuracy of all the classifiers. The algorithm was tested on a colon cancer data set.

In [12], Srinivas Mukkamata et al. presented computational intelligence techniques that can be useful at the diagnosis stage to assist the oncologist in identifying the malignancy of a tumor. They performed a t-test for significant gene expression analysis in different dimensions based on molecular profiles from microarray data, and compared several computational intelligence techniques for classification accuracy on selected datasets: Linear Genetic Programs, Multivariate Adaptive Regression Splines (MARS), Classification and Regression Trees (CART) and Random Forests.

In [13], Krzysztof Fujarewicz et al. explored the use of the Recursive Feature Selection (RFS) method for finding suboptimal gene subsets for tumor tissue classification. They found that the RFS method is able to find the smallest gene subset that gives no misclassification in leave-one-out cross-validation for a colon tumor data set.

Aruna et al. [14] presented a comparison of classification algorithms on the Wisconsin Breast Cancer and Breast Tissue datasets but did not apply feature selection as a pre-classification step. Moreover, they analyzed the classification results of only five algorithms, namely Naive Bayes, Support Vector Machines (SVM), Radial Basis Neural Networks (RB-NN), the J48 decision tree and Simple CART.

In [15], Luxmi et al. performed a comparative study of the performance of binary classifiers. They used the Wisconsin breast cancer dataset with 10 attributes but not the breast tissue dataset, and they did not examine the effect of feature selection on classification. Their experimental study was restricted to four classification algorithms, viz. ID3, C4.5, K-Nearest Neighbors (K-NN) and Support Vector Machines (SVM). Their results did not show complete accuracy for any of the classification algorithms.

In [16], D. Lavanya et al. analyzed the performance of decision tree classifiers on various medical datasets in terms of accuracy and time complexity and reported that CART performed best. In [17], Bijan Moghimi-Dehkordi et al. reviewed colorectal cancer survival rates and prognosis in Asia. They reported that colorectal cancer survival time has increased in the past decades, but the mortality rate remains higher than before.

2.2 Decision Tree
A decision tree algorithm is a data mining induction technique that recursively partitions a data set of records, using a depth-first greedy approach or a breadth-first approach, until all the data items belong to a particular class. A decision tree structure is made of root, internal and leaf nodes. The tree structure is used in classifying unknown data records. At each internal node of the tree, the best split is chosen using impurity measures. Decision tree classification is performed in two phases [18]: tree building and tree pruning. Tree building is done in a top-down manner. During this phase the training data is recursively partitioned until all the data items in a partition belong to the same class label. This is a tedious and computationally intensive task, as the training data set is traversed repeatedly. Tree pruning is done in a bottom-up fashion.
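The impurity measures used to choose a split can be made concrete with a small sketch. The code below is a minimal illustration rather than the implementation used in the experiments: it computes entropy and the Gini index of a set of class labels, and from entropy derives the information gain and gain ratio of one categorical attribute. The function names and the four-record toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the class labels by the value of one categorical attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the attribute itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one categorical attribute and a two-class label.
attribute = ["a", "a", "b", "b"]
label = ["benign", "malignant", "malignant", "malignant"]

print("entropy:", entropy(label))
print("gini:", gini(label))
print("information gain:", information_gain(attribute, label))
print("gain ratio:", gain_ratio(attribute, label))
```

A tree builder would evaluate one of these scores for every candidate attribute at a node and split on the attribute with the best score.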
Pruning is used to improve the prediction and classification accuracy of the algorithm by minimizing over-fitting (fitting noise or too much detail in the training data set). Over-fitting in a decision tree algorithm causes misclassification errors. Tree pruning is done in two ways: post-pruning and pre-pruning. Post-pruning takes a fully grown tree and discards its unreliable parts. Pre-pruning stops growing a branch when the information becomes unreliable. Table 1 shows the usage frequency of various decision tree algorithms [19].

TABLE 1-Frequency usage of Decision tree algorithms

Algorithm        Usage Frequency (%)
CLS              9
ID3              68
ID3+             4.5
C4.5             54.55
C5.0             9
CART             40.9
Random Tree      4.5
Random Forest    9
SLIQ             27.27
PUBLIC           13.6
OCI              4.5
CLOUDS           4.5
SPRINT           31.84

From the above table, the three most frequently used decision tree algorithms are ID3, C4.5 and CART. Hence, the experiments are conducted on these three algorithms.

ID3
The ID3 algorithm is a very simple decision tree algorithm developed by Quinlan in 1986 [20]. ID3 uses information gain as its splitting criterion. Growth stops when all instances belong to a single value of the target feature or when the best information gain is not greater than zero. ID3 does not apply any pruning procedure, nor does it handle numeric attributes or missing values; it accepts only categorical attributes in tree building. It also does not handle noisy data, so a preprocessing technique is used to remove the noise. Because ID3 cannot handle continuous attributes, discretization is used to convert continuous attributes to categorical attributes.

C4.5
The C4.5 algorithm is an improvement of the ID3 algorithm, developed by Ross Quinlan [21]. It is based on Hunt's algorithm and, like ID3, it is serially implemented. Pruning takes place in C4.5 by replacing an internal node with a leaf node, thereby reducing the error rate. Unlike ID3, C4.5 accepts both continuous and categorical attributes in building the decision tree. It has an enhanced method of tree pruning that reduces misclassification errors due to noise or too much detail in the training data set. As in ID3, the data is sorted at every node of the tree in order to determine the best splitting attribute. C4.5 uses gain ratio as the attribute selection measure to build a decision tree; the root node is the attribute with the highest gain ratio. C4.5 uses pessimistic pruning to delete unnecessary branches of the decision tree, which increases accuracy.

CART
CART (Classification and Regression Trees) was introduced by Breiman in 1984 [22]. It builds both classification and regression trees. It is also based on Hunt's model of decision tree construction and can be implemented serially. It uses the Gini index as the splitting measure for selecting the splitting attribute. Pruning is done in CART by using a portion of the training data set. CART uses both numeric and categorical attributes for building the decision tree and has built-in features that deal with missing attributes. CART differs from other algorithms based on Hunt's method in that it is also used for regression analysis with the help of regression trees. The regression analysis feature is used in forecasting a dependent variable given a set of predictor variables over a given period of time. The CART approach is an alternative to the traditional methods for prediction [23], [24], [25]. In the implementation of CART, the dataset is split into the two subgroups that are the most different with respect to the outcome. This procedure is continued on each subgroup until some minimum subgroup size is reached.

2.3 Brief Description of Data Sets
A. Primary-Tumor [26]
A primary tumor refers to a tumor or mass that is growing in the location where the cancer originated. For instance, if a patient is diagnosed with stomach cancer, the primary tumor would be found in the stomach itself rather than elsewhere in the body. The primary tumor is generally the easiest to remove; however, its removal does not necessarily mean that the patient is cancer-free. When cancer develops, mutated cells grow out of control in a particular area of the body. They grow so fast that they often form a cluster or mass in the area in which they originated. This mass eventually grows large enough to be seen by the naked eye or picked up by an ultrasound or other diagnostic tool. Generally, the mass that is first noticed by the patient or the doctors is the primary tumor. The first step in cancer treatment often involves the removal of the primary tumor, although this does not guarantee a recovery.

B. Colon-Tumor [27]
A colon tumor is an abnormal growth of cells found in the colon and can be an indication of colon cancer. If the colon tumor spreads to the bottom part of the colon, also known as the rectum, it can be an indication of colorectal cancer. The colon is the large intestine or large bowel. The rectum is the passageway that connects the colon to the anus. Some colon tumors are non-cancerous and are called benign polyps. Benign polyps are not themselves cancerous, but if they are not identified and removed they can change into cancerous tumors. Benign polyps are identified and removed through a procedure called a colonoscopy. Colorectal cancer (cancer of the colon or rectum) is the third most commonly diagnosed cancer in males and the second in females, with over 1.2 million new cancer cases and 608,700 deaths estimated to have occurred in 2008 [28]. The highest incidence rates are found in Australia and New Zealand, Europe, and North America, whereas the lowest rates are found in Africa and South-Central Asia [29], [30], [31]. The exact cause of colorectal cancer is unknown; in fact it is thought that there is no single cause. It is more likely that a number of factors, some known and many unknown, work together to trigger the development of colorectal cancer.

3. Experimental Result
The Primary tumor data is collected from the UCI Machine Learning Repository [32] and the Colon tumor data from the Bioinformatics Group Seville [33]; both are publicly available. The results were computed and analyzed with the Weka tool, using 10-fold cross-validation to test the accuracy and time complexity of the ID3, C4.5 and CART algorithms. Table 2 shows the characteristics of the selected tumor datasets.

TABLE 2-Characteristics of Data sets

Data set         No of Attributes   No of classes   No of instances   Missing values
Colon Tumor      2001               2               62                No
Primary-Tumor    18                 2               339               Yes

If the selected data contains missing values or empty cell entries, it must be preprocessed. In the preprocessing step, missing values are replaced with the mean of the respective attribute. These datasets contain both continuous and discrete attributes, but the ID3 algorithm does not support continuous attributes, so discretization is applied to the data sets. Here unsupervised discretization is used to convert continuous attributes to categorical attributes. Table 3 shows the accuracy of the ID3, C4.5 and CART algorithms on the above data sets using 10-fold cross-validation.

TABLE 3- Correctly classified instances

                 Accuracy (%)
Data set         ID3      C4.5     CART
Primary-Tumor    34.22    40.12    39.82
Colon-Tumor      59.68    82.26    75.81

From Table 3 it is clear that C4.5 gives better results on both tumor data sets. The classifiers' accuracy on the two datasets is shown as a bar graph.

Figure 3.1: Comparison of Classifiers Accuracy (bar graph; x-axis: Data sets, Primary Tumor and Colon-Tumor; y-axis: Accuracy (%); series: ID3, C4.5, CART)

The bar diagram shows that the C4.5 algorithm yields better accuracy than CART and ID3. Table 4 shows the time, in seconds, taken by each classifier to build the model from the training data.

TABLE 4- Execution Time to Build the Model

                 Execution time (Sec)
Data set         ID3      C4.5     CART
Primary-Tumor    0.03     0.02     0.3
Colon-Tumor      6.84     0.31     1.72
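The 10-fold cross-validation behind the accuracy figures in Table 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not Weka's procedure: the folds here are not stratified, and a majority-class predictor stands in for the decision tree learner so that the example stays self-contained. The function names and the toy labels are hypothetical; only the instance count of 62 is taken from Table 2.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class(train_labels):
    """Placeholder learner: always predicts the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_val_accuracy(labels, k=10, seed=0):
    """Percentage of instances classified correctly while held out."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on every instance that is not in the current test fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = majority_class(train)
        correct += sum(1 for i in test_idx if labels[i] == prediction)
    return 100.0 * correct / len(labels)

# Hypothetical labels for 62 instances (illustrative class balance only).
toy_labels = ["tumor"] * 40 + ["normal"] * 22
print(cross_val_accuracy(toy_labels))
```

Each instance is used for testing exactly once, so the reported accuracy is the share of all instances that were classified correctly by a model that never saw them during training.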

The time complexity to build a decision tree model using ID3, C4.5 and CART classifiers on the two tumor data sets is represented in the form of a line graph.

Fig 3.2 Execution time of the data sets (line graph; x-axis: Data sets, Primary-Tumor and Colon-Tumor; y-axis: Execution time (Sec); series: ID3, C4.5, CART)
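The preprocessing applied before the experiments in this section (mean replacement of missing values, then unsupervised discretization of continuous attributes) can be sketched as below. The paper does not say which unsupervised discretization method was used, so equal-width binning is an assumption here, as are the number of bins, the function names and the toy column.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def equal_width_bins(column, bins=3):
    """Map each numeric value to a bin index in 0..bins-1 (equal-width bins)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins
    if width == 0:
        # A constant column carries no information; put everything in bin 0.
        return [0 for _ in column]
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in column]

# Hypothetical continuous attribute with one missing value.
raw = [1.0, None, 3.0, 8.0, 10.0]
filled = impute_mean(raw)
codes = equal_width_bins(filled, bins=3)
print(filled)
print(codes)
```

The resulting bin indices can be treated as categorical values, which is the form ID3 requires.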

Fig 3.2 shows that the C4.5 algorithm takes the least time among the three classifiers. For the two data sets, both the accuracy and the time complexity of the C4.5 algorithm are better than those of the ID3 and CART algorithms. To observe the performance of the classifiers on enhanced data sets, the number of instances is doubled and the experiment repeated. The performance in terms of accuracy and time complexity is presented in Table 5 and Table 6.

Table 5-Enhanced Datasets-Accuracy

                 Accuracy (%)
Data set         ID3      C4.5     CART
Primary-Tumor    77.29    83.06    82.34
Colon-Tumor      95.16    95.16    91.94

From Table 5 it is observed that, for the enhanced datasets, the accuracy of the C4.5 algorithm is highest for Primary-Tumor, whereas for Colon-Tumor ID3 and C4.5 have equal accuracy, higher than that of CART. The time taken to build a decision tree model using the ID3, C4.5 and CART classifiers on the enhanced tumor data sets is given in Table 6.

Table 6-Enhanced Datasets-Execution time

                 Execution time (Sec)
Data set         ID3      C4.5     CART
Primary-Tumor    0.05     0.2      2.02
Colon-Tumor      0.27     0.47     2.56

Table 6 shows that ID3 takes the least time to build a model among the three classifiers on the enhanced data sets. In terms of accuracy, C4.5 and ID3 perform better than CART on the enhanced Colon-Tumor data set. Accuracy is more important than build time for the classification of tumor data sets. Hence, C4.5 is the best algorithm for finding out whether a tumor is benign or malignant when the data sets are used at their original size. When the number of instances is doubled, ID3 and C4.5 show equal accuracy on Colon-Tumor, while C4.5 remains the most accurate on Primary-Tumor.

4. Conclusion
Data mining is used in almost all applications. Classification is one of the data mining techniques, and it is used to classify data accurately and efficiently. Decision tree classifiers are popular because they are easy to understand and analyze. The most frequently used classifiers are ID3, C4.5 and CART. Experiments were conducted on these classifiers to compare their accuracy and the execution time needed to construct the tree. It is observed that C4.5 performs well for tumor datasets when the available datasets are used as they are. Among the three algorithms, C4.5 is the best for the enhanced Primary tumor data set, and for the enhanced Colon tumor data set ID3 and C4.5 show equal classification accuracy. In future work we intend to experiment with ensemble techniques on these decision tree classifiers for further analysis.

References
[1] Varun Kumar, Nisha Rathee, "Knowledge discovery from database using an integration of clustering and classification", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 3, March 2011, pp. 29-33.
[2] R. Brachman, T. Khabaza, W. Kloesgan, G. Piatetsky-Shapiro and E. Simoudis, "Mining Business Database", Comm. ACM, Vol. 39, No. 11, pp. 42-48, 1996.
[3] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Database", AI Magazine, Vol. 17, pp. 37-54, 1996.
[4] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Database", AI Magazine, Vol. 17, pp. 37-54, 1996.
[5] Richard J. Bolton, David J. Hand, "Statistical Fraud Detection: A Review", Statist. Sci., Vol. 17, No. 3, pp. 235-255, 2002.
[6] Antonia Vlahou, John O. Schorge, Betsy W. Gregory and Robert L. Coleman, "Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data", Journal of Biomedicine and Biotechnology, 2003:5 (2003), pp. 308-314.
[7] Kuo WJ, Chang RF, Chen DR and Lee CC, "Data Mining with decision trees for diagnosis of breast tumor in medical ultrasonic image", March 2001.
[8] H. Ren, "Clinical diagnosis of chest pain", Chinese Journal for Clinicians, Vol. 36, 2008.
[9] My Chau Tu, Dongil Shin, Dongkyoo Shin, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithm", DASC '09 Proceedings of the 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, IEEE Computer Society, Washington, DC, USA, 2009.
[10] Sung Ho Ha and Seong Hyeon Joo, "A Hybrid Data Mining Method for the Medical Classification of Chest Pain", World Academy of Science, Engineering and Technology 70, 2010.
[11] Matthew N. Anyanwu, Sajjan G. Shiva, "Comparative Analysis of Serial Decision Tree Classification Algorithms", International Journal of Computer Science and Security, Volume 3.
[12] Aruna Sundaram, "Hybrid SPR algorithm to select predictive genes for effectual cancer classification", 2010, Mathematics Subject Classification: 68T10, 68T05, 92B99.
[13] Srinivas Mukkamata, Qing Zhang Liu, Rajeev Verraghattam, Andrew H. Sung, "Computational Intelligent Techniques for Tumor Classification (Using Microarray Gene Expression Data)", Dept. of Computer Science, New Mexico Tech, Socorro, NM, USA, 2002.
[14] Krzysztof Fujarewicz, Malgorzata Wiench, "Selecting differentially expressed genes for colon tumor classification", Int. J. Appl. Math. Comput. Sci., 2003, Vol. 3, No. 3, pp. 327-335.
[15] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore, "Knowledge Based Analysis of Various Statistical Tools in Detecting Breast Cancer", 2011.
[16] Luxmi Verma, Dr. Varun Kumar, "Binary Classifiers for Health Care Databases: A Comparative Study of Data Mining Classification Algorithms in the Diagnosis of Breast Cancer", IJCST, Vol. 1, Issue 2, 2011.
[17] D. Lavanya, Dr. K. Usha Rani, "Performance Evaluation of Decision Tree Classifiers on Medical Datasets", International Journal of Computer Applications 26(4):1-4, July 2011.
[18] Bijan Moghimi-Dehkordi, Azadeh Safaee, "An overview of colorectal cancer survival rates and prognosis in Asia", World Gastrointest Oncol, 2012 April 15; 4(4), pp. 71-75.
[19] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[20] G Stasis, A.C. Loukis, E.N. Pavlopoulos, S.A. Koutsouris, D., "Using decision tree algorithms as a basis for a heart sound diagnosis decision support system", Information Technology Applications in Biomedicine, 2003, 4th International IEEE EMBS Special Topic Conference, April 2003.
[21] J.R. Quinlan, "Induction of decision tree", Journal of Machine Learning 1, 1986, pp. 81-106.
[22] J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, Inc., 1992.
[23] Breiman, Friedman, Olshen, and Stone, "Classification and Regression Trees", Wadsworth, 1984. Mezzovico, Switzerland.
[24] L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA.
[25] D. Steinberg and P.L. Colla, "CART: Tree-Structured Nonparametric Data Analysis", Salford Systems, San Diego, CA.
[26] D. Steinberg and P.L. Colla, "CART-Classification and Regression Trees", Salford Systems, San Diego, CA.
[27] Primary tumor, http://www.wisegeek.com/what-is-a-primary-tumor.htm
[28] Colon Tumor, http://www.wisegeek.com/what-is-a-colon-tumor.htm
[29] Jemal A, Siegel R, Ward E, Hao Y, Xu J, et al. (2008), "Cancer statistics, 2008", CA Cancer J Clin 58, pp. 71-96.
[30] A. Notterman Daniel, Uri Alon, Alexander J. Sierk, Arnold J. Levine, "Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays", Cancer Research, Vol. 61, No. 7, pp. 3124-3130, 2001.
[31] Desai Monica Dandona, Bikramajit Singh Saroya, Albert Craig Lockhart, "Investigational therapies targeting the ErbB (EGFR, HER2, HER3, HER4) family in GI cancers", Expert Opinion on Investigational Drugs, Vol. 0, pp. 1-16, 2013.
[32] Penninx, Brenda WJH, Jack M. Guralnik, Richard J. Havlik, Marco Pahor, Luigi Ferrucci, James R. Cerhan, Robert B. Wallace, "Chronically depressed mood and cancer risk in older persons", Journal of the National Cancer Institute, Vol. 90, No. 24, pp. 1888-1893, 1998.
[33] UCI Machine Learning Repository, www.ics.uci.edu/~mlearn/MLRepository.html
[34] Bioinformatic Group Seville, http://www.upo.es/eps/bigs/datasets.html
