Design Classification System for Computer-Aided Diagnosis
in CT Colonography
|
| |
| Authors: |
Haiyong Xu, MS, Virginia Tech - Wake Forest University School of Biomedical Engineering and Sciences; H. Donald Gage, PhD; Christopher L. Wyatt, PhD; Pete Santago, PhD; Yuan Shen
|
| |
| Background: |
Colorectal cancer is the second leading cause of cancer deaths in the United States and is expected to cause about 49,960 deaths during 2008.[1] Due to its invasiveness and perceived discomfort, subjects will often avoid screening for this disease, even though early detection can vastly improve a patient’s prognosis. To address this issue, a non-invasive imaging approach, CT Colonography (CTC), was introduced in the 1990s. One problem with this approach is that a single CTC exam requires a radiologist to review hundreds of CT images, which is both time consuming and prone to human error.[2] To address this issue, computer-aided detection (CAD) schemes have been proposed to assist the radiologist in detecting polyps and therefore decrease the time needed to complete an exam and increase the accuracy of the diagnosis.[3-5] Typical CAD systems first generate a list of polyp candidates using features derived from the image data. This list will include all, or almost all, actual polyps (true positives, high sensitivity) as well as thousands of false positive detections (low specificity). A classifier is then applied to the list, with the goal of reducing the number of false positives without compromising sensitivity. While many classification algorithms have been applied to CTC CAD schemes, there have been no comparisons among them. In this work, we address this issue by evaluating the performance of various classifiers on actual CTC data. Also, we address the imbalance class distribution issue in building classifier for CTC.
|
| |
| Evaluation: |
Data used for the work are from a study conducted at this institution, in which patients underwent both CTC and optical colonoscopy. After polyp candidate generation, the ratio between the number of TP and FP polyp candidates was extremely imbalanced (24,062 polyp candidates were generated with only 31 TP polyps, giving a ratio of TP to FP of 0.0013). Directly training a classifier with this highly imbalanced data would likely result in poor or skewed classification performance. We therefore employed random oversampling and SMOTE (Synthetic Minority Over-sampling Technique) methods, which have been investigated in [6-8], to deal with the class imbalance problem. In both techniques, the TP polyp candidates are over-sampled to match the number of FPs in order to make the class distribution in the oversampled dataset to be 1:1.
After polyp candidate generation, each polyp candidate consisted of 21 features, including seven statistics (mean, max, min, variance, skew, kurtosis, and contrast) for three voxel volumetric features (shape index, curvedness, and directional gradient concentration). In the classification stage, 24,062 polyp candidates (from 57 patients) were used to train and test classifiers which were selected from five classification algorithms: decision trees (C4.5), artificial neural network (ANN), support vector machines (SVMs), bagging, and AdaBoost. For each algorithm, we tuned its parameters in order to find a classifier model, which performed best on our CTC dataset. Cross validation was used to compare each classifier. Specifically, we did a 10-fold cross validation with stratification, and repeated the cross validation 10 times. The mean values of sensitivity, specificity, and area under receiver operating characteristic (ROC) curve (AUC) from the 10 repeated runs were used as criteria to compare classifier models. Table 1 provides a summary of the parameter settings for each classification algorithm. Since each parameter setting corresponds to a distinct classifier model, and we do 10-repeat 10-fold cross-validation, we used 101x10x10=10,100 classifiers in our experiment. We used the machine learning software WEKA, and extended its functionality to make it easier to organize a large scale experiment on our computer cluster environment – WFU DEAC cluster.[9]

Table 1. Parameters for five classification algorithms
In Figure 1, we selected one classifier model with the highest sensitivity from each of five classification algorithms, and presented its sensitivity, specificity, and AUC. In Figures 2 and 3, we presented all classifier models from AdaBoost (taking decision stump as weak learners) and SVMs classification algorithms since both of them performed better than the others. To determine the best performing classifier, we ranked classifiers according to sensitivity, specificity, and AUC. Then we summarized each classifier’s rank in sensitivity and AUC, and regarded the classifier that has the lowest total rank as the best performing model. In our experiment, the AdaBoost model with 25 decision stump weak learners was the best performing model since it had the lowest total rank of 17: #10 in terms of sensitivity and #7 in terms of AUC. This model was trained using the SMOTE method and yielded 0.72 sensitivity, 0.82 specificity, and 0.84 AUC. The FP findings are reduced from 105/dataset to 25/dataset, or from 422/case to 101/case.

Figure 1. Classifier performance for five classification models

Figure 2. Classifier performance for AdaBoost

Figure 3. Classifier performance for SVMs
|
| |
| Discussion: |
From our experiment results, we see that AdaBoost_DS (AdaBoost with decision stump as weak learner) and SVMs perform better than the other classifier models in terms of sensitivity and AUC. The AdaBoost_DS algorithm reveals better performance because of its advantage in FP reduction tasks and its strength in overcoming overfitting problem. In our experiment, we noticed that as the number of weak learners increased, the FP findings decreased dramatically, while the TP findings remained unchanged. Another characteristic of AdaBoost_DS is that it often does not suffer from overfitting problem. SVMs also perform well in our experiment. Given the strong connection between SVMs and AdaBoost algorithms, (SVMs use the $l_2$ norm for both the instance vector and the weight vector, while AdaBoost uses the $l_inf$ norm for the instance vector and $l_1$ norm for the weight vector), it is not surprising to see that SVMs and AdaBoost_DS have similar classification performance.[10] However, the computational requirement for these two algorithms is not similar. In our experiment, 13~70 hours are needed to train a SVMs (RBF kernel and no-linear polynomial kernel). However, all AdaBoost_DS classifiers can be trained within 800 seconds.
In our experiment, decision trees (C4.5) and ANNs were easily overfitted by our CTC data. The AdaBoost and Bagging classifier models, which take these two classifiers as weak learners or component classifiers, also suffered from the overfitting problem.
|
| |
| Conclusion: |
We conducted an experiment to address two classification issues in CTC. We used random oversampling and SMOTE methods to deal with imbalance class distribution issues, and we used repeated stratified cross validation for classifier model selection. We selected models from 5 classification algorithms and found that AdaBoost_DS and SVMs perform better than the others, in terms of sensitivity and AUC. In the future, we will further tune the AdaBoost_DS and SVMs to study performance improvement by changing the parameter settings, apply these two classifier models to other CTC data, and implement an ensemble classification system based on these models.
The research was supported by NIH/NCI grant with grant number 5 R01 CA114492-03. The authors would like to acknowledge the help from Yaorong Ge, Ph.D.
|
| |
| References: |
[1] American Cancer Society, “Colorectal Cancer: Early Detection.” 2008.
[2] Frentz S, Summers R. “Current Status of CT Colonography.” Academic Radiology. December 2006;Vol.13:12:1517-1531.
[3] Summers R, Beaulieu C, Pusanik L, et al. “Automated Polyp Detector for CT Colonography: Feasibility Study.” Radiology. 2000;Vol.216:1:285-290.
[4] Summers R, Johnson D, Pusanik L, Malley J, Youssef A, Reed J. “Automated Polyp Detection at CT Colonography: Feasibility Assessment in a Human Population.” Radiology. April 2001;Vol,219:1:51-59.
[5] Yoshida H, Masutani Y, MacEneaney P, Rubin DT, Dachman AH. “Computerized detection of colonic polyps at CT colonography on the basis of volumetric features: pilot study,” Radiology. February 2002;Vol.222:2:327-336.
[6] Van Hulse J, Khoshgoftaar T, Napolitano A. “Experimental perspectives on learning from imbalanced data.” ICML ’07: Proceedings of the 24th international conference on Machine learning, ACM International Conference Proceeding Series, 2007;Vol.227:935-942.
[7] Japkowicz N, Stephen S. “The class imbalance problem: A systematic study,” Intelligent Data Analysis. 2002;Vol.6:5:429-449.
[8] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research, 2002;Vol.16: 321-347
[9] Witten I, Frank E. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems) Morgan Kaufmann Publishers Inc. 2005.
[10] Freund Y, Schapire R. “A short introduction to boosting,” Journal of Japanese Society for Artificial Intelligence. September 1999;Vol.14:5:771-780. |
|
|
|
| |
| |
|