Machine Learning Techniques for the Diagnosis of Pediatric Tuberculosis

Coston, Amanda
Senior thesis
68 pages


Schapire, Robert
Princeton University. Department of Computer Science
Class year
Restrictions note
Walk-in Access. This thesis can only be viewed on computer terminals at the Mudd Manuscript Library.
Summary note
The goal of this project was two-fold: first, to improve the performance of machine learning algorithms for the diagnosis of pediatric tuberculosis, and second, to use machine learning algorithms to better understand the problem of diagnosis. We constructed and examined Bayes nets using a MATLAB toolbox by Kevin Murphy, and we experimented with 26 other machine learning algorithms in the Weka software package. We found that while the Bayes nets achieve better accuracy when we initialize parameters based on medical knowledge, creating our own network structure from medical knowledge did not increase performance; a naive Bayes net outperforms our handcrafted Bayes net. Neither the Bayes nets nor any of the Weka algorithms performed at the level necessary for use in real medical settings. Calibration curves show that the predicted probabilities of the Bayes nets and Weka algorithms do not correspond to the actual probability of positive diagnosis. Among the Weka algorithms, we found that decision-tree algorithms generally perform better, with the alternating decision tree and the ensemble methods (bagging and AdaBoost) on decision stumps performing best. Overall, false negative rates are much higher than false positive rates, which does not bode well for practical applications, since false negatives carry far more dire consequences in practice. We found that we could lower the false negative rates and generally improve the performance of the Bayes nets by guessing the labels of unknown instances, a method we call predictive labeling. Using a variety of algorithms, we also tested which features were most important to diagnosis. The structure of alternating decision trees, as well as of traditional decision trees, contributed to our understanding. We also randomized the data for each feature in turn to see which had the greatest effect on performance, reasoning that the feature whose randomization most degraded performance would be the most important.
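The feature-randomization test described above is what is now commonly called permutation importance. The thesis worked in MATLAB and Weka; the following Python sketch (function and class names are ours, not from the thesis) illustrates the idea under the assumption that the model exposes a `predict` method and that importance is measured as the drop in accuracy after shuffling one feature column.

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=5, rng=None):
    """Estimate feature importance by shuffling one column at a time.

    The drop in accuracy after shuffling a feature measures how much
    the model relies on it: the larger the drop, the more important
    the feature. This is a sketch, not the thesis's exact procedure.
    """
    rng = np.random.default_rng(rng)
    baseline = np.mean(model.predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # randomize feature j only
            scores.append(np.mean(model.predict(X_perm) == y))
        importances[j] = baseline - np.mean(scores)
    return importances
```

A model that ignores a feature will show near-zero importance for it, since shuffling that column cannot change any prediction.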
In addition, we implemented an explanation algorithm that selects, for each patient, the feature whose absence would most change the predicted probability of diagnosis. Using these algorithms, we found that the most important features for diagnosis were malaise and weight loss. Moving forward, we recommend obtaining larger and more comprehensive data sets, which may yield better performance from the Bayes nets and other machine learning algorithms.
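The per-patient explanation idea can be sketched as a simple ablation: for one patient's feature vector, set each feature in turn to an "absent" value and record which change moves the model's predicted probability the most. Everything here (the `prob_fn` interface, the zero encoding for absence) is an illustrative assumption, not the thesis's implementation.

```python
import numpy as np

def most_influential_feature(prob_fn, x, absent_value=0.0):
    """Find the feature whose removal most changes P(positive diagnosis).

    prob_fn: maps a feature vector to a predicted probability (assumed
             interface). absent_value: encoding for "feature not present"
             (assumed to be 0 here).
    Returns (feature index, absolute change in probability).
    """
    base = prob_fn(x)
    best_j, best_delta = None, -1.0
    for j in range(len(x)):
        x_mod = np.array(x, dtype=float)
        x_mod[j] = absent_value  # simulate the feature being absent
        delta = abs(prob_fn(x_mod) - base)
        if delta > best_delta:
            best_j, best_delta = j, delta
    return best_j, best_delta
```

Applied to every patient, the feature most often selected gives a rough global ranking, which is how findings like "malaise and weight loss matter most" could be aggregated.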
