- Coston, Amanda
- Senior thesis
- 68 pages
- Schapire, Robert
- Princeton University. Department of Computer Science
- Class year
- Restrictions note
- Walk-in Access. This thesis can only be viewed on computer terminals at the Mudd Manuscript Library.
- Summary note
- The goal of this project was twofold: first, to improve the performance of machine learning
algorithms for the diagnosis of pediatric tuberculosis, and second, to use machine learning algorithms
to better understand the problem of diagnosis. We constructed and examined Bayes nets
using a MATLAB toolbox by Kevin Murphy and we experimented with 26 other machine learning
algorithms in the Weka software package. We found that while the Bayes nets have better accuracy
when we initialize parameters based on medical knowledge, creating our own structure based on
medical knowledge did not increase performance; a naive Bayes net does better than our handcrafted
Bayes net. Neither the Bayes nets nor any of the Weka algorithms performed at the level
necessary for use in real medical settings. Calibration curves show that the predicted probabilities
of the Bayes nets and Weka algorithms do not correspond to the probability of positive diagnosis.
Among the Weka algorithms, we found that decision tree algorithms generally have better performance,
with the alternating decision tree and the ensemble methods (bagging and AdaBoost) on decision
stumps performing the best. Overall, false negative rates are much higher than false positive rates,
which does not bode well for practical applications, since false negatives have especially dire
consequences in real life. We found that we could lower the false negative rates and generally
improve the performance of the Bayes nets by guessing the label of unknown instances, a method
we call predictive labeling.
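
The calibration check mentioned above can be illustrated with a minimal sketch. This is not the thesis's MATLAB or Weka code; the function name `calibration_curve`, the equal-width binning, and the default bin count are assumptions made for illustration. Predictions are grouped by predicted probability, and each group's mean prediction is compared with the fraction of patients in it who were actually diagnosed positive.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return one (mean predicted probability, observed positive rate, count)
    tuple per non-empty bin. Hypothetical helper, not from the thesis."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((label, p))

    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than dividing by zero
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(label for label, _ in members) / len(members)
        curve.append((mean_pred, frac_pos, len(members)))
    return curve
```

For a well-calibrated classifier the first two entries of each tuple are close to each other; large gaps indicate that the predicted probabilities should not be read as probabilities of positive diagnosis.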
Using a variety of algorithms, we also tested which features were most important to diagnosis.
The structure of alternating decision trees as well as traditional decision trees contributed to our
understanding. We also randomized the data for each feature to see which had the greatest effect on
performance, reasoning that the feature whose randomization had the greatest effect would be the
most important. In addition, we implemented an explanation algorithm that selects, for each patient,
the feature whose absence would most change the probability of diagnosis. Using these algorithms
we found that the most important features for diagnosis were malaise and weight loss.
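
The feature randomization test described above can be sketched roughly as follows. This is an illustrative reconstruction, not the thesis's implementation: the names `accuracy` and `randomization_importance`, the use of accuracy as the performance measure, the number of repeats, and the `predict` callable (any function mapping a feature row to a label) are all assumptions.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def randomization_importance(predict, rows, labels, n_repeats=20, seed=0):
    """For each feature, shuffle its column across patients and report the
    average drop in accuracy. Rows are lists of feature values."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild the data with only feature j randomized.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Under this scheme, a feature the classifier ignores gets an importance of zero, while a feature it relies on produces a large drop when randomized.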
Moving forward, we recommend obtaining larger and more comprehensive data sets that may
yield better performance from the Bayes nets and other machine learning algorithms.