
Predicting Diabetes Using Tree-based Methods

Thesis for degree of Master (Uppsala University, Department of Statistics)

Abstract

The aim of this study is to develop a statistical model that predicts type 2 diabetes using tree-based methods, and to compare its classification performance with current medical criteria. The data comprise demographic factors and laboratory measurements for 60,318 patients from the MIMIC-III database, of whom 12,933 are pre-diagnosed as having diabetes; supervised learning with tree-based models is applied to these labelled data. Three classification methods are used to predict diabetes: Decision Tree, Random Forest, and Boosting with the XGBoost algorithm. The results show that XGBoost outperformed the other two models, yielding the highest classification rate with 84.6% test accuracy. However, the other two methods also show relatively high accuracy, which is comparable with the physician's medical approach. Two interesting findings from this paper are: 1) ensemble methods such as Random Forest and boosting can easily overfit the training data, but this problem can be addressed with careful hyper-parameter tuning; and 2) tree-based methods such as XGBoost and Random Forest can handle multicollinearity among the variables.
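
To put the reported test accuracy in context, the short sketch below computes the class prevalence and the accuracy of a trivial classifier that always predicts "no diabetes". It uses only the figures quoted in the abstract. It assumes that the test split has the same class ratio as the full cohort, which the abstract does not state, so the result is a rough reference point rather than a figure from the thesis.

```python
# Figures quoted in the abstract.
n_patients = 60_318
n_diabetic = 12_933
reported_test_accuracy = 0.846  # XGBoost test accuracy

# Assumption (not stated in the abstract): the test split has the
# same class ratio as the full cohort.
prevalence = n_diabetic / n_patients

# A classifier that always predicts the majority class is correct
# for every patient who belongs to that class.
majority_baseline = max(prevalence, 1 - prevalence)

# Gain of the reported model over that trivial baseline.
improvement = reported_test_accuracy - majority_baseline

print(f"Prevalence of diabetes: {prevalence:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"Gain over baseline: {improvement * 100:.1f} percentage points")
```

Because only about one patient in five carries the diabetes label, accuracy alone can look high even for an uninformative model, so a baseline of this kind is a useful companion to the accuracy figures.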
