Random Forest with Sampling Techniques for Handling Imbalanced Prediction of University Student Depression
Abstract
1. Introduction
- RQ1. Which questions in the PHQ-9 were involved in the depression prediction performance of university students who responded to this questionnaire? Is it necessary to use all nine questions?
- RQ2. Which oversampling or undersampling techniques should be used to achieve better depression prediction performance for university students who responded to this questionnaire?
- RQ3. Is using only one depression dataset enough to predict university student depression effectively?
2. Materials and Proposed Method
2.1. Feature Selection
2.1.1. Chi-Square
2.1.2. Gini Index
2.1.3. Information Gain
2.2. Proposed Method
1. The depression data were divided into training data, for generating the depression prediction model, and unseen testing data, for evaluating the model's predictive performance. Within the training data, 10-fold cross-validation was used to divide the data into 10 equally sized sets; each set was used in turn as the test set, while the random forest classifier was trained on the other nine sets [28] (a minimal sketch of this protocol follows this list).
2. Chi-square, the Gini index, and information gain were used to extract the principal feature items. These techniques were selected because they automatically identify meaningful smaller subsets of feature items while still yielding high depression prediction performance [29] (see the feature-ranking sketch after this list).
3. The reduced feature items of the training data were balanced by oversampling the minority classes using (1) random oversampling, (2) SMOTE, and (3) Borderline-SMOTE. The oversampling technique with the best prediction performance was then selected.
4. Likewise, the reduced feature items of the training data were undersampled using (1) Tomek links, (2) edited nearest neighbors, and (3) repeated edited nearest neighbors, and the undersampling technique with the best prediction performance was selected (a sketch comparing both families of techniques follows this list).
5. The best-performing oversampling and undersampling techniques from steps 3 and 4 were combined into a hybrid sampling approach. First, random oversampling generated new samples by sampling the minority-class training data at random with replacement, producing a new balanced training dataset. Then, Tomek links undersampling removed unwanted overlap between classes: majority-class samples lying closest to real or synthetic minority-class samples were deleted until a cleaner boundary was available for classifier decisions (see the hybrid-sampling sketch after this list).
6. The new training data from step 5 were used to generate the depression prediction model with the random forest classifier.
7. Finally, the unseen testing data were used to evaluate the performance of the depression prediction model.
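A minimal sketch of the step 1 protocol, assuming the nine PHQ-9 item scores form a feature matrix `X` and the four severity classes form the labels `y`. The placeholder data, seeds, and forest settings below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(400, 9))  # placeholder PHQ-9 item scores (0-3)
y = rng.integers(0, 4, size=400)       # placeholder severity classes 0-3

# 10-fold cross-validation: each fold is held out once while the random
# forest is trained on the remaining nine folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```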
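For step 2, a sketch of ranking the nine items with scikit-learn stand-ins for the three criteria: the chi-square statistic, Gini-based impurity importances from a random forest, and mutual information as an information-gain estimate. It reuses `X` and `y` from the sketch above and is not guaranteed to reproduce the weights reported in Section 3.4.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif

chi2_scores, _ = chi2(X, y)  # chi-square requires non-negative features, as here
gini_scores = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

for name, s in (("chi-square", chi2_scores),
                ("gini index", gini_scores),
                ("information gain", ig_scores)):
    order = sorted(range(9), key=lambda i: -s[i])  # rank items high to low
    print(name, ["Q%d" % (i + 1) for i in order])
```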
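For steps 3 and 4, a sketch comparing the three oversampling and three undersampling techniques via the imbalanced-learn toolbox [32], again on the placeholder `X`, `y`. In the actual procedure, resampling is applied to the training folds only, never to the unseen test data.

```python
from collections import Counter

from imblearn.over_sampling import BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours, TomekLinks)

samplers = [RandomOverSampler(random_state=0), SMOTE(random_state=0),
            BorderlineSMOTE(random_state=0), TomekLinks(),
            EditedNearestNeighbours(), RepeatedEditedNearestNeighbours()]

for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)  # rebalance or clean the classes
    print(type(sampler).__name__, sorted(Counter(y_res).items()))
```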
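Finally, steps 5-7 as one sketch: hold out an unseen test split, balance the training portion with random oversampling followed by Tomek-link cleaning, fit the random forest, and report per-class precision, recall, and F-measure. The 90/10 split and the seeds are assumptions for illustration.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Step 5: oversample the minority classes with replacement, then delete the
# majority-class members of Tomek links to sharpen the class boundary.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
X_bal, y_bal = TomekLinks().fit_resample(X_bal, y_bal)

# Steps 6-7: train on the resampled data, evaluate on the untouched test set.
model = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_test, model.predict(X_test)))
```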
3. Results and Performance Evaluation
3.1. Dataset Description
3.2. Ethical Consideration
3.3. Assessment Metrics
3.4. Results
4. Deployment and Discussion
4.1. Research Questions and Proposed Solutions
4.1.1. RQ1. Which Questions in PHQ-9 Were Involved in the Depression Prediction Performance of University Students Who Responded to This Questionnaire? Is It Necessary to Use All 9 Questions?
4.1.2. RQ2. Which Oversampling or Undersampling Techniques Should Be Used to Achieve Better Depression Prediction Performance of University Students Who Responded to This Questionnaire?
4.1.3. RQ3. Is Using Only One Dataset of Depression Enough to Predict University Student Depression Effectively?
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Ebert, D.D.; Buntrock, C.; Mortier, P.; Auerbach, R.; Weisel, K.K.; Kessler, R.C.; Cuijpers, P.; Green, J.G.; Kiekens, G.; Nock, M.K.; et al. Prediction of major depressive disorder onset in college students. Depress. Anxiety 2019, 36, 294–304.
2. Jatchavala, C.; Chan, S. Thai Adolescent Depression: Recurrence Prevention in Practice. J. Health Sci. Med. Res. 2018, 36, 147–155.
3. Kroenke, K.; Spitzer, R.L.; Williams, J.B.W. The PHQ-9: Validity of a Brief Depression Severity Measure. J. Gen. Intern. Med. 2001, 16, 606–613.
4. Al-Busaidi, Z.; Bhargava, K.; Al-Ismaily, A.; Al-Lawati, H.; Al-Kindi, R.; Al-Shafaee, M.; Al-Maniri, A. Prevalence of Depressive Symptoms among University Students in Oman. Oman Med. J. 2011, 26, 235–239.
5. Levis, B.; Benedetti, A.; Thombs, B.D. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: Individual participant data meta-analysis. BMJ 2019, 365, 1–11.
6. Chang, Y.S.; Hung, W.C.; Juang, T.Y. Depression diagnosis based on ontologies and Bayesian networks. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 13–16 October 2013; pp. 3452–3457.
7. Thanathamathee, P. Boosting with feature selection technique for screening and predicting adolescents depression. In Proceedings of the Fourth International Conference on Digital Information and Communication Technology and Its Applications (DICTAP), Bangkok, Thailand, 6–8 May 2014; pp. 23–27.
8. Ghafoor, Y.; Huang, Y.P.; Lui, S.I. An intelligent approach to discovering common symptoms among depressed patients. Soft Comput. 2015, 19, 819–824.
9. Hou, Y.; Xu, J.; Huang, Y.; Ma, X. A big data application to predict depression in the university based on the reading habits. In Proceedings of the 3rd International Conference on Systems and Informatics (ICSAI), Shanghai, China, 19–21 November 2016; pp. 1085–1089.
10. Li, X.; Hu, B.; Sun, S.; Cai, H. EEG-based mild depressive detection using feature selection methods and classifiers. Comput. Methods Programs Biomed. 2016, 136, 151–161.
11. Kim, J.Y.; Liu, N.; Tan, X.T.; Chu, C.H. Unobtrusive Monitoring to Detect Depression for Elderly with Chronic Illnesses. IEEE Sens. J. 2017, 17, 5694–5704.
12. Gao, S.; Calhoun, V.D.; Sui, J. Machine learning in major depression: From classification to treatment outcome prediction. CNS Neurosci. Ther. 2018, 24, 1037–1052.
13. Alzyoud, S.; Kharabsheh, M.; Mudallal, R. Predicting Depression Level of Youth Smokers Using Machine Learning. Int. J. Sci. Technol. Res. 2019, 8, 2245–2248.
14. Blessie, E.C.; George, B. A Novel Approach for Psychiatric Patient Detection and Prediction Using Data Mining Techniques. Int. J. Eng. Res. Technol. 2019, 7, 1–4.
15. Priya, A.; Garg, S.; Tigga, N.P. Predicting Anxiety, Depression and Stress in Modern Life using Machine Learning Algorithms. Procedia Comput. Sci. 2020, 167, 1258–1267.
16. Khalilia, M.; Chakraborty, S.; Popescu, M. Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inform. Decis. Mak. 2011, 11, 1–13.
17. Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
18. Thanathamathee, P.; Lursinsap, C. Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques. Pattern Recognit. Lett. 2013, 34, 1339–1347.
19. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the 2005 International Conference on Advances in Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887.
20. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772.
21. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, 2, 408–421.
22. Tomek, I. An Experiment with the Edited Nearest-Neighbor Rule. IEEE Trans. Syst. Man Cybern. 1976, 6, 448–452.
23. Bektas, J.; Ibrikci, T.; Ozcan, I.T. Classification of Real Imbalanced Cardiovascular Data Using Feature Selection and Sampling Methods: A Case Study with Neural Networks and Logistic Regression. Int. J. Artif. Intell. Tools 2017, 26, 1750019.
24. Sharma, A.; Verbeke, W.J.M.I. Improving Diagnosis of Depression with XGBOOST Machine Learning Model and a Large Biomarkers Dutch Dataset. Front. Big Data 2020, 3, 1–11.
25. Zhai, Y.; Song, W.; Liu, X.; Lui, L.Z.; Zhao, X. A Chi-Square Statistics Based Feature Selection Method in Text Classification. In Proceedings of the IEEE 9th International Conference on Software Engineering and Service Science (ICSESS 2018), Beijing, China, 23–25 November 2018; pp. 160–163.
26. Park, H.; Kwon, H.C. Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification. IEICE Trans. Inf. Syst. 2011, E94-D, 855–865.
27. Dhir, C.S.; Iqbal, N.; Lee, S.Y. Efficient feature selection based on information gain criterion for face recognition. In Proceedings of the International Conference on Information Acquisition, Jeju, Korea, 8–11 July 2007; pp. 523–527.
28. Guo, H.; Viktor, H. Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor. 2004, 6, 30–39.
29. Sirisathitkul, Y.; Thanathamathee, P.; Aekwarangkoon, S. Predictive Apriori Algorithm in Youth Suicide Prevention by Screening Depressive Symptoms from Patient Health Questionnaire-9. TEM J. 2019, 8, 1449–1455.
30. Kaur, A.; Kaur, I. An empirical evaluation of classification algorithms for fault prediction in open source projects. J. King Saud Univ. Comput. Inf. Sci. 2018, 30, 2–17.
31. Chen, S.; He, H.; Garcia, E. RAMOBoost: Ranked minority oversampling in boosting. IEEE Trans. Neural Netw. 2010, 21, 1624–1642.
32. Lemaitre, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. 2017, 18, 17.
33. Batista, G.; Prati, R.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explor. 2004, 6, 20–29.
34. Lotrakul, M.; Sumrithe, S.; Saipanish, R. Reliability and validity of the Thai version of the PHQ-9. BMC Psychiatry 2008, 8, 1–7.
35. Prati, R.C.; Batista, G.E.A.P.A.; Monard, M.C. Data mining with imbalanced class distributions: Concepts and methods. In Proceedings of the 4th Indian International Conference on Artificial Intelligence (IICAI-09), Tumkur, India, 16–18 December 2009; pp. 359–376.
36. Rice, F.; Riglin, L.; Lomax, T.; Souter, E.; Potter, R.; Smith, D.J.; Thapar, A.K.; Thapar, A. Adolescent and adult differences in major depression symptom profiles. J. Affect. Disord. 2019, 243, 175–181.
| Question No. | During the Past 2 Weeks Including Today, How Often Have You Had These Symptoms? |
|---|---|
| Q1 | Little interest or pleasure in doing things |
| Q2 | Feeling down, depressed, or hopeless |
| Q3 | Trouble falling asleep, staying asleep, or sleeping too much |
| Q4 | Feeling tired or having little energy |
| Q5 | Poor appetite or overeating |
| Q6 | Feeling bad about yourself or that you are a failure or have let yourself or your family down |
| Q7 | Trouble concentrating on things, such as reading the newspaper or watching television |
| Q8 | Moving or speaking so slowly that other people could have noticed. Alternatively, being so fidgety or restless that you have been moving around a lot more than usual |
| Q9 | Thoughts that you would be better off dead or of hurting yourself in some way |
| Depression Symptoms | PHQ-9 Score | N | Proportion of Each Class (%) |
|---|---|---|---|
| No symptoms | 0–6 | 821 | 53.04 |
| Mild | 7–12 | 567 | 36.63 |
| Moderate | 13–18 | 130 | 8.39 |
| Severe | 19–27 | 30 | 1.94 |
| | Actual Positive (P) | Actual Negative (N) |
|---|---|---|
| Predicted Positive (Y) | TP (true positives) | FP (false positives) |
| Predicted Negative (N) | FN (false negatives) | TN (true negatives) |
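The tables that follow report the standard quantities derived from these confusion-matrix counts, computed per class (one-vs-rest) in the multi-class setting:

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F\text{-measure}   = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```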
| Feature Item (Chi-Square) | Weight | Feature Item (Gini Index) | Weight | Feature Item (Information Gain) | Weight |
|---|---|---|---|---|---|
| Q8 | 1 | Q4 | 1 | Q4 | 1 |
| Q6 | 0.932 | Q8 | 0.978 | Q6 | 0.926 |
| Q4 | 0.891 | Q6 | 0.955 | Q7 | 0.907 |
| Q7 | 0.881 | Q7 | 0.95 | Q8 | 0.852 |
| Q5 | 0.474 | Q2 | 0.715 | Q2 | 0.714 |
| Q2 | 0.437 | Q3 | 0.601 | Q3 | 0.597 |
| Q3 | 0.227 | Q5 | 0.508 | Q5 | 0.534 |
| Q1 | 0.158 | Q1 | 0.247 | Q1 | 0.372 |
| Q9 | 0 | Q9 | 0 | Q9 | 0 |
| Technique | Class | Accuracy (%) | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|---|
| Random forest without sampling | 0 | 91.26 | 96.39 | 97.56 | 96.97 |
| | 1 | | 86.66 | 94.69 | 90.50 |
| | 2 | | 84.62 | 42.31 | 56.41 |
| | 3 | | 57.14 | 66.67 | 61.54 |
| Technique | Class | Accuracy (%) | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|---|
| Random forest with random oversampling | 0 | 93.53 | 97.53 | 98.17 | 97.85 |
| | 1 | | 88.00 | 96.76 | 92.17 |
| | 2 | | 86.08 | 73.76 | 79.45 |
| | 3 | | 83.33 | 83.33 | 83.33 |
| Random forest with SMOTE | 0 | 92.56 | 97.53 | 96.34 | 96.93 |
| | 1 | | 90.43 | 92.04 | 91.23 |
| | 2 | | 76.00 | 73.08 | 74.51 |
| | 3 | | 71.43 | 83.33 | 76.92 |
| Random forest with Borderline-SMOTE | 0 | 91.59 | 97.50 | 95.12 | 96.30 |
| | 1 | | 87.60 | 93.81 | 90.60 |
| | 2 | | 77.27 | 65.38 | 70.83 |
| | 3 | | 66.67 | 66.67 | 66.67 |
| Technique | Class | Accuracy (%) | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|---|
| Random forest with Tomek links | 0 | 91.59 | 96.41 | 98.17 | 97.28 |
| | 1 | | 87.70 | 94.69 | 91.06 |
| | 2 | | 84.62 | 42.31 | 56.41 |
| | 3 | | 57.14 | 66.67 | 61.54 |
| Random forest with edited nearest neighbors | 0 | 88.35 | 94.12 | 97.56 | 95.81 |
| | 1 | | 81.75 | 91.15 | 86.19 |
| | 2 | | 100.00 | 19.23 | 32.26 |
| | 3 | | 71.43 | 83.33 | 76.92 |
| Random forest with repeated edited nearest neighbors | 0 | 87.70 | 93.02 | 97.56 | 95.24 |
| | 1 | | 81.60 | 90.27 | 85.72 |
| | 2 | | 100.00 | 15.38 | 26.66 |
| | 3 | | 62.50 | 83.33 | 71.43 |
| Depression Symptom | Number of Data | Fraction (%) |
|---|---|---|
| No symptoms | 658 | 25.02 |
| Mild | 658 | 25.02 |
| Moderate | 658 | 25.02 |
| Severe | 656 | 24.94 |
| Technique | Class | Accuracy (%) | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|---|
| Random forest with random oversampling | 0 | 93.53 | 97.53 | 98.17 | 97.85 |
| | 1 | | 88.00 | 96.76 | 92.17 |
| | 2 | | 86.08 | 73.76 | 79.45 |
| | 3 | | 83.33 | 83.33 | 83.33 |
| Random forest with Tomek links | 0 | 91.59 | 96.41 | 98.17 | 97.28 |
| | 1 | | 87.70 | 94.69 | 91.06 |
| | 2 | | 84.62 | 42.31 | 56.41 |
| | 3 | | 57.14 | 66.67 | 61.54 |
| Random forest combined with random oversampling and Tomek links | 0 | 94.17 | 98.17 | 98.17 | 98.17 |
| | 1 | | 95.58 | 97.35 | 96.46 |
| | 2 | | 87.62 | 89.47 | 88.54 |
| | 3 | | 83.33 | 83.33 | 83.33 |
p-Value (Proposed Method vs. Other Sampling Techniques)

| Technique | Precision | Recall | F-Measure |
|---|---|---|---|
| Random forest with random oversampling | 0.0029 | 0.1117 | 0.0090 |
| Random forest with Tomek links | 0.0045 | 0.0737 | 0.0310 |
| Depression Symptom | Number of Data | Fraction (%) |
|---|---|---|
| No symptoms | 113 | 68.48 |
| Mild | 42 | 25.45 |
| Moderate | 9 | 5.45 |
| Severe | 1 | 0.62 |
Accuracy: 93.33%

| | Actual 0 | Actual 1 | Actual 2 | Actual 3 | Precision (%) |
|---|---|---|---|---|---|
| Pred. 0 | 112 | 7 | 0 | 0 | 94.12 |
| Pred. 1 | 1 | 34 | 2 | 0 | 91.90 |
| Pred. 2 | 0 | 1 | 7 | 0 | 87.50 |
| Pred. 3 | 0 | 0 | 0 | 1 | 100 |
| Recall (%) | 99.12 | 80.95 | 77.78 | 100 | |