An overview of diabetes diagnosis methods on the Pima Indian dataset

Document Type: Research Paper


1 Department of Computer Science, Faculty of Mathematics and Computer, Shahid Bahonar University of Kerman, Kerman, Iran

2 School of Health, Isfahan University of Medical Sciences, Isfahan, Iran


In recent years, data mining and machine learning methods have received much attention in the medical field and have helped address many of its complex problems. One difficulty facing researchers is finding an appropriate dataset: datasets to which different data mining and machine learning methods can be applied are rare. One of the most reliable and appropriate datasets for diabetes diagnosis is the Pima Indian dataset. In this article, we review the methods implemented in recent years using machine learning classification algorithms on this dataset and compare them in terms of evaluation criteria and feature selection methods. The comparison shows that models that used feature selection were more accurate than the other approaches.


