Prediction of heart disease by classifying with feature selection and machine learning methods

Cengiz Gazeloglu

doi:10.23751/pn.v22i2.9830

Authors

Cengiz Gazeloglu, Faculty of Science Literature, Department of Statistics, Suleyman Demirel University, Isparta, Turkey

Keywords:

Machine Learning, Feature Selection, Heart Disease, Classification, Artificial Intelligence

Abstract

Study Objectives: Cardiovascular diseases are among the most common diseases in humans, and they are costly to treat. According to the World Health Organization report, 56 million deaths occurred worldwide in 2012.

Methods: This study aims to determine the method(s) with the most accurate classification rate for cardiovascular diseases using machine learning and feature selection methods. To this end, 18 machine learning methods divided into 6 categories and 3 feature selection methods were used. The analyses were carried out in WEKA, Python, and MATLAB.

Results: Without feature selection, SVM (PolyKernel) was the most successful machine learning algorithm, with a classification rate of 85.148%. After Correlation-based Feature Selection (CFS), the most successful algorithms were Naive Bayes and Fuzzy RoughSet, with a rate of 84.818%. After Chi-Square feature selection, the most successful algorithm was the RBF Network algorithm, with a rate of 81.188%.

Conclusion: Specialist doctors who want to classify heart disease are advised to use the SVM (PolyKernel) algorithm if they do not use feature selection, and the Naive Bayes algorithm if they use CFS for feature selection. If they use Fuzzy Rough Set and Chi-Square for feature selection, the RBFNetwork algorithm is recommended.
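As an illustration of the Chi-Square feature selection step named in the abstract, the sketch below scores categorical features by their Pearson chi-square statistic against the class label and ranks them. It is a minimal standard-library example, not the author's WEKA, Python, or MATLAB implementation; the function names, the feature names (`chest_pain`, `sex`), and the toy data are all hypothetical.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a categorical feature and class labels."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    score = 0.0
    for f_val, f_n in feature_counts.items():
        for l_val, l_n in label_counts.items():
            # Expected cell count if the feature and the label were independent.
            expected = f_n * l_n / n
            score += (observed[(f_val, l_val)] - expected) ** 2 / expected
    return score


def rank_features(columns, labels):
    """Return (feature name, score) pairs sorted from highest to lowest score."""
    scores = {name: chi_square_score(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = heart disease present, 0 = absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "chest_pain": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly associated with the label
    "sex": [1, 0, 1, 0, 1, 0, 1, 0],         # independent of the label
}

ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature selection step would then keep the top-ranked features and pass only those to the classifier. In a real analysis, continuous attributes would first need to be discretized before this kind of scoring applies.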
