Comparative Analysis of PCOS Classification Using Random Forest: Integration of Mutual Information, SMOTE-Tomek, and Outlier Handling

  • Selviana Dwi Aprianti, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Farrikh Alzami, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Ifan Rizqa, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Ricardus Anggi Pramunendar, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Rama Aria Megantara, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Muhammad Naufal, Faculty of Computer Science, Universitas Dian Nuswantoro
  • Dwi Puji Prabowo, Faculty of Computer Science, Universitas Dian Nuswantoro
Keywords: PCOS, Random Forest, Outlier Detection, Feature Selection, SMOTE-Tomek, Medical Classification

Abstract

Polycystic Ovary Syndrome (PCOS) is a hormonal disorder affecting women of reproductive age, with a global prevalence of 8–13%, yet approximately 70% of cases remain undiagnosed. This study developed and compared eight Random Forest classification models for PCOS detection using a publicly available Kaggle dataset. The methodology incorporated three preprocessing techniques: outlier handling using the Interquartile Range (IQR) method, feature selection through Mutual Information, and class imbalance correction via SMOTE-Tomek. The best-performing model, which applied outlier removal and SMOTE without feature selection, achieved an accuracy of 94.11%. This substantially outperformed the baseline Random Forest model, which used none of the three preprocessing techniques and achieved an accuracy of 87.27%. The model that used only SMOTE for class balancing achieved an accuracy of 93.84%, underscoring the importance of addressing class imbalance for classification performance. Notably, feature selection did not consistently improve accuracy, as Random Forest inherently handles feature redundancy and captures complex feature interactions. These findings highlight the importance of tailored preprocessing strategies, particularly outlier handling and class balancing, for optimizing medical data classification. Future research should explore clinically informed feature selection techniques and assess how well these findings generalize across diverse datasets, to enhance the clinical relevance of PCOS detection models.
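The experimental design described in the abstract can be illustrated with a minimal, standard-library sketch. It rests on two assumptions that the abstract does not state: that the eight models are the 2^3 on/off combinations of the three preprocessing steps, and that the IQR method uses the conventional 1.5 multiplier for its fences. All function and key names are hypothetical, and only the configuration grid and the IQR filter are shown; Mutual Information ranking, SMOTE-Tomek resampling, and Random Forest training are not implemented here.

```python
from itertools import product
from statistics import quantiles

# Hypothetical labels for the three preprocessing steps named in the abstract.
STEPS = ("outlier_handling", "feature_selection", "smote_tomek")


def enumerate_configs():
    """Return every on/off combination of the three steps (2**3 = 8).

    Assumption: the eight compared models form a full factorial design.
    """
    return [dict(zip(STEPS, flags))
            for flags in product((False, True), repeat=len(STEPS))]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    Assumption: k = 1.5 (Tukey's convention); the abstract gives no value.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


if __name__ == "__main__":
    for config in enumerate_configs():
        print(config)

    # Toy records with one extreme value; not taken from the PCOS dataset.
    sample = [{"value": v} for v in (1, 2, 3, 4, 5, 6, 7, 8, 100)]
    print(remove_outliers(sample, "value"))
```

Under these assumptions, the baseline corresponds to the configuration with all three flags off, and the best-performing model reported in the abstract corresponds to outlier handling and resampling on with feature selection off.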

Published
2025-02-01
How to Cite
Aprianti, S. D., Alzami, F., Rizqa, I., Pramunendar, R. A., Megantara, R. A., Naufal, M., & Prabowo, D. P. (2025). Comparative Analysis of PCOS Classification Using Random Forest: Integration of Mutual Information, SMOTE-Tomek, and Outlier Handling. Inform : Jurnal Ilmiah Bidang Teknologi Informasi Dan Komunikasi, 10(1), 78-87. https://doi.org/10.25139/inform.v10i1.9231