Comparative Analysis of PCOS Classification Using Random Forest: Integration of Mutual Information, SMOTE-Tomek, and Outlier Handling


Abstract
Polycystic Ovary Syndrome (PCOS) is a hormonal disorder affecting women of reproductive age, with a global prevalence of 8–13%, yet approximately 70% of cases remain undiagnosed. This study developed and compared eight Random Forest classification models for PCOS detection using a publicly available Kaggle dataset. The methodology incorporated three preprocessing techniques: outlier handling using the Interquartile Range (IQR) method, feature selection through Mutual Information, and class imbalance handling via SMOTE-Tomek. The best-performing model, which applied outlier removal and SMOTE without feature selection, achieved an accuracy of 94.11%. This is a gain of 6.84 percentage points over the baseline Random Forest model, which achieved an accuracy of 87.27% with no preprocessing (no outlier removal, SMOTE, or feature selection). The model using only SMOTE for class balancing achieved an accuracy of 93.84%, underscoring the importance of addressing class imbalance for classification performance. Notably, feature selection did not consistently improve accuracy, as Random Forest inherently handles feature redundancy and captures complex feature interactions. These findings highlight the importance of tailored preprocessing strategies, particularly outlier handling and class balancing, for optimizing medical data classification. Future research should explore clinically informed feature selection techniques and assess the generalizability of these findings across diverse datasets to enhance the clinical relevance of PCOS detection models.
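The experimental design summarized in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the source: that the eight models correspond to every on/off combination of the three preprocessing steps (2^3 = 8), that IQR handling removes rows falling outside Tukey fences with a multiplier of 1.5, and that quartiles are computed with linear interpolation. The names `STEPS`, `enumerate_configs`, `iqr_bounds`, and `drop_outlier_rows` are hypothetical and are not taken from the authors' code.

```python
from itertools import product
from statistics import quantiles

# The three preprocessing steps named in the abstract, each treated as an
# on/off switch (assumption: this is how the eight models were formed).
STEPS = ("outlier_handling", "mutual_information", "smote_tomek")


def enumerate_configs():
    """Return every on/off combination of the three steps (2**3 = 8)."""
    return [
        dict(zip(STEPS, flags))
        for flags in product((False, True), repeat=len(STEPS))
    ]


def iqr_bounds(values, k=1.5):
    """Compute the fences [Q1 - k*IQR, Q3 + k*IQR] for a numeric column.

    k=1.5 is the conventional Tukey multiplier, assumed here because the
    abstract does not report the value that was used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outlier_rows(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies inside the fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]
```

Each dictionary returned by `enumerate_configs()` would drive one training run: the all-`False` entry corresponds to the baseline model, and the entry with `outlier_handling` and `smote_tomek` enabled but `mutual_information` disabled corresponds to the best-performing configuration reported above. The abstract does not name the libraries used. In a scikit-learn workflow, the remaining steps would typically rely on `mutual_info_classif`, `RandomForestClassifier`, and imbalanced-learn's `SMOTETomek`, with resampling applied to the training split only.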
Copyright (c) 2025 Selviana Dwi Aprianti, Farrikh Alzami, Ifan Rizqa, Ricardus Anggi Pramunendar, Rama Aria Megantara, Muhammad Naufal, Dwi Puji Prabowo

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with Inform: Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.