SMOTE-Tomek Re-sampling Based on Random Forest Method to Overcome Unbalanced Data for Multi-class Classification


Abstract
Mobile app reviews reveal which app characteristics users want. By applying sentiment and emotion classification to review data, app providers can improve their apps in line with user preferences. However, text-based analysis often suffers from two problems, data variation and class imbalance, which can lead to biased and inaccurate classification models. Pre-processing suited to the data, followed by feature extraction for word weighting, is needed to handle the variation, and re-sampling is needed to correct the imbalance in the sample distribution. Used alone, common re-sampling techniques address only one side of the imbalance: Tomek Links under-samples the majority data, whereas SMOTE over-samples the minority data. This research applies the combined SMOTE-Tomek technique, which targets both the minority and the majority data. Model performance improves because the technique pairs over-sampling of the minority data with under-sampling that removes noisy samples. The re-sampled data was classified with a Random Forest ensemble model, which achieved a Precision of 84%, Recall of 84%, F1-Score of 84%, and Accuracy of 84%. After hyperparameter optimization with GridSearchCV, Precision, Recall, F1-Score, and Accuracy each increased to 85%.
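The re-sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `remove_tomek_links`, the neighbour count `k=2`, and the toy 2-D points standing in for word-weight vectors are all assumptions made for illustration. SMOTE is sketched as interpolation between a sample and one of its nearest same-class neighbours, and a Tomek link as a pair of mutual nearest neighbours with different labels, both of which are removed.

```python
import math
import random
from collections import Counter


def smote(X, y, rng, k=2):
    """Oversample each smaller class up to the size of the largest class.

    Each synthetic point lies at a random position on the segment between
    an original sample and one of its k nearest same-class neighbours.
    Assumes every class has at least two samples.
    """
    X_out, y_out = list(X), list(y)
    counts = Counter(y)
    target = max(counts.values())
    for label, count in counts.items():
        members = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            i = rng.choice(members)
            neighbours = sorted(
                (j for j in members if j != i),
                key=lambda j: math.dist(X[i], X[j]),
            )[:k]
            j = rng.choice(neighbours)
            gap = rng.random()
            X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
            y_out.append(label)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (single pass).

    A Tomek link is a pair of samples that are each other's nearest
    neighbour but carry different labels.
    """
    n = len(X)
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: math.dist(X[i], X[j]))
        for i in range(n)
    ]
    linked = {
        i for i in range(n)
        if nearest[nearest[i]] == i and y[i] != y[nearest[i]]
    }
    kept = [i for i in range(n) if i not in linked]
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data: three imbalanced classes, one "positive" point
# placed inside the "neutral" cluster to act as noise.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
     (3.0, 3.0), (3.2, 3.1), (3.1, 2.8), (0.25, 0.25),
     (0.0, 3.0), (0.2, 3.2)]
y = ["neutral"] * 6 + ["positive"] * 4 + ["negative"] * 2

rng = random.Random(42)
X_bal, y_bal = smote(X, y, rng)                      # oversampling step
X_clean, y_clean = remove_tomek_links(X_bal, y_bal)  # cleaning step
print("original:", Counter(y))
print("after SMOTE:", Counter(y_bal))
print("after Tomek cleaning:", Counter(y_clean))
```

In practice a library implementation would normally be used instead of hand-written code; for example, the imbalanced-learn package provides a `SMOTETomek` class. Re-sampling should be applied to the training split only, so that the reported metrics are computed on data with the original class distribution.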
Copyright (c) 2024 Dhiaka Shabrina Assyifa, Ardytha Luthfiarta

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with Inform: Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.