Comparison of the Effect of Word Normalization on Naïve Bayes Classifier and K-Nearest Neighbor Methods for Sentiment Analysis


Abstract
Pre-processing for sentiment analysis involves several essential steps, one of which is word normalization: converting non-standard words into their standard forms. However, some sentiment analysis studies skip word normalization, which can affect accuracy. This study compares the effect of word normalization on the Naïve Bayes Classifier and K-Nearest Neighbor methods for sentiment analysis of public opinion on the Social Security Administrator for Health (BPJS Kesehatan). The steps are: collecting the data, labeling it, pre-processing it under two different scenarios, weighting words with TF-IDF, classifying with the Naïve Bayes Classifier and K-Nearest Neighbor, and finally computing accuracy from the confusion matrix. The highest accuracy, 87.14%, was obtained by the Naïve Bayes Classifier in the first scenario, which applies word normalization at the pre-processing stage. These results show that the Naïve Bayes Classifier with word normalization produces better accuracy, precision, recall, and F1-score.
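As a rough illustration of two steps named in the abstract, the sketch below shows dictionary-based word normalization and the computation of accuracy, precision, recall, and F1-score from confusion-matrix counts. It is a minimal sketch under stated assumptions, not the authors' implementation: the lexicon entries, the function names `normalize` and `binary_metrics`, and the label strings are all hypothetical, and the study's actual normalization dictionary and class labels are not given here.

```python
import re

# Hypothetical slang lexicon (illustrative entries only; the study's
# actual normalization dictionary is not given in the abstract).
NORMALIZATION_MAP = {
    "gak": "tidak",
    "bgt": "banget",
    "yg": "yang",
    "udh": "sudah",
}


def normalize(text, lexicon=NORMALIZATION_MAP):
    """Lowercase the text, keep alphabetic tokens, and replace
    non-standard words with their standard form when known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lexicon.get(token, token) for token in tokens]


def binary_metrics(y_true, y_pred, positive="positive"):
    """Derive accuracy, precision, recall, and F1 from the four
    confusion-matrix counts of a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(normalize("Pelayanan BPJS gak bagus bgt"))
    # Toy labels, invented only to exercise the function.
    truth = ["positive", "positive", "negative", "negative"]
    preds = ["positive", "negative", "negative", "negative"]
    print(binary_metrics(truth, preds))
```

In a comparison like the one described, the first scenario would pass each text through something like `normalize` before TF-IDF weighting, while the other scenario would presumably skip that lookup; the same metric function can then be applied to the predictions of both classifiers.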
Copyright (c) 2023 Novrido Charibaldi, Atania Harfiani, Oliver Samuel Simanjuntak

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with Inform: Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.