Semi-supervised Learning Models for Sentiment Analysis on Marketplace Dataset


Abstract
Sentiment analysis categorizes opinions using a model trained on an annotated corpus. However, building a high-quality, fully annotated corpus takes considerable effort, time, and expense. Semi-supervised learning (SSL) reduces this cost by automatically adding training data drawn from unlabeled data, easing a labeling process that otherwise requires human expertise and time. This study aims to develop an SSL model for sentiment analysis and to compare the learning capabilities of Naive Bayes (NB) and Random Forest (RF) within SSL. Our model annotates opinion documents in Indonesian. It uses an ensemble multi-classifier that works on unigram, bigram, and trigram vectors. We test the model on a marketplace dataset of rating comments scraped from Shopee for smartphone products, written in Indonesian. The research consisted of data preparation, vectorization using TF-IDF, feature extraction, modeling using RF and NB, and evaluation using accuracy and F1-score. The NB model outperformed previous research, an improvement of 5.5%. We conclude that SSL performance depends heavily on the amount of training data and on how well the features or patterns in the documents suit the machine learning algorithm. On our marketplace dataset, Random Forest is the better choice.
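The self-training idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay dependency-free it swaps TF-IDF vectors for raw n-gram counts, uses only a small Naive Bayes classifier (no Random Forest), and assumes the ensemble averages the class probabilities of three classifiers (unigram, bigram, trigram). The names `NgramNB`, `ensemble_proba`, `self_train`, the confidence threshold, and the toy Indonesian reviews are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramNB:
    """Multinomial Naive Bayes over word n-grams with add-one smoothing."""

    def __init__(self, n):
        self.n = n

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(ngrams(text, self.n))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict_proba(self, text):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            # +1 in the denominator reserves mass for unseen n-grams.
            denom = sum(self.feat_counts[label].values()) + len(self.vocab) + 1
            score = math.log(count / total)
            for feat in ngrams(text, self.n):
                score += math.log((self.feat_counts[label][feat] + 1) / denom)
            log_scores[label] = score
        # Normalize log scores into probabilities (numerically stable softmax).
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


def ensemble_proba(models, text):
    """Assumed combination rule: average the probabilities of all models."""
    probs = [m.predict_proba(text) for m in models]
    return {label: sum(p[label] for p in probs) / len(probs) for label in probs[0]}


def train_models(labeled):
    texts, labels = zip(*labeled)
    return [NgramNB(n).fit(texts, labels) for n in (1, 2, 3)]


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Move confidently predicted unlabeled texts into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        models = train_models(labeled)
        confident, remaining = [], []
        for text in pool:
            probs = ensemble_proba(models, text)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = remaining
    return train_models(labeled), labeled, pool


# Hypothetical toy reviews; the real dataset is not reproduced here.
LABELED = [
    ("barang bagus sekali", "pos"),
    ("pengiriman cepat barang bagus", "pos"),
    ("barang rusak kecewa", "neg"),
    ("pengiriman lambat kecewa sekali", "neg"),
]
UNLABELED = ["barang bagus pengiriman cepat", "barang rusak pengiriman lambat"]

# A low threshold is used only because the toy corpus is tiny.
models, final_labeled, pool = self_train(LABELED, UNLABELED, threshold=0.65)
for text, label in final_labeled[len(LABELED):]:
    print(label, "<-", text)
```

The threshold controls the trade-off the abstract points to: a high value adds few but reliable pseudo-labels, while a low value grows the training set faster at the risk of reinforcing early mistakes.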
Copyright (c) 2022 Agus Sasmito Aribowo, Wisnalmawati, Yunie Herawati

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with International Journal of Artificial Intelligence & Robotics (IJAIR) agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.