Semi-supervised Learning Models for Sentiment Analysis on Marketplace Dataset

Sentiment analysis aims to categorize opinions, typically by training a model on an annotated corpus. However, building a high-quality, fully annotated corpus takes considerable effort, time, and expense. Semi-supervised learning (SSL) efficiently adds training data automatically from unlabeled data, so it can assist the labeling process, which otherwise requires human expertise and time. This study aims to develop an SSL model for sentiment analysis and to compare the learning capabilities of Naïve Bayes (NB) and Random Forest (RF) within SSL. Our model annotates opinion documents in Indonesian. We use an ensemble multi-classifier that works on unigram, bigram, and trigram vectors. We test the model on a marketplace dataset containing rating comments scraped from Shopee for smartphone products in Indonesian. The research proceeded through data preparation, vectorization using TF-IDF, feature extraction, modeling using Random Forest (RF) and Naïve Bayes (NB), and evaluation using accuracy and F1-score. The NB model outperformed previous research by 5.5%. We conclude that SSL performance depends highly on the amount of training data and on the compatibility of the features or patterns in the documents with the machine learning method. On our three-class marketplace dataset, it is better to use Random Forest, whereas on the two-class dataset Naïve Bayes performs better.


I. INTRODUCTION
Sentiment analysis is a part of Natural Language Processing (NLP) that aims to categorize opinions into positive, negative, or neutral sentiments. Its benefits are widely felt, for example in obtaining sentiment information related to hotels [1], airlines [2], films [3], political events [4], and so on. The results of sentiment classification over a set of documents can be summarized to measure customer satisfaction with the services provided. For example, in the sentence "The plot of this film is not surprising... The actors are not able to reflect the figure of Superman!!", the terms "not surprising" and "not able" reflect negative sentiment. In supervised sentiment analysis, classification into positive or negative is the main task of machine learning: the learner processes a training dataset D = {d1, d2, …, dn} and its associated labels Y = {y1, y2, …, yn} and learns the function f(D; p1, p2, ...) → Y, where p1, p2, ... are model parameters. This method is effective for analyzing sentiment, but it requires a large amount of labeled data. Developing high-quality datasets requires professionals to gather the data and assign labels to it. The machine learning algorithm then reads this dataset to train a classification model.
Most sentiment analysis studies require a fully labeled corpus to train the model, with the labels determined by experts. However, building a high-quality, fully labeled corpus takes a lot of effort, time, and expense, and manual labeling is a strenuous task. Several studies, such as [5]-[7], have addressed the difficulty of manual labeling using semi-supervised learning (SSL), reporting that it can be a faster and cheaper method with high performance for labeling opinion datasets. A semi-supervised learning study using the IMDB dataset is [8], in which a semi-supervised algorithm using deep neural networks with different settings divided the IMDB dataset into 4000 training documents and 36000 unlabeled documents. Their trials obtained accuracy ranging from 81% to 82%, not much different from the baseline (82%). Various types of semi-supervised learning provide better accuracy in [9] and [5]. AraSenCorpus [5] is a semi-supervised framework for annotating a large Arabic text corpus using a small set of manually annotated tweets. This model used FastText and an LSTM deep learning classifier to expand the annotated corpus. For English documents, Balakrishnan et al. propose an SSL approach that uses Support Vector Machine, Random Forest, and Naïve Bayes. In their research, Random Forest reached an F1-score of 73.8% and a Cohen's Kappa of 52.2% [6]. Alahmary proposes a semi-automatic approach to annotating a dataset of Saudi dialect tweets. The accuracy achieved by the Naïve Bayes classifier in their model was 83%. Their model also uses three deep learning classifiers: convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional long short-term memory (Bi-LSTM). In their study, SVM was used as the baseline for comparison. Overall, the deep learning classifiers, especially CNN, exceeded SVM. CNN outperformed the other classifiers with the highest accuracy of 87% [10].
Our research aims to create an SSL model for sentiment classification with only a slight decrease in accuracy and F1-score between the baseline condition and the convergent (final) condition. To this end, we used several strategies to find the SSL model. Continuing our previous research in [11][12], we introduce an SSL model for corpus annotation that uses Naïve Bayes and Random Forest as classifiers. In our SSL, several classifiers work together but independently to expand the annotated corpus. Each classifier works on one type of tokenization: the first on unigrams, the second on bigrams, and the third on trigrams. The first research question is whether the combination of TF-IDF and Random Forest can maintain its accuracy when used in the SSL model, compared to the baseline model. We also compare Random Forest with Naïve Bayes as the machine learning method in SSL. The second question is whether the amount of annotated training data in semi-supervised learning significantly affects the model's accuracy. We used the Marketplace dataset (in Indonesian) to test the model. This paper is organized as follows: Section 1 presents the introduction, research objectives, and related work; Section 2 describes the data collection, pre-processing, vectorization, modeling, and validation methods; Section 3 contains the results and discussion; and Section 4 contains the conclusions.

II. METHOD
In this section, we will go through the data preparation methods, vectorization, feature extraction, modeling with Random Forest, model validation, model architecture, and pseudocode for the model.

A. Data Collection
For the experiments, we used two marketplace datasets in Indonesian. The datasets, MarketData1 and MarketData2, contain shop rating comments scraped from Shopee for smartphone products and consist of 8523 and 5421 review documents, respectively. MarketData1 is a dataset for three-class sentiment classification, manually labeled as positive, neutral, or negative. MarketData2 is a dataset for binary sentiment classification, manually labeled as positive or negative.

B. Data Cleaning and Preprocessing
The marketplace datasets to be analyzed consist of words, numbers, and special symbols. The data are structured in several stages: tokenizing (unigrams, bigrams, and trigrams), converting to lowercase, removing numbers, removing stop words, removing all non-alphabetic characters and punctuation, and stemming.
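The cleaning and tokenizing stages can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the stop-word list and the sample review are hypothetical, and stemming is omitted because it would require an Indonesian stemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; a real run would
# use a full Indonesian stop-word list.
STOP_WORDS = {"yang", "dan", "di", "ini", "itu"}


def clean(text):
    """Lowercase, drop digits/punctuation/symbols, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical review text (not taken from the datasets).
tokens = clean("Barang bagus, pengiriman cepat 2 hari!!")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Each of the three n-gram lists would then feed its own vectorizer and classifier.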

C. Vectorization
TF-IDF is used to calculate the weight of each word in the corpus. The term frequency (TF) of a term is calculated by dividing the number of times the term appears in a document by the total number of terms in that document. The inverse document frequency (IDF) measures how rare a term is across the document collection D. The TF value increases with the frequency of a word's appearances in the document; conversely, the IDF value increases as the number of documents containing the word decreases. The term weights resulting from TF-IDF weighting are converted into vector data. For very large document collections, the features form a high-dimensional matrix because each word that appears in the documents is represented by its score [13]. The TF-IDF vectorizer has been used for sentiment analysis in [14]-[16].
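As a worked illustration of the weighting, the sketch below computes TF-IDF with the textbook definitions: TF is the term count divided by the document length, and IDF is log(N / df), where df is the number of documents containing the term. The exact variant used in the study is not stated, and library vectorizers often apply smoothing and normalization, so the numbers here are illustrative only. The two toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus of two tokenized reviews.
docs = [
    ["barang", "bagus", "bagus"],
    ["barang", "rusak"],
]
w = tf_idf(docs)
```

Here "barang" appears in every document, so its IDF and therefore its weight is zero, while "bagus" and "rusak" receive positive weights because each occurs in only one document.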

D. Ensemble Multi Classifier
We use Random Forest (RF) to build the SSL model. Random Forest creates multiple trees from bootstrapped data samples, splits each node using the best split among a random subset of features selected at that node, and then combines the trees' predictions (Fig. 1). Random Forest has been used for sentiment classification in [18]-[20]. In this research, the number of estimators (trees) in the Random Forest was set to 200.
Naïve Bayes is used in many sentiment analysis studies in Indonesian [20]-[22] and on movie commentary datasets in [23]. It is a well-known machine learning method that is widely used in sentiment analysis and produces high accuracy. Bayes' rule is presented in Equation (1).
$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)} \tag{1}$$
where P(y) is the probability that y is true, P(X) is the probability that X is true, P(y|X) is the probability that y is true given that X is true, and P(X|y) is the probability that X is true given that y is true. Naïve Bayes is suitable for binary and multiclass classification. It is a supervised classification technique that assigns class labels to instances using conditional probabilities. Conditional probability is the probability of an event occurring given that another event has already occurred.
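To make the use of Bayes' rule concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier. It is an assumption-laden illustration: the study does not state which Naïve Bayes variant or smoothing it uses, so add-one (Laplace) smoothing is assumed here, and the training documents are hypothetical. P(X) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts and per-class token counts from tokenized documents."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        token_counts[y].update(doc)
    vocab = {tok for doc in docs for tok in doc}
    return class_counts, token_counts, vocab


def predict_nb(model, doc):
    """Return the class y that maximizes log P(y) + sum of log P(token | y)."""
    class_counts, token_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        total = sum(token_counts[y].values())
        score = math.log(n_y / n_docs)  # log prior, log P(y)
        for tok in doc:
            # Add-one smoothing (assumed) avoids log(0) for unseen tokens.
            score += math.log((token_counts[y][tok] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
model = train_nb(
    [["bagus", "cepat"], ["rusak", "lambat"], ["bagus", "murah"]],
    ["positive", "negative", "positive"],
)
label = predict_nb(model, ["bagus"])
```

The "naïve" part is the assumption that tokens are conditionally independent given the class, which is what allows the per-token log probabilities to be summed.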

E. Validation
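The performance of the SSL model was measured using a confusion matrix, which compares the actual and predicted results (Table 1). This study uses two measurements to validate the model: Accuracy and F1-score. Accuracy in Equation (2) is a useful measure, but only for symmetric datasets where the numbers of false positives and false negatives are almost the same.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
Here TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
F1-score is the harmonic mean of Precision and Recall, as in Equation (3). For unequal class distributions, the F1-score is usually more useful than accuracy.
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
Precision is the degree of match between the information requested by the user and the answers given by the system. Precision, formulated in Equation (4), is the ratio of correctly predicted positives to the total predicted positives.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
Recall (Sensitivity) is the system's success rate in retrieving information. Recall, presented in Equation (5), is the ratio of correctly predicted positives to all actual positives.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$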
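The four measures can be computed directly from the confusion-matrix counts. The sketch below does this for the binary case with the standard definitions; the label names and the small prediction lists are hypothetical. How the per-class scores are averaged for the three-class dataset is not stated in the text, so that step is left out.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical example: TP = 2, FN = 1, FP = 1, TN = 1.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]
m = binary_metrics(y_true, y_pred)
```

With these counts, accuracy is 3/5, and precision, recall, and F1 are all 2/3.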

F. Semi-Supervised Learning Architecture
The proposed SSL model was developed from our previous research in [11][12]. The differences are in the type of machine learning, the datasets, the voting mechanism used to determine the class of each document, and the more varied threshold values. The architecture shown in Figure 1 starts by reading the annotated input dataset. After pre-processing, the clean annotated dataset is divided into unlabeled data, training data, and test data. TF-IDF vectorization converts the training data into three vectors: unigram, bigram, and trigram. These vectors are used to build models using Random Forest (RF) and, in the next experiment, Naïve Bayes (NB). The result is three models that work separately (using the ensemble stacking mechanism). The unlabeled data are also vectorized with TF-IDF and then annotated by each of the three models, so each document receives three pseudo-labels. A label is considered high confidence if the summed weight of the models supporting it, divided by the total weight of all models, is higher than a threshold. The threshold is used to decide whether the annotated data (with pseudo-labels) are worthy of becoming training data. High-confidence documents are integrated into the training data. Documents categorized as low confidence are re-labeled in the next iteration. Iterations in our SSL model run ten times or until the unlabeled data run out. The model's output is the training data (DT), labeled by both humans and machines.
Fig. 2 shows the pseudocode of our model. The pseudocode begins by setting the threshold. Lines 2-4 read the training data (DT), test data (DTest), and unlabeled data (UN). The training, test, and unlabeled datasets are tokenized into unigrams, bigrams, and trigrams and vectorized using TF-IDF (lines 6-8). The classifier models are built from the three training sets using machine learning (RF or NB) on lines 10-12. The annotation process is on lines 14-16. Lines 18-28 test whether each newly annotated document qualifies as training data. The process begins by checking whether the newly annotated document tends to be positive, negative, or neutral based on the pseudo-label weights (lines 26-28). If the weight is more than the threshold, the document is eligible to become training data. If not, it is retested in the next iteration.
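The weighted vote and the iteration described above can be sketched as follows. This is a simplified illustration rather than the pseudocode of Fig. 2: the function names are hypothetical, equal model weights are assumed because the text does not say how weights are assigned, and the three "fitters" are stub classifiers standing in for the unigram, bigram, and trigram TF-IDF models.

```python
def vote(predictions, weights, threshold):
    """Return (label, share); label is None when confidence is too low."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    best = max(totals, key=totals.get)
    share = totals[best] / sum(weights)  # supporting weight / total weight
    return (best if share > threshold else None), share


def self_train(train, unlabeled, fitters, weights, threshold, max_iter=10):
    """train: list of (doc, label); fitters: train -> predict(doc) functions."""
    train, unlabeled = list(train), list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:
            break
        models = [fit(train) for fit in fitters]  # refit on the grown set
        still_unlabeled = []
        for doc in unlabeled:
            label, _ = vote([m(doc) for m in models], weights, threshold)
            if label is None:
                still_unlabeled.append(doc)  # retried in the next iteration
            else:
                train.append((doc, label))   # accepted as pseudo-labeled data
        unlabeled = still_unlabeled
    return train, unlabeled


# Stub fitters (hypothetical): they ignore the training data entirely.
def keyword_fit(train):
    return lambda doc: "positive" if "bagus" in doc else "negative"


def pessimist_fit(train):
    return lambda doc: "negative"


labeled = [(["bagus", "cepat"], "positive")]
pool = [["bagus"], ["rusak"]]
fitters = [keyword_fit, keyword_fit, pessimist_fit]
train_lo, left_lo = self_train(labeled, pool, fitters, [1, 1, 1], threshold=0.6)
train_hi, left_hi = self_train(labeled, pool, fitters, [1, 1, 1], threshold=0.7)
```

With a threshold of 0.6, a two-out-of-three vote (share 0.67) is accepted and the pool empties; with 0.7, the same document stays unlabeled, which illustrates why lower thresholds yield more pseudo-labeled data.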

III. RESULTS AND DISCUSSION

A. Testing the SSL Model using Market Dataset 1
For the experiment, the data are coded as D1, D2, D3, and D4. We randomly divided the dataset into training data and test data in a 9:1 ratio. The number of labeled test documents for each of D1, D2, D3, and D4 is 850 (approximately 10% of all documents). The numbers of labeled training documents (the annotated dataset) in D1, D2, D3, and D4 are 1700, 850, 425, and 212, respectively. The leftover training data were used as the unlabeled dataset. The baseline model in D1, D2, D3, and D4 was built with the labeled training data only and tested using the labeled test data. Table 2 shows the accuracy and F1-score of the baseline and semi-supervised learning (SSL) models in D1, D2, D3, and D4 under different threshold values.
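The four conditions can be reproduced in outline with a simple random split. The sketch below is an assumption-based illustration: the function name and seed are hypothetical, integer IDs stand in for the 8523 review documents, and the unlabeled pool sizes are derived by arithmetic on the assumption that every remaining document goes into the pool. Using one seed also assumes the same test set is shared across D1 to D4, which the text does not confirm.

```python
import random


def split_dataset(docs, n_test, n_labeled, seed=0):
    """Shuffle once, then cut into test, labeled-training, and unlabeled parts."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    labeled = shuffled[n_test:n_test + n_labeled]
    unlabeled = shuffled[n_test + n_labeled:]
    return test, labeled, unlabeled


docs = list(range(8523))  # stand-in IDs for the Market Dataset 1 documents
sizes = {}
for name, n_labeled in [("D1", 1700), ("D2", 850), ("D3", 425), ("D4", 212)]:
    test, labeled, unlabeled = split_dataset(docs, 850, n_labeled)
    sizes[name] = (len(test), len(labeled), len(unlabeled))
```

Under these assumptions, the unlabeled pool grows from 5973 documents in D1 to 7461 in D4 as the labeled set shrinks.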
Several insights were gained from the 64 SSL models. First, the baseline classification results show that accuracy and F1-score increase with the number of training instances. The baseline accuracy and F1-score of the Random Forest models are higher than those of Naïve Bayes. Second, the semi-supervised learning results show that accuracy and F1-score also tend to increase with the number of training instances but decrease as the threshold rises. The threshold strongly influences SSL accuracy: a low threshold yields high accuracy and a high F1-score. The reason is that a low threshold produces more pseudo-labeled data than a high threshold, so the classifier formed in the next iteration learns from more data than one formed from only a few pseudo-labeled documents. In general, the accuracy and F1-score of the SSL Random Forest model are higher than those of Naïve Bayes. Third, the difference between the baseline accuracy and the average accuracy of the SSL model is 0.05 for Random Forest, smaller than the 0.06 for Naïve Bayes. The difference between the baseline F1-score and the average F1-score of the SSL model is 0.04 for Random Forest, which is better than the 0.05 for the Naïve Bayes SSL model. This means that Random Forest is better than Naïve Bayes at maintaining accuracy through the SSL process. In some cases, the accuracy and F1-score of the SSL Random Forest model are even higher than the baseline (highlighted).

B. Testing the SSL Model using Market Dataset 2
As in the previous experiment, four conditions of Market Dataset 2 are coded D1, D2, D3, and D4. We again divided the dataset into training data and test data in a 9:1 ratio. The number of labeled test documents for each of D1, D2, D3, and D4 is 540 (10% of Market Dataset 2). The numbers of labeled training documents in D1, D2, D3, and D4 are 1080, 540, 270, and 135, respectively. The leftover training data are used as the unlabeled dataset. Table 3 shows the accuracy and F1-score of the baseline and semi-supervised learning (SSL) models in D1, D2, D3, and D4 under different threshold values. Table 3 covers 64 SSL model runs on Market Dataset 2 and gives different results from Market Dataset 1. First, the baseline classification results show that accuracy and F1-score do not increase consistently with the number of training instances: in D2, the accuracy and F1-score are lower than in D3 and D4. The baseline accuracy and F1-score of the Naïve Bayes models are higher than those of Random Forest. Second, the semi-supervised learning results show that accuracy and F1-score tend to increase with the number of training instances but decrease as the threshold rises. The threshold also influences SSL accuracy: a low threshold yields high accuracy and a high F1-score because it produces more pseudo-labeled data than a high threshold. In general, the accuracy and F1-score of the SSL Naïve Bayes model are higher than those of the Random Forest model. Third, the difference between the baseline accuracy and the average accuracy of the SSL model is the same for Naïve Bayes and Random Forest (0.07). The difference between the baseline F1-score and the average F1-score of the SSL model is 0.08 for Naïve Bayes, which is better than the 0.1 for the Random Forest SSL model. This means that on Market Dataset 2, Naïve Bayes is better than Random Forest at maintaining accuracy through the SSL process.

C. Comparison with Previous Research
We compare our SSL model with previous studies that use the same types of machine learning (NB and RF). On Market Dataset 2, our F1-score reached 0.76 for NB and 0.71 for RF. The NB model thus outperformed the F1-score of Balakrishnan et al. (70.5%). Our models also outperformed [24], whose F1-scores were 57.16 (NB) and 59.34 (RF).

IV. CONCLUSION
This study presents an SSL model for sentiment analysis to label Market Data 1 and Market Data 2. On Market Data 2, the F1-score reached 0.76 for NB and 0.71 for RF. We conclude that SSL performance depends highly on the amount of training data and on the compatibility of the features or patterns in the documents with the machine learning method. On Market Data 1, a dataset with three classes, it is better to use Random Forest (F1-score of 0.65 for RF and 0.62 for NB). On Market Data 2, which consists of two classes, it is better to use Naïve Bayes (F1-score of 0.71 for RF and 0.76 for NB). Future research will test sentiment analysis using SSL on several other datasets and with other types of machine learning.