Emotion Detection in Twitter Social Media Using Long Short-Term Memory (LSTM) and FastText


 
 
 
Emotion detection is important in various fields such as education, business, and employee recruitment. In this study, emotions are detected from text taken from Twitter, because social media encourages users to express emotions through text posts, and Twitter is one of the social media platforms with the highest user growth rate in Indonesia. This study uses the LSTM method because it has proven better than the methods of previous studies. The FastText word embedding is also used, to address the out-of-vocabulary (OOV) problem that Word2Vec and GloVe cannot handle. The best accuracy obtained for each word embedding is as follows: Word2Vec 73.15%, GloVe 60.10%, and FastText 73.15%. The conclusion of this study is that the best accuracy was obtained by Word2Vec and FastText. FastText has the advantage of handling the OOV problem, but in this study it could not improve on the accuracy of Word2Vec. This study has not produced very good accuracy, which is attributable to the data used. For even better results, future work is expected to apply other deep learning methods, such as CNN and BiLSTM, and to use more data.
 
 
 



II. METHOD
The flow of the research method can be seen in Fig. 1. The research method contains the steps taken in the study, from data collection to testing, from which the research conclusions are finally drawn.

A. Data Collection and Labelling Data
The dataset used in this study came from Twitter: tweets from several influencers as the primary data and one week of trending data as supporting data. Influencers were chosen because becoming an influencer requires a process of self-disclosure on social media, and one form of self-disclosure is expressing emotions [26]. The data were collected by web scraping, using the Python programming language with the Selenium library, producing a dataset in *.csv/*.xlsx file format. The dataset contains 1304 tweets: 250 happy, 250 sad, 200 scared, 200 disgusted, 204 angry, and 201 shocked. The data were then labeled by students who actively use Twitter, so that the data used for model training were not subjective. An example of the dataset in this study can be seen in Table I.
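As a minimal sketch of the *.csv layout the scraping step could produce (the Selenium scraping itself is omitted here, and the example tweets, labels, and column names are invented for illustration):

```python
import csv
from collections import Counter
from io import StringIO

# hypothetical rows in a (text, label) layout like the dataset described above
rows = [
    ("hore, akhirnya timnas indonesia menang dengan skor 2-0 !!", "happiness"),
    ("sedih banget hari ini", "sadness"),
    ("kaget lihat beritanya", "surprise"),
]

# write the rows in CSV form, then read them back and count labels per class
buffer = StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label"])
writer.writerows(rows)

buffer.seek(0)
reader = csv.DictReader(buffer)
labels = Counter(row["label"] for row in reader)
print(labels)
```

In the study itself the class counts are checked the same way, since the six emotion classes are deliberately kept close to balanced (200-250 tweets each).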

B. Preprocessing
The first step after labeling the data is preprocessing. Data preprocessing often affects the performance of machine learning, making it more effective [25]. Preprocessing cleans the data of noise and uninformative parts that are not needed, so that the text is ready to be classified [10], and so that the processes of creating word vectors and classifying are more accurate [20]. The preprocessing used in this study consists of case folding, punctuation removal, number removal, tokenizing, stop word removal, and stemming:
• Case folding changes all letters in the text, from "a" to "z", to lowercase [27]. Case folding standardizes the writing.
• Punctuation removal deletes punctuation from the text to reduce the burden on classification processing, since punctuation is considered unimportant and serves only as delimiters. Examples of punctuation removed are the period (.), comma (,), question mark (?), slash (/), hashtag (#), and exclamation point (!).
• Number removal deletes numbers from the text; like punctuation, numbers are considered meaningless delimiters, and only the objects deleted differ.
• Tokenizing, or tokenization, cuts sentences into their constituent words, usually splitting on whitespace such as spaces, tabs, and newlines. Each resulting word is called a token [15].
• Stop word removal, or filtering, removes words whose removal does not change the meaning of the text [15]. It reduces the index size, the classification processing time, and the noise in the data. Stop words are usually pronouns and conjunctions, such as "aku", "kamu", "kita", "dan", and "atau". In this study, case folding, punctuation removal, number removal, and stop word removal use the Natural Language Toolkit (NLTK) library with an Indonesian stop word list.
• Stemming, or lemmatization, transforms the inflected words in a document into their root words by removing prefixes, suffixes, and infixes. Stemming aims to reduce the number of word variants with almost the same meaning in a document, which improves performance at the information retrieval stage. In this study, stemming uses the Sastrawi library, which applies the Nazief and Adriani algorithm, because the data are in Indonesian. Examples of preprocessed data can be seen in Table II:
Text after case folding: "hore, akhirnya timnas indonesia menang dengan skor 2-0 !!"
Text after removing punctuation: "hore akhirnya timnas indonesia menang dengan skor 20"
Text after removing numbers: "hore akhirnya timnas indonesia menang dengan skor"
Text after tokenizing: ["hore", "akhirnya", "timnas", "indonesia", "menang", "dengan", "skor"]
Text after stop word removal: ["hore", "akhirnya", "timnas", "indonesia", "menang", "skor"]
Text after stemming: ["hore", "akhir", "timnas", "indonesia", "menang", "skor"]
Text after preprocessing: ["hore", "akhir", "timnas", "indonesia", "menang", "skor"]
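The pipeline above, up to but not including stemming, can be sketched in plain Python. The stop word list here is a small illustrative sample, whereas the study uses NLTK's Indonesian list:

```python
import re
import string

# small illustrative sample; the study uses NLTK's Indonesian stop word list
STOPWORDS = {"aku", "kamu", "kita", "dan", "atau", "dengan"}

def preprocess(text):
    text = text.lower()                                               # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    tokens = text.split()                                             # tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]                  # stop word removal

print(preprocess("hore, akhirnya timnas indonesia menang dengan skor 2-0 !!"))
# ['hore', 'akhirnya', 'timnas', 'indonesia', 'menang', 'skor']
```

The study then applies the Sastrawi stemmer to these tokens, which would further map "akhirnya" to its root "akhir", as shown in the Table II example.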

C. Word Embedding
The results of preprocessing then enter the word embedding process. Word embedding is a technique for mapping words, based on an existing dictionary, to numeric vectors of real numbers [11]. Research conducted by Utomo (2020) found that the LSTM algorithm achieves better accuracy with word embedding, 86.76%, than without it, 84.14% [29]. This study tests three word embeddings: Word2Vec, GloVe, and FastText. Comparing the word embeddings makes it possible to conclude, at the end of the study, whether FastText can improve accuracy. Word2Vec and FastText are implemented using the Gensim library, while GloVe uses pre-trained GloVe vectors. Each word embedding uses 100 dimensions.
Word2Vec is a word embedding method introduced by Mikolov et al. in 2013. Word2Vec captures the similarity of word meanings by paying attention to the words surrounding the target word [16]; this advantage has made the method very popular. Word2Vec has two techniques: the continuous bag of words (CBOW) and the skip-gram model. In this study, the Word2Vec model used is CBOW. Examples of Word2Vec results as 5-dimensional vectors, for the same sentence as in the preprocessing example, can be seen in Table III. Global Vectors for Word Representation (GloVe) is a word embedding method that relies on word co-occurrence, i.e., statistics on the occurrence of words in a corpus, which are captured directly by the model to obtain semantic relationships between the words in the corpus. GloVe uses the global matrix factorization method, representing the number of occurrences, or frequencies, in a corpus [21].
FastText is a word embedding method developed from Word2Vec. FastText has the advantage of handling the out-of-vocabulary (OOV) problem, which the Word2Vec and GloVe word embeddings cannot solve. FastText learns word representations by paying attention to sub-word information, feeding character n-grams into the skip-gram model. This lets FastText capture shorter words and understand word suffixes and prefixes [3]. Examples of FastText results as 5-dimensional vectors, for the same sentence as in the preprocessing example, can be seen in Table IV.
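As an illustration of the sub-word mechanism that gives FastText its OOV advantage, the sketch below extracts character n-grams with boundary markers. FastText actually uses a range of n (typically 3 to 6); a single n = 3 is used here for brevity:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with '<' and '>' boundary markers,
    in the style of FastText's sub-word decomposition."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("menang"))
# ['<me', 'men', 'ena', 'nan', 'ang', 'ng>']
```

Because a word vector is composed from the vectors of its n-grams, an unseen word such as a misspelling of "menang" still shares most n-grams with the training vocabulary, so FastText can assemble a vector for it, whereas Word2Vec and GloVe have no vector at all for an OOV word.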

D. Long Short-Term Memory (LSTM)
The results of word embedding then enter the LSTM modeling process. LSTM is a deep learning algorithm developed from the RNN architecture. RNNs have a vanishing gradient problem; LSTM solves it with memory cells and gate units (input gate, forget gate, output gate), so that LSTM can read, store, and update information [22]. In this study, LSTM was implemented using the Keras library. The LSTM architecture can be seen in Fig. 2 [17]. There are three types of gates in the LSTM: the forget gate, the input gate, and the output gate. The forget gate determines which information is removed from the cell. The input gate determines which input values are updated in the memory state. The output gate determines the output based on the input and the memory in the cell [29]. The LSTM steps are as follows [14]:
• In the first step, illustrated in Fig. 3 (Forget Gate [17]), the LSTM determines which information should be removed from the cell state; this part is called the forget gate. It processes the output of the previous step (ht-1) and the input (xt) with the sigmoid activation function, producing a value between 0 and 1 for each number in the cell state Ct-1: a value of 0 means the information is removed, while a value of 1 means it is preserved. Following Hochreiter and Schmidhuber (1997), the forget gate is described in equation (1).
• In the second step, illustrated in Fig. 4 [17], the LSTM determines which information to add to the cell state. This step has two parts: first, a sigmoid layer called the input gate layer determines which values to update; next, a tanh layer creates a vector of new candidate values (C̃t) to be inserted into the cell state. The outputs of the input gate layer and the tanh layer are combined to update the cell state. Following Hochreiter and Schmidhuber (1997), the input gate layer and the candidate values are described in equation (2) and equation (3).
• In the third step, illustrated in Fig. 5, the old cell state Ct-1 is updated to the new cell state Ct. The previous step produced the input gate value and the candidate values. The old cell state Ct-1 is multiplied by the forget gate value ft, and this product is added to the product of the input gate value and the candidate values C̃t, yielding the new cell state Ct. The equation for the new cell state Ct, following Hochreiter and Schmidhuber (1997), can be seen in equation (4).
• In the final step, illustrated in Fig. 6, the LSTM determines the output of the whole process (ht). First, a sigmoid layer uses the previous output (ht-1) and the input (xt) to determine the output gate value (Ot), between 0 and 1, which selects the part of the cell state that becomes the output. The cell state (Ct) is then passed through the tanh activation function, giving a value between -1 and 1, and multiplied by the output gate value (Ot) to produce the output (ht). Following Hochreiter and Schmidhuber (1997), this process is described in equation (5) and equation (6). The LSTM also has a dense layer, which determines the classification result. The number of units in the dense layer used in this study is 6, matching the number of classes. The dense layer takes as input the final output (ht) of the last time step. The dense layer formula can be seen in equation (7).
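The text references equations (1)-(7) without reproducing them. As a reading aid, the standard LSTM formulation of Hochreiter and Schmidhuber (1997), which this numbering presumably follows, can be written as below, with σ the sigmoid function and ⊙ element-wise multiplication; the softmax form of the dense layer in (7) is our assumption for a multi-class output:

```latex
\begin{align}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && (1)\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && (2)\\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && (3)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && (4)\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && (5)\\
h_t &= o_t \odot \tanh(C_t) && (6)\\
y &= \mathrm{softmax}(W_y h_t + b_y) && (7)
\end{align}
```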
LSTM has several parameters, including the number of units/neurons, the activation function, the optimizer, the number of epochs, and the dropout rate. This research uses 50 to 200 units/neurons and dropout rates of 20, 30, and 50. Dropout randomly disconnects some connections between units to significantly reduce overfitting. The activation function used is sigmoid, the optimizer is Adaptive Moment Estimation (ADAM), and the number of epochs is 50. The loss function is categorical cross-entropy, because there are more than two classes in this study. The data are split into 70% training data and 30% testing data, because 70:30 and 80:20 splits can produce the best models [9]. Testing the unit and dropout parameters is useful for finding the best architecture.
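The gate computations described in the steps above can be sketched as a single LSTM time step in NumPy. The stacked weight layout and the toy dimensions are illustrative assumptions, not the internals of the Keras implementation used in the study:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following equations (1)-(6).
    W, U, b hold the parameters of the four gates stacked in the
    order: forget f, input i, candidate g, output o."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # stacked pre-activations, shape (4n,)
    f = sigmoid(z[0:n])                    # forget gate, eq. (1)
    i = sigmoid(z[n:2 * n])                # input gate, eq. (2)
    g = np.tanh(z[2 * n:3 * n])            # candidate cell state, eq. (3)
    o = sigmoid(z[3 * n:4 * n])            # output gate, eq. (5)
    c_t = f * c_prev + i * g               # new cell state, eq. (4)
    h_t = o * np.tanh(c_t)                 # hidden state / output, eq. (6)
    return h_t, c_t

# tiny demo: 2 hidden units, 3 input features, randomly initialized weights
rng = np.random.default_rng(0)
n, d = 2, 3
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

Since ht = Ot ⊙ tanh(Ct) with Ot in (0, 1) and tanh in (-1, 1), every component of the hidden state stays strictly inside (-1, 1), which is the property the dense layer relies on.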

E. Testing
The last stage is testing the architecture that has been built and comparing the word embeddings; the tests use a confusion matrix to obtain accuracy, precision, recall, and F1-score values. According to [12], the confusion matrix is a tool for analyzing how well a classifier recognizes tuples of different classes. In this study, the confusion matrix is implemented using sklearn.metrics. Four terms in the confusion matrix represent the results: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). An illustration of the confusion matrix can be seen in Table V. The equations for accuracy, precision, and recall are as follows [12]:
• Accuracy represents how close the predicted data are to the actual data. The accuracy formula can be seen in equation (8),
where TP is the number of positive data predicted correctly, TN is the number of negative data predicted correctly, FP is the number of negative data predicted incorrectly, and FN is the number of positive data predicted incorrectly.
• Precision is the ratio of data correctly predicted positive to all data predicted positive. The precision formula can be seen in equation (9).
• Recall is the ratio of data correctly predicted positive to all truly positive data. The recall formula can be seen in equation (10).
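Equations (8)-(10) can be computed directly from the TP/TN/FP/FN counts; the toy counts below are invented for illustration:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # equation (8)

def precision(tp, fp):
    return tp / (tp + fp)                    # equation (9)

def recall(tp, fn):
    return tp / (tp + fn)                    # equation (10)

# toy counts for a single class of a confusion matrix
tp, tn, fp, fn = 40, 45, 10, 5
print(accuracy(tp, tn, fp, fn))        # 0.85
print(precision(tp, fp))               # 0.8
print(round(recall(tp, fn), 3))        # 0.889
```

In a multi-class setting such as the six emotion classes here, these per-class values are obtained by treating each class in turn as "positive", which is what sklearn.metrics does internally.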

III. RESULT AND DISCUSSION
Below are the results for the parameters tested: 50 to 200 units/neurons with dropout rates of 20, 30, and 50. In this section, the Word2Vec, GloVe, and FastText word embeddings are compared, in order to see whether FastText affects the accuracy of the results and to draw conclusions at the end of the study. The results can be seen in Table VI, Table VII, and Table VIII.
In the Word2Vec word embedding test table, the highest accuracy, 0.731458, is obtained by the architecture with 50 units and a dropout of 50. For the Word2Vec training process, the relationship between loss and validation loss can be seen in Fig. 7(a), and the relationship between accuracy and validation accuracy in Fig. 7(b). The results of the confusion matrix test on the Word2Vec architecture can be seen in Table IX.
In the GloVe word embedding test table, the highest accuracy, 0.601023, is obtained by the architecture with 50 units and a dropout of 30. For the GloVe training process, the relationship between loss and validation loss can be seen in Fig. 8(a), and the relationship between accuracy and validation accuracy in Fig. 8(b). The results of the confusion matrix test on the GloVe architecture can be seen in Table X.

TABLE X
RESULT CONFUSION MATRIX LSTM AND GLOVE

            happiness  sadness  fear  disgust  anger  surprise
happiness          40        9     2        4      8        11
sadness             8       47     4        4      9        10
fear                6        3    40        4      5         4
disgust             5        2     2       39      2         0
anger               3        8     0        4     34         7
surprise            9        6     4        5      8        35

Finally, for the FastText word embedding, the best architecture has 50 units and a dropout of 50, resulting in an accuracy of 0.731458. For the FastText training process, the relationship between loss and validation loss can be seen in Fig. 11, and the relationship between accuracy and validation accuracy in Fig. 12.
Meanwhile, the confusion matrix test results on the FastText architecture can be seen in Table XI.

IV. CONCLUSION
In this study, the best accuracy, precision, recall, and F1-score results were obtained with the following LSTM and word embedding architectures: LSTM with Word2Vec, using 50 units and a dropout of 50, resulted in 73.15%; LSTM with GloVe, using 50 units and a dropout of 30, resulted in 60.1%; and LSTM with FastText, whose best architecture also has 50 units and a dropout of 50, resulted in 73.15%. The best accuracy is thus obtained by LSTM with Word2Vec and LSTM with FastText. FastText has the advantage of handling the out-of-vocabulary problem, but in this study it could not improve on the accuracy of Word2Vec. This study has not produced very good accuracy, which is attributable to the data used. For even better results, future work is expected to apply other deep learning methods, such as CNN and BiLSTM, and to use more data.