Clickbait Detection of Indonesian News Headlines using Fine-Tuned Bidirectional Encoder Representations from Transformers (BERT)


Abstract
News article headlines that do not match their content, known as clickbait, seriously interfere with readers getting the information they expect, and the number of clickbait news articles has increased significantly in recent years. To address this problem, a clickbait detector is needed that automatically identifies whether a news headline is clickbait or non-clickbait. Many existing solutions rely on handcrafted features and traditional machine learning methods, which limit generalization. Therefore, this study fine-tunes Bidirectional Encoder Representations from Transformers (BERT) on the Indonesian news headline dataset CLICK-ID to predict clickbait. We use IndoBERT, a state-of-the-art BERT-based language model for Indonesian, as the pre-trained model. The usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers built from different pre-trained models against two word-vector-based approaches (i.e., bag-of-words and TF-IDF) combined with five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vector-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERT-BASE classifier trained with the two-phase training model achieves the highest accuracy of 0.8247, which is 0.064 (6.4%) higher than the accuracy of the SVM classifier with the bag-of-words model (0.7607).
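
As a rough illustration of the fine-tuning setup described above, the sketch below fine-tunes an IndoBERT checkpoint for binary headline classification with the Hugging Face Transformers Trainer. The checkpoint name (indobenchmark/indobert-base-p1), the CSV file name, the column names ("title", "label"), and all hyperparameters are illustrative assumptions, not details taken from the paper.

    # Minimal fine-tuning sketch (assumptions: Hugging Face transformers/datasets installed,
    # CLICK-ID exported to "click_id.csv" with columns "title" and "label" in {0, 1},
    # and the public IndoBERT checkpoint "indobenchmark/indobert-base-p1").
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    df = pd.read_csv("click_id.csv")  # hypothetical export of the CLICK-ID dataset
    train_df, test_df = train_test_split(df, test_size=0.2,
                                         stratify=df["label"], random_state=42)

    tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

    def tokenize(batch):
        # Headlines are short, so a small max_length keeps training cheap.
        return tokenizer(batch["title"], truncation=True,
                         padding="max_length", max_length=64)

    train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
    test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "indobenchmark/indobert-base-p1", num_labels=2)  # clickbait vs. non-clickbait

    args = TrainingArguments(output_dir="indobert-clickbait",
                             num_train_epochs=3,
                             per_device_train_batch_size=32,
                             learning_rate=2e-5)

    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=test_ds)
    trainer.train()
    print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy

The word-vector baselines from the comparison can be sketched in a few lines with scikit-learn, reusing the split above; the choice of LinearSVC with default hyperparameters is also an assumption rather than the paper's exact configuration.

    # Bag-of-words and TF-IDF baselines with an SVM, reusing train_df/test_df from above.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    for name, vectorizer in [("bag-of-words", CountVectorizer()),
                             ("TF-IDF", TfidfVectorizer())]:
        X_train = vectorizer.fit_transform(train_df["title"])
        X_test = vectorizer.transform(test_df["title"])
        clf = LinearSVC().fit(X_train, train_df["label"])
        print(name, accuracy_score(test_df["label"], clf.predict(X_test)))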