Clickbait Detection of Indonesian News Headlines using Fine-Tuned Bidirectional Encoder Representations from Transformers (BERT)


Abstract
News article headlines that do not match their content, known as clickbait, seriously interfere with readers getting the information they expect, and the number of clickbait news articles has increased significantly in recent years. To address this problem, a clickbait detector is needed that automatically identifies whether a news headline is clickbait or non-clickbait. Many existing solutions rely on handcrafted features and traditional machine learning methods, which limit generalization. Therefore, this study fine-tunes Bidirectional Encoder Representations from Transformers (BERT) on the Indonesian news headline dataset CLICK-ID to predict clickbait. We use IndoBERT, a state-of-the-art BERT-based language model for Indonesian, as the pre-trained model. The usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers built from different pre-trained models against two word-vector-based approaches (i.e., bag-of-words and TF-IDF) combined with five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vector-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERT-BASE classifier trained with the two-phase training model achieves the highest accuracy of 0.8247, which is 0.064 (6.4%) higher than the accuracy of the SVM classifier with the bag-of-words model (0.7607).
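
As a rough illustration of the fine-tuning setup described above, the sketch below fine-tunes an IndoBERT checkpoint for binary headline classification with the Hugging Face Transformers Trainer. The checkpoint name (indobenchmark/indobert-base-p1), the CSV file name, the column names ("title", "label"), and all hyperparameters are illustrative assumptions, not details taken from the paper.

    # Minimal fine-tuning sketch (assumptions: Hugging Face transformers/datasets installed,
    # CLICK-ID exported to "click_id.csv" with columns "title" and "label" in {0, 1},
    # and the public IndoBERT checkpoint "indobenchmark/indobert-base-p1").
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    df = pd.read_csv("click_id.csv")  # hypothetical export of the CLICK-ID dataset
    train_df, test_df = train_test_split(df, test_size=0.2,
                                         stratify=df["label"], random_state=42)

    tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

    def tokenize(batch):
        # Headlines are short, so a small max_length keeps training cheap.
        return tokenizer(batch["title"], truncation=True,
                         padding="max_length", max_length=64)

    train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
    test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "indobenchmark/indobert-base-p1", num_labels=2)  # clickbait vs. non-clickbait

    args = TrainingArguments(output_dir="indobert-clickbait",
                             num_train_epochs=3,
                             per_device_train_batch_size=32,
                             learning_rate=2e-5)

    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=test_ds)
    trainer.train()
    print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy

The word-vector baselines from the comparison can be sketched in a few lines with scikit-learn, reusing the split above; the choice of LinearSVC with default hyperparameters is also an assumption rather than the paper's exact configuration.

    # Bag-of-words and TF-IDF baselines with an SVM, reusing train_df/test_df from above.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    for name, vectorizer in [("bag-of-words", CountVectorizer()),
                             ("TF-IDF", TfidfVectorizer())]:
        X_train = vectorizer.fit_transform(train_df["title"])
        X_test = vectorizer.transform(test_df["title"])
        clf = LinearSVC().fit(X_train, train_df["label"])
        print(name, accuracy_score(test_df["label"], clf.predict(X_test)))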