Analysis of DistilBERT with Support Vector Machine (SVM) for Hate Speech Classification on the Twitter Social Media Platform

  • Naufal Azmi Verdikha, Universitas Muhammadiyah Kalimantan Timur
  • Reza Habid, Universitas Muhammadiyah Kalimantan Timur
  • Asslia Johar Latipah, Universitas Muhammadiyah Kalimantan Timur
Keywords: hate speech, SVM classification, PCA dimension reduction, F1-Score

Abstract

Hate speech is a significant problem in content moderation on social media platforms. Classifying it effectively helps keep social media environments safe, counter discrimination, and protect users. This study evaluates a hate speech classification model based on a Support Vector Machine (SVM) with linear and polynomial kernels. The dataset consists of labeled Indonesian-language tweets. DistilBERT is used as the feature extraction method. Because DistilBERT produces high-dimensional features, dimensionality reduction is needed to lower model complexity. Principal Component Analysis (PCA) is therefore applied in five scenarios, reducing the features to 10, 20, 30, 40, and 50 dimensions. Performance is measured with the F1-Score, and every scenario is evaluated using 10-fold cross-validation. With the linear kernel, the model achieves its highest F1-Score, 0.75, in the 50-dimensional scenario. With the polynomial kernel, it achieves its highest F1-Score, 0.7857, also in the 50-dimensional scenario. These findings indicate that, among the configurations tested, the polynomial kernel with 50 dimensions yields the best performance in classifying hate speech.
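The evaluation protocol described in the abstract (PCA to 10, 20, 30, 40, and 50 dimensions, an SVM with linear and polynomial kernels, F1-Score under 10-fold cross-validation) can be sketched roughly as follows. This is an illustrative sketch and not the authors' code. It makes several assumptions: the 768-dimensional vectors are synthetic stand-ins for DistilBERT features, generated with scikit-learn's make_classification; PCA is refit inside each fold; the folds are stratified and shuffled; and the polynomial kernel uses scikit-learn's default degree of 3. None of these details is stated in the abstract, and the function name evaluate is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# ASSUMPTION: synthetic 768-dimensional vectors stand in for DistilBERT
# sentence features; the real study uses labeled Indonesian tweets.
X, y = make_classification(
    n_samples=400, n_features=768, n_informative=40,
    n_redundant=0, random_state=0,
)


def evaluate(X, y, kernels=("linear", "poly"), dims=(10, 20, 30, 40, 50), folds=10):
    """Return the mean F1-Score for every (kernel, PCA dimension) scenario."""
    # ASSUMPTION: stratified, shuffled folds; the abstract only says "10-fold".
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    results = {}
    for kernel in kernels:
        for d in dims:
            # PCA sits inside the pipeline, so it is fit on the training
            # part of each fold only (an assumption about the protocol).
            model = make_pipeline(
                PCA(n_components=d, random_state=0),
                SVC(kernel=kernel),  # "poly" uses degree=3 by default
            )
            scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
            results[(kernel, d)] = float(scores.mean())
    return results


results = evaluate(X, y)
for (kernel, d), f1 in sorted(results.items()):
    print(f"{kernel:>6} kernel, {d:2d} PCA dims: mean F1 = {f1:.4f}")
```

The scores printed by this sketch come from synthetic data and are not comparable to the 0.75 and 0.7857 reported above. To approximate the study, X would be replaced with features extracted by a DistilBERT model and y with the tweet labels.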

Published
2023-12-30