Comparative Performance of Machine Learning Algorithms for Detecting Online Gambling Promotional Comments on YouTube

Authors

  • Michael Angelo, STMIK Time, Indonesia
  • Robet, STMIK Time
  • Jackri Hendrik, STMIK Time

DOI:

https://doi.org/10.26905/jtmi.v11i2.16286

Keywords:

Machine Learning, Comment Detection, Pseudo-Labeling, Online Gambling Promotion

Abstract

Online gambling promoters increasingly exploit YouTube comment sections, using text obfuscation, Unicode characters, emojis, irregular spacing, and symbols to evade automated moderation. This study aims to identify the most effective machine learning algorithm for detecting such promotional comments by comparing models on standard metrics (precision, recall, F1-score, and accuracy). We employ semi-supervised pseudo-labeling to expand the labeled set from 1,648 to 9,111 comments without additional manual annotation, admitting only high-confidence predictions. The preprocessing pipeline comprises customized character normalization, selective cleaning, tokenization, stopword removal, and Nazief–Adriani stemming, followed by TF-IDF feature extraction. Four algorithms are evaluated: Multinomial Naive Bayes, Logistic Regression, Random Forest, and Support Vector Machine (SVM), each with hyperparameter optimization and class balancing via SMOTE. On a 1,823-sample test set, all models achieve over 98% accuracy; SVM yields the most balanced performance and the highest F1-score for the promotion class (0.9908). Confusion matrices and learning curves indicate stable behavior without overfitting or underfitting. We therefore recommend SVM for operational deployment in automated moderation of gambling-promotion comments on YouTube. These findings provide practical guidance for platform safety teams and suggest methodological baselines for similar NLP moderation tasks. Future work should explore ensemble and deep learning approaches, incorporate character-level and subword-level features, and further evaluate robustness under adversarial obfuscation and domain shift.
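
As an illustration of two steps named in the abstract, the sketch below shows one plausible way to fold obfuscated Unicode text into a plain form and to admit only high-confidence pseudo-labels. It is not the authors' implementation: the function names `normalize_comment` and `select_pseudo_labels`, the use of NFKC folding, and the 0.95 confidence threshold are all assumptions made for this example, since the abstract does not specify them.

```python
import re
import unicodedata


def normalize_comment(text: str) -> str:
    """Fold stylized Unicode into plain lowercase text (illustrative only)."""
    # NFKC maps compatibility characters (fullwidth, mathematical bold,
    # circled letters) onto their ordinary counterparts.
    folded = unicodedata.normalize("NFKC", text)
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.casefold()
    # Replace emojis, punctuation, and other symbols with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse irregular spacing into single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()


def select_pseudo_labels(promo_probabilities, threshold=0.95):
    """Return (index, label) pairs for confidently predicted comments.

    promo_probabilities holds a model's predicted probability that each
    unlabeled comment is a promotion. The threshold is a hypothetical value.
    """
    selected = []
    for index, p_promo in enumerate(promo_probabilities):
        # Confidence is the probability of whichever class is predicted.
        confidence = max(p_promo, 1.0 - p_promo)
        if confidence >= threshold:
            selected.append((index, 1 if p_promo >= 0.5 else 0))
    return selected
```

In a pseudo-labeling loop of this kind, the selected pairs would be appended to the labeled set before retraining, while low-confidence comments stay unlabeled. This sketch does not address letters separated by spaces (for example "s l o t"), which would need an additional heuristic.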

Published

17-12-2025