MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization
DOI:
https://doi.org/10.22452/mjcs.vol38no1.4
Keywords:
Natural Language Processing, Machine Learning, Extra Tree, Offensive Text, Text Classification, Decision Tree, Ensemble Method, Malay Language
Abstract
Cyberbullying has increased globally, and offensive text contributes significantly to it. Detecting offensive text in Malay is challenging because of non-standard Malay text, unique social media writing styles, a lack of standardization, and limited resources. This study proposes the Malay Offensive Text Classification (MOTEC) framework to address these challenges. The MOTEC framework incorporates a Malay standardization preprocessing task that uses three specialized dictionaries: (a) abbreviations, (b) noisy text, and (c) Malaysian dialects. This approach enhances data quality by converting non-standard text into standardized Malay sentences before classification. For feature extraction, the framework employs Term Frequency-Inverse Document Frequency (TF-IDF), a statistical method that evaluates the importance of a word in a document relative to a collection of documents. An Extra Tree classifier then performs the classification. Evaluated on a private dataset collected from Twitter, the MOTEC framework achieved a classification accuracy of 94%, significantly outperforming other studies, which reported an accuracy of 84%. The MOTEC framework substantially improves the classification of offensive Malay text by enhancing accuracy, reducing execution time, and improving data quality through effective language standardization.
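The standardization and feature-extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the dictionary entries, the function names `standardize` and `tfidf`, the tokenization rule, and the plain `tf * log(N / df)` weighting are all assumptions made for the example, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical dictionary entries for illustration only; the actual
# MOTEC dictionaries are not given in the abstract.
ABBREVIATIONS = {"yg": "yang", "x": "tidak", "dgn": "dengan"}
NOISY_TEXT = {"sgt": "sangat", "bodoooh": "bodoh"}
DIALECTS = {"hang": "kamu", "demo": "kamu"}


def standardize(text):
    """Lowercase, tokenize, and map each token through the three dictionaries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for tok in tokens:
        # Apply the lookups in a fixed order (an assumption of this sketch).
        for table in (ABBREVIATIONS, NOISY_TEXT, DIALECTS):
            tok = table.get(tok, tok)
        out.append(tok)
    return " ".join(out)


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


raw = ["hang sgt bodoooh", "saya x suka dgn dia", "hang yg baik"]
clean = [standardize(t) for t in raw]
vectors = tfidf(clean)
```

In this sketch, a term that appears in fewer documents receives a higher weight than one that appears in many. The resulting vectors would then be passed to a tree-ensemble classifier; one readily available option is scikit-learn's `ExtraTreesClassifier`, although the abstract does not say which implementation the study used.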
License
Copyright (c) 2025 Malaysian Journal of Computer Science

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

