Word Embedding for Arabic Text Classification
Date
2025-06-15
Publisher
Mohamed Boudiaf University of M'sila
Abstract
Arabic text classification is a critical task in natural language processing (NLP) with
broad applications across media, business, and education. The linguistic features of
Arabic, such as its rich morphology and significant dialectal variation, together with the
scarcity of annotated resources, pose substantial challenges for automated classification
systems. This thesis investigates and
compares the effectiveness of traditional machine learning models (such as SVM and Random
Forest) and state-of-the-art BERT-based deep learning models (such as AraBERT and QARiB)
for Arabic text classification. A new multi-dialectal dataset comprising 3,600 samples
across 18 functional categories was developed and annotated by expert linguists to address the
limitations of existing resources. The study evaluates each approach in terms of classification
accuracy, robustness to dialectal and functional variations, and computational efficiency.
Results demonstrate that while BERT-based models outperform traditional approaches in
handling morphological complexity and contextual ambiguity, they require significantly greater
computational resources. The findings highlight the importance of dataset diversity, dialectal
representation, and resource considerations in developing robust Arabic text classification
systems. This research contributes to the advancement of Arabic NLP by providing practical
insights and recommendations for future model development and deployment.
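The traditional baseline the abstract describes, TF-IDF features feeding a linear SVM, can be sketched in a few lines of scikit-learn. This is a hedged illustration only: the sentences and category labels below are toy examples invented for the sketch, not samples from the thesis's 3,600-item corpus, and the character n-gram settings are one plausible choice for morphologically rich Arabic text rather than the configuration used in the study.

```python
# Minimal sketch of a TF-IDF + linear SVM baseline for Arabic text
# classification. Toy data; category names ("sport", "economy") are
# illustrative, not the thesis's 18 functional categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "فاز الفريق بالمباراة النهائية",       # sports
    "سجل اللاعب هدفين في المباراة",       # sports
    "ارتفعت أسعار النفط في الأسواق",      # economy
    "انخفضت قيمة العملة مقابل الدولار",   # economy
]
train_labels = ["sport", "sport", "economy", "economy"]

# Character n-grams within word boundaries ("char_wb") are a common
# way to soften the impact of Arabic's rich morphology on sparse
# word-level features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

print(model.predict(["خسر الفريق في المباراة"]))
```

A BERT-based alternative such as AraBERT or QARiB would replace the TF-IDF features with contextual embeddings from a fine-tuned transformer, which is where the abstract's accuracy-versus-compute trade-off arises.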
Keywords
Arabic text classification, natural language processing (NLP), Arabic morphology, dialectal variation, machine learning, traditional models (SVM, Random Forest), pre-trained BERT-based models (AraBERT, QARiB), annotated multi-dialectal corpus, classification accuracy, contextual ambiguity, computational efficiency, Arabic language resources