Word Embedding for Arabic Text Classification


Date

2025-06-15

Publisher

Mohamed Boudiaf University of M'sila

Abstract

Arabic text classification is a critical task in natural language processing (NLP) with broad applications across media, business, and education. The unique linguistic features of Arabic, such as rich morphology and significant dialectal variation, together with limited annotated resources, pose substantial challenges for automated classification systems. This thesis investigates and compares the effectiveness of traditional machine learning models (such as SVM and Random Forest) and state-of-the-art BERT-based deep learning models (such as AraBERT and QARiB) for Arabic text classification. To address the limitations of existing resources, a multi-dialectal dataset comprising 3,600 samples across 18 functional categories was developed and annotated by expert linguists. The study evaluates each approach in terms of classification accuracy, robustness to dialectal and functional variation, and computational efficiency. Results demonstrate that while BERT-based models outperform traditional approaches in handling morphological complexity and contextual ambiguity, they require significantly greater computational resources. The findings highlight the importance of dataset diversity, dialectal representation, and resource considerations in developing robust Arabic text classification systems. This research contributes to the advancement of Arabic NLP by providing practical insights and recommendations for future model development and deployment.
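To make the traditional-model side of the comparison concrete, the following is a minimal, self-contained sketch of bag-of-words Arabic text classification using a nearest-centroid rule over cosine similarity. It is illustrative only: the toy sentences and the two categories ("sports", "economy") are invented for this example and are not drawn from the thesis dataset, and real systems of the kind the abstract describes would use richer features (e.g. TF-IDF with morphological preprocessing) and stronger classifiers such as SVM or Random Forest.

```python
from collections import Counter
import math

# Toy labeled Arabic samples (invented for illustration; not the thesis corpus).
train = [
    ("فاز الفريق في المباراة بهدفين", "sports"),
    ("سجل اللاعب هدف الفوز في الدقيقة الأخيرة", "sports"),
    ("ارتفعت أسعار النفط في الأسواق العالمية", "economy"),
    ("تراجعت البورصة بعد إعلان نتائج الشركات", "economy"),
]

def vectorize(text):
    """Whitespace-tokenized bag-of-words vector (no morphological analysis)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One centroid vector per category: the sum of its training vectors.
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(vectorize(text))

def classify(text):
    """Assign the category whose centroid is most similar to the input."""
    v = vectorize(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))

print(classify("هدف رائع في المباراة"))  # vocabulary overlaps the sports samples
```

Because the whitespace tokenizer treats inflected forms (e.g. "هدف" vs. "بهدفين") as unrelated tokens, this sketch also hints at why the abstract's morphological-complexity point matters: surface-form mismatch is exactly what subword-aware models like AraBERT and QARiB handle better.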

Keywords

Arabic text classification, natural language processing (NLP), Arabic morphology, dialectal variation, machine learning, traditional models (SVM, Random Forest), pre-trained BERT-based models (AraBERT, QARiB), annotated multi-dialectal corpus, classification accuracy, contextual ambiguity, computational efficiency, Arabic language resources
