Systematic Literature Review of Arabic NLP Datasets: A Meta-Study
DOI:
https://doi.org/10.48165/gjs.2026.3102
Keywords:
Arabic Natural Language Processing; PRISMA Systematic Review; Sentiment Analysis; Dialect Identification; Dataset Catalogues
Abstract
This meta-study systematically reviews and synthesises the Arabic Natural Language Processing (NLP) dataset landscape, following PRISMA guidelines. The review identifies and categorises publicly accessible datasets across NLP tasks including sentiment analysis, text classification, question answering, summarisation, dialect detection, and paraphrasing. Dataset catalogues such as Masader and Masader Plus are highlighted for their role in improving discoverability and metadata standardisation, while ArabSis, SNAD, A-MASA, AGS and WiHArD illustrate task-specific and domain-specific resource creation. Although the number and variety of datasets have grown in recent years, considerable gaps remain in dialect coverage, domain specificity and annotation quality. The review closes with recommendations for expanding under-resourced areas, improving annotation practices, and promoting open access in order to accelerate Arabic NLP research and applications.
References
Doughan, Z., Itani, S., & Itani, S. (2025). ArabSis: Arabic corpus sentiment analysis. IEEE Access, 13, 81083–81095. https://doi.org/10.1109/ACCESS.2025.3567755
Alsaleh, D., Alamir, M., & Marie-Sainte, S. (2020). SNAD Arabic dataset for deep learning. 630–640. https://doi.org/10.1007/978-3-030-55180-3_47
Al-Shameri, N., & Al-Khalifa, H. (2024). Arabic paraphrased parallel synthetic dataset. Data in Brief, 57.
Abdallah, A., Kasem, M., Abdalla, M., Mahmoud, M., Elkasaby, M., Elbendary, Y., & Jatowt, A. (2024). ArabicaQA: A comprehensive dataset for Arabic question answering. Proceedings of the 47th International ACM SIGIR Conference.
Ahmed, A., Ali, N., Alzubaidi, M., Zaghouani, W., Abd-Alrazaq, A., & Househ, M. (2022). Arabic chatbot technologies: A scoping review. Computer Methods and Programs in Biomedicine Update. https://doi.org/10.1016/j.cmpbup.2022.100057
Alyafeai, Z., Al-Shaibani, M., Ghaleb, M., & Ahmad, I. (2021). Evaluating various tokenizers for Arabic text classification. Neural Processing Letters, 55, 2911–2933. https://doi.org/10.1007/s11063-022-10990-8
Alowisheq, A., Al-Twairesh, N., Altuwaijri, M., AlMoammar, A., Alsuwailem, A., Albuhairi, T., Alahaideb, W., & Alhumoud, S. (2021). MARSA: Multi-domain Arabic resources for sentiment analysis. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3120746
Boujou, E., Chataoui, H., Mekki, A., Benjelloun, S., Chairi, I., & Berrada, I. (2021). An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction. arXiv.
Altaher, Y., Fadel, A., Alotaibi, M., Alyazidi, M., Al-Mutairi, M., Aldhbuiub, M., Mosaibah, A., Rezk, A., Alhendi, A., Shal, M., Alghamdi, E., Alshaibani, M., Zakraoui, J., Mohammed, W., Gaanoun, K., Elmadani, K., Ghaleb, M., Tazi, N., Alharbi, R., Masoud, M., & Alyafeai, Z. (2022). Masader Plus: A new interface for exploring 500+ Arabic NLP datasets.
Althobaiti, M. (2021). Creation of annotated country-level dialectal Arabic resources: An unsupervised approach. Natural Language Engineering, 28, 607–648. https://doi.org/10.1017/S135132492100019X
Elmadany, A., Nagoudi, E., & Abdul-Mageed, M. (2022). ORCA: A challenging benchmark for Arabic language understanding. arXiv. https://doi.org/10.48550/arXiv.2212.10758
Abdul-Mageed, M., Elmadany, A., & Nagoudi, E. (2020). ARBERT & MARBERT: Deep bidirectional transformers for Arabic.
Khondaker, M., Waheed, A., Nagoudi, E., & Abdul-Mageed, M. (2023). GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. arXiv. https://doi.org/10.48550/arXiv.2305.14976
Einea, O., Elnagar, A., & Debsi, R. (2019). SANAD: Single-label Arabic news articles dataset for automatic text categorization. Data in Brief, 25. https://doi.org/10.1016/j.dib.2019.104076
Bouchiha, D., Bouziane, A., Doumi, N., Berbouchi, F., Kebir, A., Mebarki, N., & Benameur, B. (2024). WiHArD: Wikipedia based hierarchical Arabic dataset for text classification. Proceedings of the 4th International Conference on Embedded & Distributed Systems (EDiS), 115–118. https://doi.org/10.1109/EDiS63605.2024.10783418
Abdelali, A., Mubarak, H., Samih, Y., Hassan, S., & Darwish, K. (2020). QADI: Arabic dialect identification in the wild.
Obeidat, R., Al-Harahsheh, Y., Al-Ayyoub, M., & Gharaibeh, M. (2024). ArEntail: Manually-curated Arabic natural language inference dataset from news headlines. Language Resources and Evaluation, 59, 509–535. https://doi.org/10.1007/s10579-024-09731-1
Al-Thubaity, A., Alkhereyf, S., Al-Zahrani, W., & Bahanshal, A. (2022). CAraNER: The COVID-19 Arabic named entity corpus. Proceedings of WANLP. https://doi.org/10.18653/v1/2022.wanlp-1.1
Yagi, S., Elnagar, A., & Yaghi, E. (2024). Arabic punctuation dataset. Data in Brief, 53. https://doi.org/10.1016/j.dib.2024.110118
Alyemny, O., Al-Khalifa, H., & Mirza, A. (2023). A data-driven exploration of a new Islamic Fatwas dataset for Arabic NLP tasks. Data, 8, 155. https://doi.org/10.3390/data8100155
Eid, Y., Zayed, H., & Medhat, W. (2024). A-MASA: Arabic multi-domain aspect-based sentiment analysis datasets. Procedia Computer Science, 202–211. https://doi.org/10.1016/j.procs.2024.10.193
Habib, M., Faris, M., Alomari, A., & Faris, H. (2021). AltibbiVec: A word embedding model for medical and health applications in the Arabic language. IEEE Access, 9, 133875–133888. https://doi.org/10.1109/ACCESS.2021.3115617
Malaysha, S., El-Haj, M., Ezzini, S., Khalilia, M., Jarrar, M., Almujaiwel, S., Berrada, I., & Bouamor, H. (2024). AraFinNLP 2024: The first Arabic financial NLP shared task.
Bashir, M., Azmi, A., Nawaz, H., Zaghouani, W., Diab, M., Al-Fuqaha, A., & Qadir, J. (2021). Arabic natural language processing for Qur’anic research: A systematic review. Artificial Intelligence Review, 56, 6801–6854. https://doi.org/10.1007/s10462-022-10313-2
Hejazi, H., & Khamees, A. (2022). Opinion mining for Arabic dialect in social media data fusion platforms: A systematic review. Fusion: Practice and Applications. https://doi.org/10.54216/FPA.090101
Elnagar, A., Yagi, S., Nassif, A., Shahin, I., & Salloum, S. (2021). Systematic literature review of dialectal Arabic: Identification and detection. IEEE Access, 9, 31010–31042. https://doi.org/10.1109/ACCESS.2021.3059504
Obiedat, R., Al-Darras, D., Alzaghoul, E., & Harfoushi, O. (2021). Arabic aspect-based sentiment analysis: A systematic literature review. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3127140
Alyami, S., Alhothali, A., & Jamal, A. (2022). Systematic literature review of Arabic aspect-based sentiment analysis.
Nelci, J., Salloum, S., & Shaalan, K. (2021). An overview of sentiment analysis with dialectical processing. Proceedings of the International Conference on Emerging Technologies and Intelligent Systems. https://doi.org/10.1007/978-3-030-82616-1_1
Nassif, A., Elnagar, A., Shahin, I., & Henno, S. (2020). Deep learning for Arabic subjective sentiment analysis: Challenges and research opportunities. Applied Soft Computing, 98, 106836. https://doi.org/10.1016/j.asoc.2020.106836
Alhumoud, S., & Wazrah, A. (2021). Arabic sentiment analysis using recurrent neural networks: A review. Artificial Intelligence Review, 55, 707–748. https://doi.org/10.1007/s10462-021-09989-9
Matrane, Y., Benabbou, F., & Sael, N. (2023). A systematic literature review of Arabic dialect sentiment analysis. Journal of King Saud University – Computer and Information Sciences, 35, 101570. https://doi.org/10.1016/j.jksuci.2023.101570
Alhazmi, A., Mahmud, R., Idris, N., Abo, M., & Eke, C. (2024). A systematic literature review of hate speech identification on Arabic Twitter data: Research challenges and future directions. PeerJ Computer Science, 10. https://doi.org/10.7717/peerj-cs.1966
Alrayzah, A., Alsolami, F., & Saleh, M. (2023). Challenges and opportunities for Arabic question answering systems.
Alasmari, A. (2025). A scoping review of Arabic natural language processing for mental health. Healthcare, 13. https://doi.org/10.3390/healthcare13090963
Ouali, S., & Said, E. (2024). Arabic chatbots challenges and solutions: A systematic literature review. Iraqi Journal for Computer Science and Mathematics. https://doi.org/10.52866/ijcsm.2024.05.03.007
Bourahouat, G., Abourezq, M., & Daoudi, N. (2024). Word embedding as a semantic feature extraction technique in Arabic natural language processing: An overview. International Arab Journal of Information Technology, 21, 313–325. https://doi.org/10.34028/21/2/13
Bouzahir, M., Abdelouahad, A., & Nabil, M. (2022). How far can deep learning improve Arabic part-of-speech tagging? 206–215. https://doi.org/10.1007/978-3-031-06458-6_17
Alaloye, H., Alkhodre, A., & Nabil, E. (2025). Utilizing NLP to optimize municipal services delivery using a novel municipal Arabic dataset. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/IJACSA.2025.0160278

