Systematic Literature Review of Arabic NLP Datasets: A Meta-Study

Amani Jamal

doi:10.48165/gjs.2026.3102

Authors

Amani Jamal, Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia

DOI:

https://doi.org/10.48165/gjs.2026.3102

Keywords:

Arabic Natural Language Processing; PRISMA Systematic Review; Sentiment Analysis; Dialect Identification; Dataset Catalogues

Abstract

This meta-study systematically reviews and synthesizes the Arabic Natural Language Processing (NLP) dataset landscape in accordance with PRISMA guidelines. The review identifies and categorizes publicly accessible datasets across NLP tasks including sentiment analysis, text classification, question answering, summarization, dialect detection, and paraphrasing. Key dataset catalogues such as Masader and Masader Plus are highlighted for their role in improving discoverability and metadata standardisation, while task- and domain-specific resource creation is represented by ArabSis, SNAD, A-MASA, AGS and WiHArD. Although the number and variety of datasets have grown in recent years, considerable gaps remain in dialect coverage, domain specificity, and annotation quality. The review concludes with recommendations for expanding under-resourced areas, improving annotation practices, and promoting open access in order to accelerate Arabic NLP research and applications.

Author Biography

Amani Jamal, Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia

Center of Research Excellence in Artificial Intelligence and Data Science, King Abdulaziz University, Jeddah, Saudi Arabia

References

Doughan, Z., Itani, S., & Itani, S. (2025). ArabSis: Arabic corpus sentiment analysis. IEEE Access, 13, 81083–81095. https://doi.org/10.1109/ACCESS.2025.3567755

Alsaleh, D., Alamir, M., & Marie-Sainte, S. (2020). SNAD Arabic dataset for deep learning. 630–640. https://doi.org/10.1007/978-3-030-55180-3_47

Al-Shameri, N., & Al-Khalifa, H. (2024). Arabic paraphrased parallel synthetic dataset. Data in Brief, 57.

Abdallah, A., Kasem, M., Abdalla, M., Mahmoud, M., Elkasaby, M., Elbendary, Y., & Jatowt, A. (2024). ArabicaQA: A comprehensive dataset for Arabic question answering. Proceedings of the 47th International ACM SIGIR Conference.

Ahmed, A., Ali, N., Alzubaidi, M., Zaghouani, W., Abd-Alrazaq, A., & Househ, M. (2022). Arabic chatbot technologies: A scoping review. Computer Methods and Programs in Biomedicine Update. https://doi.org/10.1016/j.cmpbup.2022.100057

Alyafeai, Z., Al-Shaibani, M., Ghaleb, M., & Ahmad, I. (2021). Evaluating various tokenizers for Arabic text classification. Neural Processing Letters, 55, 2911–2933. https://doi.org/10.1007/s11063-022-10990-8

Alowisheq, A., Al-Twairesh, N., Altuwaijri, M., AlMoammar, A., Alsuwailem, A., Albuhairi, T., Alahaideb, W., & Alhumoud, S. (2021). MARSA: Multi-domain Arabic resources for sentiment analysis. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3120746

Boujou, E., Chataoui, H., Mekki, A., Benjelloun, S., Chairi, I., & Berrada, I. (2021). An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction. arXiv.

Altaher, Y., Fadel, A., Alotaibi, M., Alyazidi, M., Al-Mutairi, M., Aldhbuiub, M., Mosaibah, A., Rezk, A., Alhendi, A., Shal, M., Alghamdi, E., Alshaibani, M., Zakraoui, J., Mohammed, W., Gaanoun, K., Elmadani, K., Ghaleb, M., Tazi, N., Alharbi, R., Masoud, M., & Alyafeai, Z. (2022). Masader Plus: A new interface for exploring 500+ Arabic NLP datasets.

Althobaiti, M. (2021). Creation of annotated country-level dialectal Arabic resources: An unsupervised approach. Natural Language Engineering, 28, 607–648. https://doi.org/10.1017/S135132492100019X

Elmadany, A., Nagoudi, E., & Abdul-Mageed, M. (2022). ORCA: A challenging benchmark for Arabic language understanding. arXiv. https://doi.org/10.48550/arXiv.2212.10758

Abdul-Mageed, M., Elmadany, A., & Nagoudi, E. (2020). ARBERT & MARBERT: Deep bidirectional transformers for Arabic.

Khondaker, M., Waheed, A., Nagoudi, E., & Abdul-Mageed, M. (2023). GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. arXiv. https://doi.org/10.48550/arXiv.2305.14976

Einea, O., Elnagar, A., & Debsi, R. (2019). SANAD: Single-label Arabic news articles dataset for automatic text categorization. Data in Brief, 25. https://doi.org/10.1016/j.dib.2019.104076

Bouchiha, D., Bouziane, A., Doumi, N., Berbouchi, F., Kebir, A., Mebarki, N., & Benameur, B. (2024). WiHArD: Wikipedia based hierarchical Arabic dataset for text classification. Proceedings of the 4th International Conference on Embedded & Distributed Systems (EDiS), 115–118. https://doi.org/10.1109/EDiS63605.2024.10783418

Abdelali, A., Mubarak, H., Samih, Y., Hassan, S., & Darwish, K. (2020). QADI: Arabic dialect identification in the wild.

Obeidat, R., Al-Harahsheh, Y., Al-Ayyoub, M., & Gharaibeh, M. (2024). ArEntail: Manually-curated Arabic natural language inference dataset from news headlines. Language Resources and Evaluation, 59, 509–535. https://doi.org/10.1007/s10579-024-09731-1

Al-Thubaity, A., Alkhereyf, S., Al-Zahrani, W., & Bahanshal, A. (2022). CAraNER: The COVID-19 Arabic named entity corpus. Proceedings of WANLP. https://doi.org/10.18653/v1/2022.wanlp-1.1

Yagi, S., Elnagar, A., & Yaghi, E. (2024). Arabic punctuation dataset. Data in Brief, 53. https://doi.org/10.1016/j.dib.2024.110118

Alyemny, O., Al-Khalifa, H., & Mirza, A. (2023). A data-driven exploration of a new Islamic Fatwas dataset for Arabic NLP tasks. Data, 8, 155. https://doi.org/10.3390/data8100155

Eid, Y., Zayed, H., & Medhat, W. (2024). A-MASA: Arabic multi-domain aspect-based sentiment analysis datasets. Procedia Computer Science, 202–211. https://doi.org/10.1016/j.procs.2024.10.193

Habib, M., Faris, M., Alomari, A., & Faris, H. (2021). AltibbiVec: A word embedding model for medical and health applications in the Arabic language. IEEE Access, 9, 133875–133888. https://doi.org/10.1109/ACCESS.2021.3115617

Malaysha, S., El-Haj, M., Ezzini, S., Khalilia, M., Jarrar, M., Almujaiwel, S., Berrada, I., & Bouamor, H. (2024). AraFinNLP 2024: The first Arabic financial NLP shared task.

Bashir, M., Azmi, A., Nawaz, H., Zaghouani, W., Diab, M., Al-Fuqaha, A., & Qadir, J. (2021). Arabic natural language processing for Qur’anic research: A systematic review. Artificial Intelligence Review, 56, 6801–6854. https://doi.org/10.1007/s10462-022-10313-2

Hejazi, H., & Khamees, A. (2022). Opinion mining for Arabic dialect in social media data fusion platforms: A systematic review. Fusion: Practice and Applications. https://doi.org/10.54216/FPA.090101

Elnagar, A., Yagi, S., Nassif, A., Shahin, I., & Salloum, S. (2021). Systematic literature review of dialectal Arabic: Identification and detection. IEEE Access, 9, 31010–31042. https://doi.org/10.1109/ACCESS.2021.3059504

Obiedat, R., Al-Darras, D., Alzaghoul, E., & Harfoushi, O. (2021). Arabic aspect-based sentiment analysis: A systematic literature review. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3127140

Alyami, S., Alhothali, A., & Jamal, A. (2022). Systematic literature review of Arabic aspect-based sentiment analysis.

Nelci, J., Salloum, S., & Shaalan, K. (2021). An overview of sentiment analysis with dialectical processing. Proceedings of the International Conference on Emerging Technologies and Intelligent Systems. https://doi.org/10.1007/978-3-030-82616-1_1

Nassif, A., Elnagar, A., Shahin, I., & Henno, S. (2020). Deep learning for Arabic subjective sentiment analysis: Challenges and research opportunities. Applied Soft Computing, 98, 106836. https://doi.org/10.1016/j.asoc.2020.106836

Alhumoud, S., & Wazrah, A. (2021). Arabic sentiment analysis using recurrent neural networks: A review. Artificial Intelligence Review, 55, 707–748. https://doi.org/10.1007/s10462-021-09989-9

Matrane, Y., Benabbou, F., & Sael, N. (2023). A systematic literature review of Arabic dialect sentiment analysis. Journal of King Saud University – Computer and Information Sciences, 35, 101570. https://doi.org/10.1016/j.jksuci.2023.101570

Alhazmi, A., Mahmud, R., Idris, N., Abo, M., & Eke, C. (2024). A systematic literature review of hate speech identification on Arabic Twitter data: Research challenges and future directions. PeerJ Computer Science, 10. https://doi.org/10.7717/peerj-cs.1966

Alrayzah, A., Alsolami, F., & Saleh, M. (2023). Challenges and opportunities for Arabic question answering systems.

Alasmari, A. (2025). A scoping review of Arabic natural language processing for mental health. Healthcare, 13. https://doi.org/10.3390/healthcare13090963

Ouali, S., & Said, E. (2024). Arabic chatbots challenges and solutions: A systematic literature review. Iraqi Journal for Computer Science and Mathematics. https://doi.org/10.52866/ijcsm.2024.05.03.007

Bourahouat, G., Abourezq, M., & Daoudi, N. (2024). Word embedding as a semantic feature extraction technique in Arabic natural language processing: An overview. International Arab Journal of Information Technology, 21, 313–325. https://doi.org/10.34028/21/2/13

Bouzahir, M., Abdelouahad, A., & Nabil, M. (2022). How far can deep learning improve Arabic part-of-speech tagging? 206–215. https://doi.org/10.1007/978-3-031-06458-6_17

Alaloye, H., Alkhodre, A., & Nabil, E. (2025). Utilizing NLP to optimize municipal services delivery using a novel municipal Arabic dataset. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/IJACSA.2025.0160278
