START Conference Manager    

Sentence Level Dialect Identification in Arabic

Heba Elfardy and Mona Diab

The 51st Annual Meeting of the Association for Computational Linguistics - Short Papers (ACL Short Papers 2013)
Sofia, Bulgaria, August 4-9, 2013


This paper introduces a supervised approach to sentence-level dialect identification between Modern Standard Arabic (MSA) and Egyptian Dialectal Arabic (EDA). We use token-level labels to derive sentence-level features. These features are then combined with other features (perplexity, informality, and meta features) to train a generative classifier that predicts the label of each sentence in the given input text. The system achieves an accuracy of 85.4% on a cross-validation set (outperforming a previously proposed system that uses the same dataset) and 78.1% on a held-out test set. On the test set, this is a significant gain over a strong baseline of 72.1%, which bases its decisions solely on the percentage of dialectal Arabic (DA) words, and over a majority baseline of 56.1%.
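
The strong baseline mentioned above decides from the percentage of DA words alone. A minimal sketch of that idea follows, assuming each sentence arrives as a list of token-level tags. The tag strings "DA" and "MSA", the function names, and the 0.5 threshold are hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def da_fraction(token_labels: Sequence[str], da_tag: str = "DA") -> float:
    """Return the share of tokens tagged as dialectal (0.0 for an empty sentence)."""
    if not token_labels:
        return 0.0
    da_count = sum(1 for tag in token_labels if tag == da_tag)
    return da_count / len(token_labels)


def baseline_label(token_labels: Sequence[str], threshold: float = 0.5) -> str:
    """Label a sentence "EDA" when its DA share reaches the threshold, else "MSA".

    The threshold value is an assumption, not a figure taken from the paper.
    """
    return "EDA" if da_fraction(token_labels) >= threshold else "MSA"


if __name__ == "__main__":
    # Hypothetical token-level tags for two sentences.
    print(baseline_label(["DA", "MSA", "MSA", "DA"]))
    print(baseline_label(["MSA", "MSA", "DA"]))
```

In a supervised setup like the one described, the value returned by `da_fraction` could also serve as one sentence-level feature alongside the others, instead of being thresholded directly.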
