Bridging Languages through Etymology: The case of cross language text categorization

Vivi Nastase and Carlo Strapparava

The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)
Sofia, Bulgaria, August 4-9, 2013


We hypothesize that word etymology is useful for NLP applications as a bridge between languages, and we support this hypothesis with experiments in cross-language (English-Italian) document categorization. In a straightforward bag-of-words experimental set-up, we add the etymological ancestors of the words in the documents, then evaluate a model built on English data on Italian test data (and vice versa). The results show an improvement over the raw (vanilla bag-of-words) representation that is not only statistically significant but also large: a jump of almost 40 percentage points.
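
The idea of adding etymological ancestors to a bag-of-words representation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TOY_ETYMOLOGY` table, the `lat:` prefix marking ancestor features, and the helper functions are hypothetical and do not reproduce the etymological resource, feature weighting, or classifier used in the experiments. The sketch only shows how shared ancestors can give an English and an Italian document features in common when their surface vocabularies do not overlap.

```python
from collections import Counter

# Hypothetical toy etymology table, for illustration only:
# (language, word) -> list of ancestor features.
# The "lat:" prefix is an assumed convention that keeps ancestor features
# distinct from surface words (e.g. Italian "musica" vs. Latin "musica").
TOY_ETYMOLOGY = {
    ("en", "school"): ["lat:schola"],
    ("it", "scuola"): ["lat:schola"],
    ("en", "music"): ["lat:musica"],
    ("it", "musica"): ["lat:musica"],
}


def bag_of_words(tokens, lang, etymology=None):
    """Count the tokens; if an etymology table is given, also count
    each token's ancestors as additional features."""
    bag = Counter(tokens)
    if etymology is not None:
        for token in tokens:
            for ancestor in etymology.get((lang, token), []):
                bag[ancestor] += 1
    return bag


def shared_features(bag_a, bag_b):
    """Return the set of features that occur in both bags."""
    return set(bag_a) & set(bag_b)


en_doc = ["the", "school", "teaches", "music"]
it_doc = ["la", "scuola", "insegna", "musica"]

# Raw bags: the two documents have no feature in common.
raw_overlap = shared_features(
    bag_of_words(en_doc, "en"),
    bag_of_words(it_doc, "it"),
)

# Bags augmented with ancestors: the shared ancestors become common features.
ety_overlap = shared_features(
    bag_of_words(en_doc, "en", TOY_ETYMOLOGY),
    bag_of_words(it_doc, "it", TOY_ETYMOLOGY),
)

print(sorted(raw_overlap))
print(sorted(ety_overlap))
```

In this toy setting the raw bags share no features, while the augmented bags share the two ancestor features. A classifier trained on one language can only transfer to the other through features of this shared kind.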

