START Conference Manager    

Scalable Modified Kneser-Ney Language Model Estimation

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan Clark, Mohammed Mediani and Philipp Koehn

The 51st Annual Meeting of the Association for Computational Linguistics - Short Papers (ACL Short Papers 2013)
Sofia, Bulgaria, August 4-9, 2013


We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enable the algorithm to scale to much larger models by using a fixed amount of RAM and a variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments show improvements of 0.63 to 1.01 BLEU points over constrained systems for the 2013 Workshop on Machine Translation task.
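
To make the "fixed RAM, variable disk" idea concrete, the following is a minimal sketch of bounded-memory n-gram counting by spilling sorted runs to disk and merging them in a streaming pass. It is not the authors' implementation and covers only raw counting; discounting and interpolation are omitted. All names (`count_ngrams`, `spill`, `read_run`, `max_in_memory`) are hypothetical and chosen for illustration.

```python
import heapq
import itertools
import os
import tempfile
from collections import Counter


def ngrams(tokens, order):
    """Yield every n-gram of length `order` as a space-joined string."""
    for i in range(len(tokens) - order + 1):
        yield " ".join(tokens[i:i + order])


def spill(counts, tmpdir):
    """Write one sorted run of (n-gram, count) pairs to disk; return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for gram in sorted(counts):
            out.write(f"{gram}\t{counts[gram]}\n")
    return path


def read_run(path):
    """Stream (n-gram, count) pairs back from a sorted run file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            gram, count = line.rstrip("\n").split("\t")
            yield gram, int(count)


def count_ngrams(lines, order=2, max_in_memory=1_000_000):
    """Yield (n-gram, total count) in sorted order.

    Assumption: memory is bounded by holding at most `max_in_memory`
    distinct n-grams (plus those of one input line) before spilling.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        runs, counts = [], Counter()
        for line in lines:
            counts.update(ngrams(line.split(), order))
            if len(counts) >= max_in_memory:
                runs.append(spill(counts, tmpdir))
                counts = Counter()
        if counts:
            runs.append(spill(counts, tmpdir))
        # Streaming k-way merge of the sorted runs; equal n-grams arrive
        # adjacent to each other, so their counts can be summed on the fly.
        merged = heapq.merge(*(read_run(p) for p in runs))
        for gram, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            yield gram, sum(c for _, c in group)
```

In this sketch, lowering `max_in_memory` trades RAM for more run files on disk, while the merge step reads each run sequentially regardless of corpus size.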
