Typesetting for Improved Readability using Lexical and Syntactic Information
Ahmed Salama, Kemal Oflazer and Susan Hagan
The 51st Annual Meeting of the Association for Computational Linguistics - Short Papers (ACL Short Papers 2013)
Sofia, Bulgaria, August 4-9, 2013
We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of fixed width, using an otherwise standard dynamic programming line-breaking algorithm to minimize raggedness. In addition to a rule-based baseline segmenter, we use a very modestly sized text, manually annotated with break positions, to train a maximum entropy classifier that relies on an extensive set of lexical and syntactic features and predicts whether or not to break after a given word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features that optimizes F1, arriving at a feature set that delivers 89.2% precision and 90.2% recall (89.7% F1) on a test set, improving on the rule-based baseline by about 11 points and on the classifier trained on all features by about 1 point in F1.
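The line-breaking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the sentence has already been segmented into unbreakable units (in the paper, by the rule-based segmenter or the classifier), and it assumes a raggedness cost equal to the squared number of unused columns on every line except the last. The abstract does not specify the cost function, so that choice, along with the name `break_lines`, is hypothetical and not the authors' implementation.

```python
from typing import List


def break_lines(units: List[str], width: int) -> List[str]:
    """Lay out unbreakable units on lines of at most `width` characters.

    Assumed cost (not taken from the paper): the sum of squared unused
    columns over all lines except the last one.
    """
    for unit in units:
        if len(unit) > width:
            raise ValueError(f"unit longer than line width: {unit!r}")

    n = len(units)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of typesetting units[i:]
    nxt = [n] * (n + 1)      # nxt[i]: index of the unit that starts the next line
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the separator space counted for the first unit
        for j in range(i + 1, n + 1):
            length += 1 + len(units[j - 1])
            if length > width:
                break
            # The last line is free; other lines pay for their trailing space.
            line_cost = 0.0 if j == n else float((width - length) ** 2)
            total = line_cost + best[j]
            if total < best[i]:
                best[i] = total
                nxt[i] = j

    lines = []
    i = 0
    while i < n:
        j = nxt[i]
        lines.append(" ".join(units[i:j]))
        i = j
    return lines


if __name__ == "__main__":
    # Hypothetical pre-segmented units; breaks may only fall between them.
    example = ["We present", "results", "from our study", "of line breaking"]
    for line in break_lines(example, width=20):
        print(line)
```

Because breaks are only considered between units, a multi-word unit such as "from our study" is never split across lines, at the price of somewhat more ragged lines than word-level breaking would allow.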