Scaling Semi-supervised Naive Bayes with Feature Marginals
Michael Lucas and Doug Downey
The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)
Sofia, Bulgaria, August 4-9, 2013
Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data. SSL techniques are often effective in text classification, where labeled data is scarce but large unlabeled corpora are readily available. However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, meaning the techniques are not applicable to the large corpora being produced today.
In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets. We present a novel learning algorithm, which optimizes a Naive Bayes model to accord with statistics calculated from the unlabeled corpus. In experiments with text topic classification and sentiment analysis, we show that our method is both more scalable and more accurate than SSL techniques from previous work.
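To make the core idea concrete, the following is a minimal illustrative sketch, not the paper's exact optimization: it estimates per-class word conditionals from labeled data, estimates word marginals from the unlabeled corpus, and then rescales the conditionals so that their class-prior-weighted mixture matches the unlabeled marginals. All function and variable names here are hypothetical.

```python
from collections import Counter

def nb_with_feature_marginals(labeled_docs, labels, unlabeled_docs, alpha=1.0):
    """Sketch: adjust Naive Bayes word conditionals so that the
    prior-weighted mixture sum_c P(c) * P(w|c) matches word marginals
    P(w) estimated from a larger unlabeled corpus. A simplification
    of the paper's approach for illustration only."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in labeled_docs + unlabeled_docs for w in d})

    # Class priors from the labeled data.
    prior = {c: labels.count(c) / len(labels) for c in classes}

    # Laplace-smoothed per-class word conditionals from the labeled data.
    counts = {c: Counter() for c in classes}
    for doc, y in zip(labeled_docs, labels):
        counts[y].update(doc)
    cond = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        cond[c] = {w: (counts[c][w] + alpha) / total for w in vocab}

    # Smoothed word marginals from the (much larger) unlabeled corpus.
    u_counts = Counter(w for d in unlabeled_docs for w in d)
    u_total = sum(u_counts.values()) + alpha * len(vocab)
    marginal = {w: (u_counts[w] + alpha) / u_total for w in vocab}

    # Rescale each word's conditionals so the prior-weighted mixture
    # equals the unlabeled-corpus marginal for that word.
    adjusted = {c: {} for c in classes}
    for w in vocab:
        mix = sum(prior[c] * cond[c][w] for c in classes)
        scale = marginal[w] / mix
        for c in classes:
            adjusted[c][w] = cond[c][w] * scale
    return prior, adjusted
```

By construction, after the rescaling step the mixture sum over classes of `prior[c] * adjusted[c][w]` reproduces the unlabeled-corpus estimate of P(w) exactly, which is the sense in which the model is made to accord with statistics from the unlabeled data.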