Title: Two new machine learning approaches for text classification
Text document classification is one of the most well-studied applications of machine learning. Yet this technology is still limited by practical difficulties and invalid underlying assumptions.
First, many people who want text classifiers do not have the time or resources to annotate a dataset. They often employ a heuristic alternative: they create word lists for each label class, and then perform prediction by selecting the class whose list matches the largest number of words in the text. This heuristic is theoretically unjustified, and mistakenly assigns the same importance to every word in the list. I show that list-based classification can be viewed as a (very!) special case of Naive Bayes. Based on this analysis, it is possible to estimate weights for each word without supervision, using the method of moments.
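As a rough illustration of the heuristic above and its Naive Bayes generalization: the list-based classifier counts matches with equal weight, while the weighted version sums per-word scores (log-likelihood ratios in the Naive Bayes view). The word lists and weights here are purely illustrative, not from the talk.

```python
# Hypothetical word lists for a two-class sentiment task (illustrative only).
WORD_LISTS = {
    "positive": {"good", "great", "happy", "excellent"},
    "negative": {"bad", "awful", "sad", "terrible"},
}

def list_classify(text):
    """The heuristic: pick the class whose list matches the most tokens.

    Every list word carries equal weight, which corresponds to Naive
    Bayes with identical weights for all in-list words.
    """
    tokens = text.lower().split()
    counts = {label: sum(1 for t in tokens if t in words)
              for label, words in WORD_LISTS.items()}
    return max(counts, key=counts.get)

def weighted_classify(text, weights):
    """Naive Bayes generalization: each word contributes a per-class
    weight (e.g. a log-probability), rather than a uniform +1.

    `weights` maps each label to a dict of per-word scores; in the
    approach described above, these would be estimated without
    supervision via the method of moments.
    """
    tokens = text.lower().split()
    scores = {label: sum(w.get(t, 0.0) for t in tokens)
              for label, w in weights.items()}
    return max(scores, key=scores.get)
```

For example, `list_classify("a great and happy day")` returns `"positive"`, since two tokens match the positive list and none match the negative list.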
Second, machine learning approaches to text classification nearly always begin with an IID assumption. Yet words can mean different things to different people, raising the possibility of misunderstanding even in human-human conversation. One potential solution is to relax the IID assumption by personalizing text classifiers to the author. An apparent roadblock is the challenge of obtaining labeled data for each author. I will present a method that sidesteps this requirement by relying on the sociological theory of homophily, which states that people who are socially connected tend to share personal traits. This idea can be formalized by estimating node embeddings for each individual in a social network, and then using these embeddings to drive a social attentional mechanism in a neural ensemble classifier. The resulting system obtains significant improvements on sentiment analysis on Twitter. This project is joint work with Yi Yang.
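As an illustrative sketch of the ensemble idea (not the exact model from the talk): an author's node embedding can be compared against a key vector for each basis classifier, and the resulting softmax attention weights used to mix the classifiers' outputs. All dimensions, key vectors, and scores below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K basis classifiers, d-dimensional node
# embeddings, and a two-class output (all values illustrative).
K, d, n_classes = 3, 8, 2
classifier_keys = rng.normal(size=(K, d))  # one key vector per basis model

def social_attention_predict(author_embedding, basis_scores):
    """Mix basis classifiers with attention driven by the author's
    node embedding.

    author_embedding: (d,) embedding of the author in the social network.
    basis_scores: (K, n_classes) class scores from each basis classifier.
    Returns a convex combination of the basis classifiers' scores.
    """
    logits = classifier_keys @ author_embedding  # (K,) affinity per model
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                           # softmax over the K models
    return attn @ basis_scores                   # (n_classes,)

author = rng.normal(size=d)
scores = rng.normal(size=(K, n_classes))
combined = social_attention_predict(author, scores)
```

Because the attention weights are a softmax, authors with similar embeddings (socially connected people, under homophily) receive similar mixtures of the basis classifiers, which is how the ensemble personalizes without per-author labeled data.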