Velikovich et al, NAACL 2010
- The viability of web-derived polarity lexicons
- Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, Ryan McDonald
- Forthcoming NAACL 2010.
(writeup was written for a draft version of the paper)
This paper examines unsupervised methods for extending a polarity Lexicon. They use Distributional_Similarity from a very large (and proprietary) web corpus, and extend a seed set of terms with their own LabelPropagation technique. They create an impressively large and broad lexicon, and demonstrate that it improves a word-counting-based (i.e., lexicon-based) sentiment classifier.
I interpret a polarity lexicon to be a list of words and phrases, each tagged "positive" or "negative" with a weight. This is helpful for doing sentiment classification; it's a form of lexical knowledge you can apply to appearances of words by, say, counting positive and negative words. Inferring a lexicon is an a-contextual (i.e., context-independent) Semantic orientation of words problem.
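To make the word-counting idea concrete, here is a minimal sketch of a lexicon-based classifier. The toy lexicon, its weights, and the function names are all made up for illustration; this is not the classifier used in the paper.

```python
def lexicon_score(text, lexicon):
    """Sum the lexicon weights of every token that appears in the lexicon."""
    tokens = text.lower().split()
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def classify(text, lexicon):
    """Label text by the sign of its summed lexicon weights."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical weights, chosen only for this example.
toy_lexicon = {"great": 1.0, "good": 0.5, "bad": -0.8, "awful": -1.0}

print(classify("a great phone with a bad battery", toy_lexicon))
```

Words missing from the lexicon contribute nothing, which is why lexicon coverage matters so much for this kind of classifier.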
They are interested in using unlabeled data from the web. In this way it is reminiscent of Turney,_ACL_2002, which learned polarity information for words by using co-occurrence statistics from the web. This paper uses Distributional_Similarity as the core information.
A word's or phrase's distributional similarity to another word or phrase is a function of the surrounding contexts of each time it appears. (It is superficially similar to, but NOT the same as, co-occurrence statistics between the words.) It measures to what extent the words/phrases can be similar slot-fillers. For example, here's a quick example I threw together from some twitter data. I'm comparing three words: "great", "good", and "bad".
A context is usually defined as the tuple of several words immediately to the left and right of the word. You then look at the word's many appearances in the corpus and count the contexts. Words that can be used in similar ways should get similar distributions of context appearances. Above, the columns are just normalized counts (probability of that context for the word). The hope is that words with similar sentiment polarity will have a high distributional similarity score. Indeed, in this example (with data from 2M occurrences, which is very small for this sort of thing), the cosine similarities give sim(great,good) > sim(great,bad), which is exactly what you hope.
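Here is a rough sketch of that computation, assuming a context is the pair (one word to the left, one word to the right). The six-sentence corpus is invented for illustration and is not the twitter data mentioned above. Cosine similarity is unchanged by rescaling a vector, so raw counts and normalized counts give the same answer.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each word to a Counter over its (left, right) context tuples."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        # Pad so that words at the sentence edges still get a full context.
        toks = ["<s>"] * window + sent.lower().split() + ["</s>"] * window
        for i in range(window, len(toks) - window):
            left = tuple(toks[i - window:i])
            right = tuple(toks[i + 1:i + 1 + window])
            vectors[toks[i]][(left, right)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[ctx] for ctx, count in u.items() if ctx in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus: "great" and "good" fill the same slots, "bad" does not.
corpus = [
    "that was great fun", "that was good fun",
    "a great day", "a good day",
    "that was bad luck", "a bad idea",
]
vecs = context_vectors(corpus)
print(cosine(vecs["great"], vecs["good"]), cosine(vecs["great"], vecs["bad"]))
```

On a corpus this small the contrast is artificially clean; on real data "bad" shares many contexts with "good", which is part of why the propagation step has to cope with noise.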
In the paper, they do this for all (heuristically filtered) n-grams over a very large number of webpages. (They do not give many details about deriving the phrases, counts, and similarities; they note that it is a big and complex problem in its own right and say nothing more about it. I suspect they reused systems they already had access to. It sounds like CollocationDetection is involved, and CosineSimilarity is used to compare context vectors.)
Then they manually made a very small (100ish) set of positive and negative words. Then they extended that set by propagating polarity across the distributional similarity graph.
A standard approach for doing this is GraphPropagation, which was used to derive a polarity lexicon in a previous paper by some of the co-authors, by propagating polarity scores over the WordNet graph. Graph propagation is a PageRank-style algorithm where you matrix multiply lots of times to flow the weights everywhere. (A node's score is a discounted sum of incoming flow.) However, in this paper they argue that this doesn't work and instead propose LabelPropagation, which computes a new bipartite graph between seeds and other terms in the graph, where the edge's score is the maximum (product-discounted) score over all paths on the empirical distsim graph between the seed and the term. (At least, that's what I think it was. I thought this part of the paper wasn't written super clearly.) They argue this approach avoids some of the data noise problems in general-domain web data that they claim make graph propagation fail.
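Here is a sketch of the max-product-path idea as I understand it, not the paper's actual algorithm. It assumes edge weights are similarities in (0, 1], so extending a path can never raise its score; that lets a Dijkstra-style best-first search find the best path score from one seed to every reachable term. The graph and its weights are invented.

```python
import heapq


def best_path_scores(graph, seed):
    """Best product-of-weights path score from seed to each reachable node.

    graph: dict mapping node -> dict of neighbor -> similarity in (0, 1].
    """
    best = {seed: 1.0}
    heap = [(-1.0, seed)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale entry: a better path was already found
        for nbr, weight in graph.get(node, {}).items():
            cand = score * weight
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr))
    return best


# Invented toy similarity graph.
toy_graph = {
    "good": {"nice": 0.9, "okay": 0.5},
    "nice": {"good": 0.9, "lovely": 0.8, "okay": 0.7},
    "okay": {"good": 0.5, "nice": 0.7},
    "lovely": {"nice": 0.8},
}
scores = best_path_scores(toy_graph, "good")
print(scores)
```

Here "okay" ends up with roughly 0.63 (via "nice") rather than its direct 0.5 edge, because only the single best path counts. Many weak paths do not add up, which is presumably how this avoids the noise accumulation blamed on graph propagation.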
They do this separately for positive and negative seeds (two bipartite graphs are created), then compute final polarities for novel words as, basically, the difference between positive and negative strengths (summed across seed terms) on their nodes. Then there are some cutoffs and stuff to get the final new lexicon.
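A minimal sketch of that combination step, under my reading of it. The plain difference, the `cutoff` parameter and its value, and all the scores below are assumptions for illustration; the paper's exact scoring and thresholds may differ.

```python
def final_polarities(pos_scores, neg_scores, cutoff=0.1):
    """Combine per-seed strengths into one signed polarity per term.

    pos_scores / neg_scores: dict of term -> dict of seed -> path score.
    Terms whose absolute polarity falls below cutoff are dropped.
    """
    lexicon = {}
    for term in set(pos_scores) | set(neg_scores):
        pos = sum(pos_scores.get(term, {}).values())
        neg = sum(neg_scores.get(term, {}).values())
        polarity = pos - neg
        if abs(polarity) >= cutoff:
            lexicon[term] = polarity
    return lexicon


# Invented strengths for three candidate terms.
pos = {"lovely": {"good": 0.72, "great": 0.5}, "meh": {"good": 0.2}}
neg = {"lovely": {"bad": 0.1}, "meh": {"bad": 0.25}, "dreadful": {"bad": 0.8}}

lexicon = final_polarities(pos, neg)
print(sorted(lexicon))
```

In this toy run "meh" is dropped because its positive and negative strengths nearly cancel, while "lovely" and "dreadful" survive with opposite signs.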
Please see my (brendan's) presentation for more details.
They demonstrate many interesting words and phrases their system learns. It is very clear that a hand-built approach towards making a polarity lexicon would be very difficult to scale to the breadth of vocabulary people are using on the web. They provide *NO* precision or quality evaluation beyond raw counts of their lexicon size.
They also demonstrate that using their lexicon improves a sentiment classifier for product reviews.
Two members of the audience took issue with several technical details of their approach. The reasons they argued graph propagation didn't work were not clear to us.
It was also noted that the contributions of the paper in terms of better lexicons/sentiment analysis are a little tricky to tease apart. It's hard to tell whether their advances were due to the algorithmic improvements or the use of large amounts of data. At least, you would believe this objection if you were unsatisfied with the paper's claims about GP vs. LP.
It is interesting to compare this to the informal study Lindsay_2008 which induced a lexicon then tried to show it was useful for an external task of predicting election polls.