Tackstrom and McDonald, ECIR 2011. Discovering fine-grained sentiment with latent variable structured prediction models
Related Papers
[1] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. 2007. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[2] R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In Proc. ACL-2007.
[3] T. Wilson, J. Wiebe, and P. Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proc. EMNLP-2005.
Latest revision as of 00:12, 29 November 2011
Citation
O. Tackstrom and R. McDonald. 2011. Discovering fine-grained sentiment with latent variable structured prediction models. In Proceedings of ECIR-2011, pp 764–773, Dublin, Ireland.
Online Version
Discovering fine-grained sentiment with latent variable structured prediction models
Summary
This paper investigates the use of latent variable structured prediction models for fine-grained sentiment analysis in the common situation where only coarse-grained supervision is available. The authors show how sentence level sentiment labels can be effectively learned from document-level supervision using hidden conditional random fields (HCRFs). The authors show improvements over both lexicon and existing machine learning based approaches. They focus on sentence level sentiment analysis.
Method
The authors observe that there is a lot of data in the form of coarse-level annotations available on the web pertaining to consumer reviews of products, movies etc. However, fine-grained labeled data for sentiment is difficult to obtain across domains for supervised learning. Hence, the authors model finer-level information as latent variables making use of the freely available coarse level annotations, using hierarchical graphical models such as HCRFs.
Based on the observations about positive and negative reviews in documents, the authors model sentence level classifications as:
- Correlated with the observed document label and,
- Flexible enough to disagree when contextual evidence suggests otherwise.
Approach
They start with the supervised fine-to-coarse sentiment model described in McDonald et al., 2007.
Let <math> d </math> be a document consisting of <math> n </math> sentences, <math> \textbf{s} = (s_i)_{i=1}^{n} </math>. Let <math> \textbf{y}^{d} = (y^d, \textbf{y}^s) </math> denote the random variables that include the document level sentiment, <math> y^{d} </math>, and the sequence of sentence level sentiments, <math> \textbf{y}^{s} = (y^{s}_{i})_{i=1}^{n} </math>.
All random variables take values in <math> \{ POS, \; NEG, \; NEU \} </math> for positive, negative and neutral sentiment, respectively. The authors hypothesize that there is a sequential relationship between sentence sentiments and that the document sentiment is influenced by all sentences (and vice versa). A first order Markov property is assumed, according to which each sentence variable <math> y_{i}^{s} </math> is independent of all other variables, conditioned on the document variable <math> y^{d} </math> and its adjacent sentence variables, <math> y^{s}_{i-1}, y^{s}_{i+1} </math>.
The graphical model for this formulation is shown in the figure below:
[[File:hcrf_1.jpg]]
The figure above shows a graphical model with latent sentence level states. Dark grey nodes are observed variables and white nodes are unobserved. Light grey nodes are observed at training time. Dashed and dotted regions indicate the maximal cliques at position <math> i </math>.
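In the HCRF model above, the conditional probability of the observed variables is obtained by marginalizing over the posited hidden variables:
<math> p_{\theta}(y^{d}|\textbf{s}) = \sum_{\textbf{y}^s} p_{\theta}(y^{d}, \textbf{y}^s | \textbf{s}). </math>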
As indicated in the figure above, there are two maximal cliques at each position <math> i </math>: one involving only the sentence <math> s_i </math> and its corresponding latent variable <math> y_{i}^{s} </math>, and one involving the consecutive latent variables <math> y_{i-1}^{s}, y_{i}^{s} </math> and the document variable <math> y^{d} </math>.
The assignment of the document variable <math> y^{d} </math> is thus independent of the input <math> \textbf{s} </math>, conditioned on the sequence of latent sentence variables <math> \textbf{y}^{s} </math>. This distinction is important for learning predictive latent variables as it creates a bottleneck between the input sentences and the document label.
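The marginalization over latent sentence labels can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the clique scores `emit` and `trans`, the `"START"` pseudo-label for the first position, and all function names are hypothetical, and the numeric scores are made up. The forward recursion sums over all sentence label sequences in time linear in the number of sentences, and a brute-force enumeration is included only to check it on a toy document.

```python
import itertools
import math

LABELS = ["POS", "NEG", "NEU"]


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def joint_score(y_doc, y_sents, emit, trans):
    """Unnormalized log-score of one full assignment (y_doc, y_sents).

    emit[i][y]        : score of the clique (s_i, y_i^s)          (assumed form)
    trans[(d, a, b)]  : score of the clique (y^d=d, y_{i-1}^s=a, y_i^s=b)
    """
    prev = ["START"] + list(y_sents[:-1])
    return sum(
        emit[i][y] + trans[(y_doc, p, y)]
        for i, (p, y) in enumerate(zip(prev, y_sents))
    )


def log_marginal_unnorm(y_doc, emit, trans):
    """log of the sum over all sentence label sequences, via the forward recursion."""
    alpha = {b: trans[(y_doc, "START", b)] + emit[0][b] for b in LABELS}
    for i in range(1, len(emit)):
        alpha = {
            b: emit[i][b]
            + logsumexp([alpha[a] + trans[(y_doc, a, b)] for a in LABELS])
            for b in LABELS
        }
    return logsumexp(list(alpha.values()))


def doc_posterior(emit, trans):
    """p(y^d | s) for every document label, marginalizing over y^s."""
    log_z = {d: log_marginal_unnorm(d, emit, trans) for d in LABELS}
    norm = logsumexp(list(log_z.values()))
    return {d: math.exp(v - norm) for d, v in log_z.items()}


def doc_posterior_bruteforce(emit, trans):
    """Same quantity by explicit enumeration; only feasible for tiny documents."""
    totals = {d: 0.0 for d in LABELS}
    for d in LABELS:
        for ys in itertools.product(LABELS, repeat=len(emit)):
            totals[d] += math.exp(joint_score(d, ys, emit, trans))
    z = sum(totals.values())
    return {d: t / z for d, t in totals.items()}


# Toy, hand-picked scores for a three-sentence document (not from the paper).
emit = [
    {"POS": 2.0, "NEG": 0.1, "NEU": 0.5},
    {"POS": 0.3, "NEG": 1.5, "NEU": 0.4},
    {"POS": 1.0, "NEG": 0.2, "NEU": 0.9},
]
trans = {
    (d, a, b): (1.0 if b == d else 0.0) + (0.5 if a == b else 0.0)
    for d in LABELS
    for a in ["START"] + LABELS
    for b in LABELS
}

posterior = doc_posterior(emit, trans)
print(posterior)
```

Because the document label only interacts with the input through the sentence labels, the recursion is run once per candidate document label and the results are normalized at the end.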
While training HCRFs, the authors observed that hard estimation gave slightly better performance than MAP estimation of the parameters with respect to the marginal conditional log-likelihood of the observed variables, assuming a Normal prior distribution.
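Given the training set <math> D = \{ (\textbf{s}_{j}, y_{j}^{d}) \}_{j=1}^{m} </math> of document/document-label pairs, the parameter <math> \theta </math> is estimated for HCRFs as follows:
[[File:hcrf_obj_function.jpg]]
In <math> (1) </math>, the parameter <math> \theta </math> is estimated using the stochastic gradient descent algorithm for 75 iterations with a fixed step size <math> \eta </math>.
The Viterbi algorithm is used in equation <math> (2) </math> to predict the optimal assignment of <math> (y^{d}, \textbf{y}^{s}) </math>, in the same manner as in conditional random fields.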
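Joint decoding of the document label and the sentence labels can be sketched as one Viterbi pass per candidate document label, keeping the best-scoring result. This is a hedged sketch, not the paper's code: the score tables `emit` and `trans`, the `"START"` pseudo-label and the function name `viterbi_joint` are assumptions made for illustration, and the toy scores are invented.

```python
LABELS = ["POS", "NEG", "NEU"]


def viterbi_joint(emit, trans):
    """Return (score, y_doc, y_sents) maximizing the joint log-score.

    emit[i][y]        : assumed score of the clique (s_i, y_i^s)
    trans[(d, a, b)]  : assumed score of the clique (y^d=d, y_{i-1}^s=a, y_i^s=b)
    """
    best = None
    for d in LABELS:
        # delta[b]: best score of any label prefix that ends in label b.
        delta = {b: trans[(d, "START", b)] + emit[0][b] for b in LABELS}
        backptrs = []
        for i in range(1, len(emit)):
            new_delta, ptr = {}, {}
            for b in LABELS:
                a_best = max(LABELS, key=lambda a: delta[a] + trans[(d, a, b)])
                new_delta[b] = delta[a_best] + trans[(d, a_best, b)] + emit[i][b]
                ptr[b] = a_best
            delta = new_delta
            backptrs.append(ptr)
        # Follow back-pointers from the best final label.
        last = max(LABELS, key=lambda b: delta[b])
        path = [last]
        for ptr in reversed(backptrs):
            path.append(ptr[path[-1]])
        path.reverse()
        if best is None or delta[last] > best[0]:
            best = (delta[last], d, path)
    return best


# Toy, hand-picked scores for a three-sentence document (not from the paper).
emit = [
    {"POS": 2.0, "NEG": 0.1, "NEU": 0.5},
    {"POS": 0.3, "NEG": 1.5, "NEU": 0.4},
    {"POS": 1.0, "NEG": 0.2, "NEU": 0.9},
]
trans = {
    (d, a, b): (1.0 if b == d else 0.0) + (0.5 if a == b else 0.0)
    for d in LABELS
    for a in ["START"] + LABELS
    for b in LABELS
}

score, y_doc, y_sents = viterbi_joint(emit, trans)
print(y_doc, y_sents, score)
```

Looping over the three document labels keeps the inner dynamic program identical to first-order CRF decoding, at the cost of a constant factor of three.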
Experiments and Results
The authors constructed a large balanced corpus of consumer reviews from a range of domains.
Dataset
- A training set was created by sampling a total of 143,580 positive, negative and neutral reviews from five different domains: books, DVDs, electronics, music and video games.
- Document sentiment labels were obtained by labeling one and two star reviews as negative (NEG), three star reviews as neutral (NEU), and four and five star reviews as positive (POS).
- The total number of sentences is about 1.5 million.
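The star-rating rule used to derive document labels is simple enough to write down directly. The function name `stars_to_label` is hypothetical; the mapping itself follows the description above.

```python
def stars_to_label(stars: int) -> str:
    """Map a 1-5 star review rating to a document sentiment label."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "NEG"  # one and two star reviews
    if stars == 3:
        return "NEU"  # three star reviews
    return "POS"      # four and five star reviews


labels = [stars_to_label(s) for s in range(1, 6)]
print(labels)
```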
[[File:hcrf_results1.jpg]]
Tables 1 and 2 above show the distribution of sentence labels per category and the distribution of labels in the documents, respectively.
Results
The authors compare their approach against two baselines: VoteFlip, a vote-flip algorithm that uses a polarity lexicon (Wilson et al., 2005), and a state-of-the-art statistical approach, Document as Sentence (DaS), which trains a document classifier on the coarse-labeled training data but applies it to sentences independently at test time.
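The idea behind the DaS baseline can be illustrated with a small sketch: fit a classifier on whole documents with their document labels, then call it on each sentence separately. The classifier below is a tiny multinomial Naive Bayes with add-one smoothing, chosen only to keep the example self-contained; the source does not say which document classifier the authors used, and the class name, helper name and toy reviews are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal bag-of-words Naive Bayes (an assumed stand-in classifier)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            # Unnormalized log prior; the shared constant does not affect argmax.
            score = math.log(self.label_counts[label])
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / total)
            return score

        return max(self.label_counts, key=log_score)


def das_predict(classifier, sentences):
    """Apply a document-trained classifier to each sentence independently."""
    return [classifier.predict(sentence) for sentence in sentences]


# Toy training documents, each a list of sentences plus a document label.
train_docs = [
    (["great book", "loved it"], "POS"),
    (["terrible plot", "hated it"], "NEG"),
    (["it is a book", "it has pages"], "NEU"),
]
clf = TinyNaiveBayes().fit(
    [" ".join(sentences) for sentences, _ in train_docs],
    [label for _, label in train_docs],
)

predictions = das_predict(clf, ["loved it", "hated it", "it has pages"])
print(predictions)
```

Unlike the HCRF, this baseline has no notion of sentence order or of agreement between neighbouring sentences, since every sentence is scored in isolation.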
[[File:hcrf_results2.jpg]]
Table 3 shows results for each model in terms of sentence and document accuracy as well as <math> F_1 </math>-scores for each sentence sentiment category. In terms of sentence level accuracy, HCRF performs significantly better than the baselines when trained on a sufficiently large data set. Adding more training data also improves the document-level accuracy of the HCRF model.