Extracting Opinion Expressions with semi-Markov Conditional Random Fields

From Cohen Courses

Citation

 author    = {Yang, Bishan  and  Cardie, Claire},
 title     = {Extracting Opinion Expressions with semi-Markov Conditional Random Fields},
 booktitle = {Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning},
 month     = {July},
 year      = {2012},
 address   = {Jeju Island, Korea},
 publisher = {Association for Computational Linguistics},
 pages     = {1335--1345},

Online version

ACLWEB 2012

Summary

This paper proposes a segment-level sequence labeling technique using semi-Markov CRFs (semi-CRFs). The main focus of the paper is to identify two types of opinion expressions in the corpus: direct subjective expressions (DSEs) and expressive subjective expressions (ESEs). For example:

  1. "The International Committee of the Red Cross, [as usual], [has refused to make any statements]."
  2. "The Chief Minister [said that] [the demon they have reared will eat up their own vitals]."

Dataset

The MPQA 1.2 corpus (Wiebe et al., 2005) is used. It contains 535 news articles and 11,114 sentences; 55.89% of the sentences contain DSEs and 57.93% contain ESEs. 135 documents are used for training and 400 for testing.

Background

Previous work on sequence tagging in natural language processing has largely been limited to the token level.

Methodology

Semi-CRF

A sentence x is divided into a segmentation s = (s_1, ..., s_p), where s_j = (t_j, u_j, y_j) such that t_j is the start position of segment s_j, u_j is its end position, and y_j is the label of the segment. Segment length is limited to the maximum entity length seen in the corpus.

The segment feature function g(x, s, j) is a short representation of g(y_j, y_{j-1}, x, t_j, u_j). The conditional probability of a segmentation s given a sequence x is defined as

  P(s|x) = exp(w . G(x, s)) / Z(x),

where G(x, s) = sum_j g(x, s, j) and Z(x) = sum over all possible segmentations s' of exp(w . G(x, s')).
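The probability definition can be made concrete on a toy example. For a very short sequence, every segmentation can be enumerated, scored, and normalized by the partition function Z(x). This is a hedged illustration only: the `score` function below is an invented stand-in for w . G(x, s), not the paper's learned model.

```python
import math

LABELS = ["O", "DSE"]  # toy label set for the demo

def all_segmentations(n, max_len):
    """Yield every list of (start, end, label) triples covering tokens 0..n-1."""
    def splits(pos):
        if pos == n:
            yield []
            return
        for end in range(pos, min(pos + max_len, n)):
            for label in LABELS:
                for rest in splits(end + 1):
                    yield [(pos, end, label)] + rest
    yield from splits(0)

def score(seg):
    """Invented stand-in for w . G(x, s): reward one DSE span, small bonus for O tokens."""
    total = 0.0
    for start, end, label in seg:
        if label == "DSE" and (start, end) == (1, 2):
            total += 2.0
        elif label == "O" and start == end:
            total += 0.5
        else:
            total -= 1.0
    return total

segs = list(all_segmentations(3, max_len=3))
Z = sum(math.exp(score(s)) for s in segs)   # partition function Z(x)
best = max(segs, key=score)                 # most probable segmentation
p_best = math.exp(score(best)) / Z          # P(s | x)
```

Enumerating Z(x) this way is only feasible for tiny inputs; in practice the semi-CRF computes it with dynamic programming.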

The correct segmentation s of a sentence is defined as a sequence of entity segments (DSE or ESE) and non-entity segments (unit-length segments that carry no opinion label).
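Decoding the best segmentation under a semi-CRF uses a Viterbi-style dynamic program over segment lengths. The sketch below is a generic semi-Markov Viterbi decoder, not the authors' implementation; the segment scoring function is passed in, and the toy score used here is invented for illustration.

```python
def semi_crf_decode(n, labels, score, max_len):
    """Return the highest-scoring segmentation of tokens 0..n-1 as
    (start, end, label) triples. `score(start, end, label, prev_label)`
    scores one candidate segment given the previous segment's label."""
    NEG = float("-inf")
    # best[i][y]: (best score of a segmentation of tokens 0..i-1 whose last
    # segment has label y, backpointer (segment start, previous label))
    best = [{y: (NEG, None) for y in labels} for _ in range(n + 1)]
    for y in labels:
        best[0][y] = (0.0, None)
    for i in range(1, n + 1):
        for y in labels:
            for length in range(1, min(max_len, i) + 1):
                start = i - length
                for y_prev in labels:
                    prev_score, _ = best[start][y_prev]
                    if prev_score == NEG:
                        continue
                    cand = prev_score + score(start, i - 1, y, y_prev)
                    if cand > best[i][y][0]:
                        best[i][y] = (cand, (start, y_prev))
    # trace backpointers from the best final label
    y = max(labels, key=lambda lab: best[n][lab][0])
    segments, i = [], n
    while i > 0:
        _, (start, y_prev) = best[i][y]
        segments.append((start, i - 1, y))
        i, y = start, y_prev
    return segments[::-1]

def toy_score(start, end, label, prev_label):
    """Invented segment scorer for the demo."""
    if label == "DSE" and (start, end) == (1, 2):
        return 3.0
    if label == "O" and start == end:
        return 1.0
    return -1.0

decoded = semi_crf_decode(4, ["O", "DSE"], toy_score, max_len=4)
```

Note the runtime is O(n * max_len * |labels|^2), which is why the original semi-CRF bounds the segment length.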

Extended Semi-CRF for Opinion Expression Extraction

The objective is to learn the entity boundaries and labels for opinion expression extraction.

  • First, the segment length is not fixed to the maximum length observed among entities; it is unbounded, so candidate segments of any length are allowed.
  • Second, the segment units are generated from the sentence's parse tree.

[Figure: example sentence parse tree used to generate candidate segments]

  • Segment Construction Algorithm

[Figure: segment construction algorithm] The validation function returns true if the parent nodes of the candidate segments have the same rightmost child in their subtrees; otherwise it returns false. The generated candidate segments are then validated using this function.
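One simple way to derive candidate segments from a parse tree is to take every constituent's token span as a candidate. This is a hedged sketch of that idea under a Penn-style bracketed parse string, not the paper's exact segment construction algorithm.

```python
def constituent_spans(parse):
    """Return sorted (start, end) token spans of all constituents in a
    Penn-style bracketed parse, e.g. "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"."""
    spans = []
    stack = []            # token index where each open constituent started
    pos = 0               # running token counter
    expect_label = False  # the token right after "(" is a constituent label
    for tok in parse.replace("(", " ( ").replace(")", " ) ").split():
        if tok == "(":
            stack.append(pos)
            expect_label = True
        elif tok == ")":
            start = stack.pop()
            spans.append((start, pos - 1))
        elif expect_label:
            expect_label = False  # constituent/POS label, not a word
        else:
            pos += 1              # a word token
    return sorted(set(spans))

candidates = constituent_spans("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
```

Each returned span can then be proposed as a candidate opinion segment and filtered by a validation step like the one described above.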

TBD

Features:

Experimental Results

Evaluation Metrics

  • Binary overlap: a predicted expression is counted as correct if it overlaps any correct expression.
  • Proportional overlap: only the proportion of the predicted expression that overlaps a correct expression is counted as correct.
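The two metrics can be sketched as precision computations over token spans. This is my reading of the definitions above (inclusive `(start, end)` spans), not the authors' evaluation script.

```python
def overlap(a, b):
    """Number of tokens shared by two inclusive (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def binary_precision(predicted, gold):
    """Binary overlap: a predicted span is correct if it overlaps any gold span."""
    hits = sum(1 for p in predicted if any(overlap(p, g) > 0 for g in gold))
    return hits / len(predicted) if predicted else 0.0

def proportional_precision(predicted, gold):
    """Proportional overlap: each predicted span is credited with the
    fraction of its tokens that fall inside some gold span."""
    total = 0.0
    for p in predicted:
        length = p[1] - p[0] + 1
        covered = sum(overlap(p, g) for g in gold)
        total += min(covered, length) / length
    return total / len(predicted) if predicted else 0.0
```

Recall is defined symmetrically by swapping the roles of the predicted and gold spans; binary overlap is the more lenient of the two metrics.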

Baseline Methods

  • A token-level CRF approach (Breck et al., 2007) is used as the baseline on the MPQA dataset.
  • Two variations of the standard CRF are used. The first, segment-CRF, treats the segment units obtained from the parser as word tokens. The second, syntactic-CRF, encodes segment-level syntactic information as input features in a standard token-level CRF.
  • The semi-CRF model (Sarawagi and Cohen, 2004) is also used as a baseline.

Results

  • The extended semi-CRF is labeled as new-semi-CRF.

[Figures: semi-CRF results tables]

  • Comparison with previous work.

[Figure: comparison with previous work]

Discussion

The extended semi-CRF approach outperforms the original semi-CRF (Sarawagi and Cohen, 2004). Compared to the CRF, however, it has lower precision and higher recall: the approach predicts nearly twice as many DSEs as the CRF, which raises recall at the cost of precision. Overall, the F-measure improves over the CRF.

The authors propose adding new features and better modeling of the surrounding context to improve performance. One should note that semi-CRFs take longer to train and validate than CRFs: the proposed approach took 2.25 hours in total on 11,114 sentences (about 2 hours for parsing with the Stanford Parser and 15 minutes for training) on a machine with 4GB RAM and an Intel Core 2 Duo CPU.

Study Plan

This paper uses semi-CRFs for the labeling task, so the reader should first read about semi-CRFs (Sarawagi and Cohen, 2004).