Difference between revisions of "Extracting Opinion Expressions with semi-Markov Conditional Random Fields"

Revision as of 23:01, 1 October 2012

Citation

 author    = {Yang, Bishan  and  Cardie, Claire},
 title     = {Extracting Opinion Expressions with semi-Markov Conditional Random Fields},
 booktitle = {Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning},
 month     = {July},
 year      = {2012},
 address   = {Jeju Island, Korea},
 publisher = {Association for Computational Linguistics},
 pages     = {1335--1345},

Online version

ACLWEB 2012

Summary

This paper proposes a segment-level sequence labeling technique using semi-CRFs. The main focus of the paper is to identify two types of opinion expressions in the corpus: direct subjective expressions (DSEs) and expressive subjective expressions (ESEs). For example:

  1. "The International Committee of the Red Cross, [as usual]_ESE, [has refused to make any statements]_DSE."
  2. "The Chief Minister [said]_DSE that [the demon they have reared will eat up their own vitals]_ESE."

Dataset

The MPQA 1.2 corpus (Wiebe et al., 2005) is used. It contains 535 news articles and 11,114 sentences; 55.89% of the sentences contain DSEs and 57.93% contain ESEs. 135 documents are used for training and 400 are used for testing.

Background

Previous work on sequence tagging in natural language processing has largely been limited to the token level.

Methodology

Semi-CRF

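A sentence x is divided into segments <math>s = \langle s_1,...,s_n \rangle</math>, where <math>s_i = (t_i, u_i, y_i)</math>: <math>t_i</math> is the start position of segment <math>s_i</math>, <math>u_i</math> is its end position, and <math>y_i</math> is its label. Segment length is limited to the maximum length seen in the corpus. The feature function is <math>g(x,s,i) = g(x,t_i,u_i,y_i,y_{i-1})</math>. The conditional probability of a segmentation s given a sequence x is defined as <math>p(s \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{i}\sum_{k} \lambda_k g_k(x,s,i)\right)</math>, where <math>\lambda_k</math> are the feature weights and <math>Z(x)</math> normalizes over all segmentations of x.

The correct segmentation s of a sentence is defined as a sequence of entity segments (DSE or ESE) and non-entity segments (unit-length segments that are ignored).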
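
To make the semi-CRF definition concrete, the following is a minimal sketch, assuming the standard semi-CRF form in which the probability of a segmentation is its exponentiated total segment score divided by a normalizer Z(x). The scoring function and its weights are invented for illustration and are not the paper's features. The sketch computes Z(x) with a semi-Markov forward recursion and checks it against brute-force enumeration on a toy sentence, with non-entity ("O") segments restricted to unit length as described above.

```python
import math

LABELS = ("DSE", "ESE", "O")  # "O" marks a non-entity segment


def score(x, t, u, y, y_prev):
    """Toy stand-in for sum_k lambda_k * g_k(x, t, u, y, y_prev).

    The weights below are hypothetical and chosen only for illustration.
    """
    if y == "O":
        return 0.0
    s = 0.5 * (u - t + 1)  # longer entity segments score higher
    if y == "DSE" and x[t] in {"said", "has"}:
        s += 1.0
    if y == y_prev:
        s -= 0.3
    return s


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def segmentations(n, max_len, start=0):
    """Yield every segmentation of positions start..n-1 as a list of (t, u, y)."""
    if start == n:
        yield []
        return
    for length in range(1, min(max_len, n - start) + 1):
        for y in LABELS:
            if y == "O" and length > 1:
                continue  # non-entity segments are unit length
            for rest in segmentations(n, max_len, start + length):
                yield [(start, start + length - 1, y)] + rest


def total_score(x, seg):
    total, y_prev = 0.0, None
    for t, u, y in seg:
        total += score(x, t, u, y, y_prev)
        y_prev = y
    return total


def log_partition(x, max_len):
    """Forward recursion: alpha[j][y] is the log-sum of scores of all
    segmentations of x[:j] whose last segment has label y."""
    n = len(x)
    alpha = [{None: 0.0}] + [dict() for _ in range(n)]
    for j in range(1, n + 1):
        for y in LABELS:
            terms = []
            for d in range(1, min(max_len, j) + 1):
                if y == "O" and d > 1:
                    continue
                for y_prev, a in alpha[j - d].items():
                    terms.append(a + score(x, j - d, j - 1, y, y_prev))
            alpha[j][y] = logsumexp(terms)
    return logsumexp(list(alpha[n].values()))


def segment_probability(x, seg, max_len):
    return math.exp(total_score(x, seg) - log_partition(x, max_len))


x = ["The", "minister", "said", "so"]
gold = [(0, 0, "O"), (1, 1, "O"), (2, 2, "DSE"), (3, 3, "O")]
print(segment_probability(x, gold, max_len=3))
```

The forward recursion sums over segment lengths as well as labels, which is the only change from the token-level CRF forward pass; its cost grows linearly with the maximum segment length.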

Extended Semi-CRF for Opinion Expression Extraction

The objective is to learn the entity boundaries and labels for opinion expression extraction.

  • First, the segment length is not capped at the maximum entity length observed in the corpus; it is unbounded, so segment candidates of any length are allowed.
  • Second, the segment units are generated from the sentence's parse tree.
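
As a rough illustration of deriving spans from a parse tree, the sketch below collects the token span of every constituent in a small hand-written tree. This is not the paper's Segment Construction Algorithm; the nested-tuple tree encoding, the function name constituent_spans, and the example parse are all assumptions made for this example. It only shows that parse-aligned spans can be arbitrarily long while still being far fewer than all possible token spans.

```python
def constituent_spans(tree, start=0):
    """Return (end, spans) for a nested-tuple parse tree.

    A tree is (label, child, child, ...), where a child is either a token
    string or another tree. `end` is the exclusive index of the token after
    this subtree; `spans` lists (t, u, label) with inclusive positions.
    """
    label, children = tree[0], tree[1:]
    spans, pos = [], start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a leaf consumes one token
        else:
            pos, child_spans = constituent_spans(child, pos)
            spans.extend(child_spans)
    spans.append((start, pos - 1, label))
    return pos, spans


# Hypothetical parse of "The Minister said that they left".
tree = ("S",
        ("NP", ("DT", "The"), ("NNP", "Minister")),
        ("VP", ("VBD", "said"),
         ("SBAR", ("IN", "that"),
          ("S", ("NP", ("PRP", "they")),
           ("VP", ("VBD", "left"))))))

n_tokens, spans = constituent_spans(tree)
candidates = sorted({(t, u) for t, u, _ in spans})
print(candidates)
```

Spans such as (2, 5) for the full verb phrase are kept even though they would exceed a small fixed length cap, while spans that cross constituent boundaries, such as (1, 2), never appear.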

ParseTree.png

  • Segment Construction Algorithm

SegmentAlgorithm.png

Function returns true if parent node of have the same rightmost child in their subtrees, otherwise it returns false. The above generated candidate segments are then validated using.

TBD

Features:

Experimental Results

  • The token-level CRF-based approach of Breck et al. (2007) is used as the baseline on the MPQA dataset.
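
To show how a token-level tagger sees the same data, the sketch below converts segment annotations into one tag per token. The BIO tag scheme and the function name segments_to_bio are assumptions for illustration and may differ from the encoding used by the baseline; the span positions and labels in the example are likewise assumed.

```python
def segments_to_bio(n_tokens, segments):
    """Convert entity segments (t, u, label), with inclusive positions,
    into per-token BIO tags. Tokens outside every segment get "O"."""
    tags = ["O"] * n_tokens
    for t, u, label in segments:
        tags[t] = "B-" + label
        for i in range(t + 1, u + 1):
            tags[i] = "I-" + label
    return tags


tokens = ("The Chief Minister said that the demon they have reared "
          "will eat up their own vitals").split()
# Assumed annotation: "said" as a DSE, "the demon ... vitals" as an ESE.
segments = [(3, 3, "DSE"), (5, 15, "ESE")]

for token, tag in zip(tokens, segments_to_bio(len(tokens), segments)):
    print(token, tag)
```

A token-level model scores each of these tags using features of single tokens and adjacent tag pairs, whereas the segment-level model can use features of a whole span such as the eleven-token ESE here.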

Study Plan

This paper uses semi-CRFs for the labeling task, so the reader should first read about semi-CRFs.