Semantic Role Labeling with CRFs

 
== Citation ==

Trevor Cohn, Philip Blunsom, "Semantic Role Labeling with Conditional Random Fields", CoNLL 2005

== Online version ==

[http://acl.ldc.upenn.edu/W/W05/W05-06.pdf#page=183 Click here to download]
== Introduction ==
This [[Category::paper]] addresses [[AddressesProblem::Semantic Role Labeling]] (SRL) of sentences using [[UsesMethod::Conditional Random Fields]]. It was the first attempt at solving SRL with a CRF. The authors defined the CRF over the tree structure of the sentence's syntactic parse, rather than over the linear sentence structure as is usually done for tasks such as Named Entity Recognition or Part-of-Speech tagging. The motivation came from the very nature of semantic role labeling: the task is to label phrases with their semantic roles with respect to a particular constituent of the sentence, the predicate or verb. The authors conjectured that, for this reason, a linear-chain CRF was not an intuitive model for SRL.
The problem of SRL is usually broken into two parts: identifying candidate phrases for assigning semantic roles, and predicting the semantic role to be assigned to the identified phrase. The approach in this paper does both these things in a single pass over the syntactic tree structure.
  
==Dataset Used==
The dataset used was the [[UsesDataset::PropBank]] corpus, which is the Penn Treebank corpus with semantic role annotation.
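As a rough illustration of that annotation style (my own example sentence, not one drawn from the corpus), a PropBank-style labeling brackets each argument of a predicate:
<pre>
[A0 The judge] [V scheduled] [A1 the hearing] [AM-TMP for Monday]
</pre>
Here A0 marks the agent, A1 the theme, and AM-TMP a temporal adjunct of the verb "scheduled".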
  
==CRF Model==
The CRF was defined over the tree structure of the sentence as:<br>
     
[[File:crf_coh.jpg]]
where <math>C</math> is the set of cliques in the observation tree, <math>\lambda_k</math> are the model's parameters, and <math>f</math> is a feature function that maps a clique and its labeling to a vector of scalar values.<br>
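In standard log-linear notation, the equation in the image is presumably of the form (a reconstruction from the definitions just given, not a verbatim copy from the paper):<br>
<math>p(y|x) = \frac{1}{Z(x)} \exp \sum_{c \in C} \sum_{k} \lambda_k f_k(c, y_c, x)</math><br>
where <math>Z(x)</math> is the partition function normalizing over all possible labelings of the tree.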
The cliques considered were single-node cliques (a single node in the syntactic tree) and two-node cliques (a parent node and its child). The CRF model can thus be restated as<br>
[[File:crf_coh_alt.jpg]]
  
where the feature function <math>f</math> is divided into a single-node feature function <math>g</math> and a two-node feature function <math>h</math>.
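Under that decomposition, the restated model would presumably read (again a hedged reconstruction, writing <math>V</math> for the tree's nodes and <math>E</math> for its parent-child edges):<br>
<math>p(y|x) = \frac{1}{Z(x)} \exp \left( \sum_{v \in V} \sum_k \lambda_k\, g_k(v, y_v, x) + \sum_{(v,w) \in E} \sum_k \lambda_k\, h_k(v, w, y_v, y_w, x) \right)</math>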
  
==Features Used==
As the cliques considered are single-node and two-node cliques, features were defined both for single nodes and for parent-child pairs. Many syntactic features were used; I will not describe each of them here, since they are detailed in the paper. The syntactic feature types were turned into binary functions <math>g</math> and <math>h</math> by combining each (feature type, feature value) pair with a label (for single-node cliques) or a label pair (for two-node cliques), whenever that (feature type, feature value) pair was seen at least once in the training data.<br>
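A minimal sketch of this binarization in Python (hypothetical names and data layout, my own assumptions; the paper gives no code):
<pre>
def build_binary_features(training_nodes):
    # training_nodes: (feature_dict, label) pairs, where feature_dict
    # maps a feature type (e.g. 'head_word') to its value at that node.
    # One binary feature is kept per (type, value, label) combination
    # seen at least once in the training data.
    features = set()
    for feats, label in training_nodes:
        for ftype, fvalue in feats.items():
            features.add((ftype, fvalue, label))
    return features

def g(feature, feats, label):
    # Single-node indicator: fires iff the node carries the
    # (type, value) pair and bears the given label.
    ftype, fvalue, flabel = feature
    return 1.0 if label == flabel and feats.get(ftype) == fvalue else 0.0
</pre>
A two-node function <math>h</math> would be built analogously, keyed on (feature type, feature value, parent label, child label) tuples.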
The different feature types used were:<br>
<b>Basic features</b>: {Head word, head PoS, phrase syntactic category, phrase path, position relative to the predicate, surface distance to the predicate, predicate lemma, predicate token, predicate voice, predicate sub-categorisation, syntactic frame}.<br>
<b>Context features</b>: {Head word of the first NP in a prepositional phrase, left and right sibling head words and syntactic categories, first and last word in the phrase yield and their PoS, parent syntactic category and head word}.<br>
<b>Common ancestor of the verb</b>: The syntactic category of the deepest shared ancestor of the verb and the node.<br>
<b>Feature conjunctions</b>: The following features were conjoined: {predicate lemma + syntactic category, predicate lemma + relative position, syntactic category + first word of the phrase}.<br>
<b>Default feature</b>: This feature is always on, which allows the classifier to model the prior probability distribution over the possible argument labels.<br>
<b>Joint features</b>: These features were only defined over pair-wise cliques: {whether the parent and child head words match, parent syntactic category + child syntactic category, parent relative position + child relative position, parent relative position + child relative position + predicate PoS + predicate lemma}.
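Putting the model and the features together, the sketch below shows how one candidate labeling of a parse tree would be scored (reusing <code>g</code> from the sketch above; all names and data layouts are my own assumptions, not the authors' code):
<pre>
def h(feature, pair_feats, parent_label, child_label):
    # Two-node indicator over a parent-child clique.
    ftype, fvalue, plab, clab = feature
    return 1.0 if ((parent_label, child_label) == (plab, clab)
                   and pair_feats.get(ftype) == fvalue) else 0.0

def tree_score(tree, labeling, weights_g, weights_h):
    # Unnormalized log score of one labeling of the observation tree.
    # tree: dict node_id -> {'feats': {...}, 'children': [...]}.
    # weights_g / weights_h: dicts mapping a feature to its lambda_k.
    # p(y|x) would be exp(score) / Z(x), with Z(x) computed by
    # dynamic programming (belief propagation) over the tree.
    score = 0.0
    for node, info in tree.items():
        label = labeling[node]
        for feat, lam in weights_g.items():
            score += lam * g(feat, info['feats'], label)
        for child in info['children']:
            # joint features of the pair, e.g. parent and child head words
            pair_feats = dict(info['feats'])
            pair_feats.update(('child_' + k, v)
                              for k, v in tree[child]['feats'].items())
            for feat, lam in weights_h.items():
                score += lam * h(feat, pair_feats, label, labeling[child])
    return score
</pre>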
  
==Experimental Results and Conclusion==
The parsed training sentences yielded 90,388 predicates and 1,971,985 binary features (<math>g</math> and <math>h</math>). The precision, recall, and F-score results are shown in the table below.<br>
[[File:cohn_results.jpg]]
<br><br>Although the modeling of the problem is neat, the reported results were not on par with the best systems that competed in the CoNLL shared task. Màrquez et al. in their [[Semantic_Role_Labeling_as_Sequential_Tagging|paper]] showed that modeling the SRL problem as a sequential BIO-tagging problem still gives far better results. They made use of a combination of deep and shallow syntactic features and used a boosting technique for the BIO-tagging.
== Comments ==
Any ideas why their approach doesn't work as well as BIO tagging?  That is an interesting result.  --[[User:Brendan|Brendan]] 18:55, 13 October 2011 (UTC)
----
'''Response to the Comment''' (by [[User:manajs|Manaj]])
Well, I guess this might have something to do with the nature of the problem and the approach taken. Modeling a sequential labeling problem (such as SRL) with a CRF should give good results when the CRF is defined over sequential structures. Here, however, the CRF is defined over syntactic tree structures. The authors thought this would make sense, since the arguments in SRL are always relative to a predicate (verb), and the features generally used are syntactic features. However, it turned out that the results were not as strong as those of many other techniques, including SVMs (see [http://www.cemantix.org/papers/pradhan-hlt-2004-a.pdf this] or [http://www.lsi.upc.edu/~srlconll/st05/papers/intro.pdf this]). This leads me to think that a CRF over tree structures might not be a good representation of the problem itself. As for the comparison with the sequential BIO-tagging in [[Semantic Role Labeling as Sequential Tagging]], the most likely reasons the latter significantly outperformed the former are its combination of features (syntactic features and chunk features) and its use of AdaBoost with decision trees. It's worth mentioning that [http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf this] empirical study found decision trees to be among the better-performing models.
