Joint Inference in Information Extraction

== Citation ==
Hoifung Poon & Pedro Domingos, "Joint Inference in Information Extraction", 22nd National Conference on Artificial Intelligence (AAAI), 2007.

== Online version ==
Click here to download

== Introduction ==
This [[Category::paper]] aims at solving the problem of [[AddressesProblem::Citation Matching]]. <br>The approach makes use of [[UsesMethod::Markov Logic Networks]] (MLNs). An MLN can be thought of as an extension of a rule-based approach in which every rule carries a weight. The nodes in an MLN are ground atoms, and the edges are ground clauses (see the method page for details); each clause has a weight associated with it. Each state x of the MLN has probability <br><math>P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)</math>, where Z is a normalization constant, <math>w_i</math> is the weight of the i-th clause, and <math>f_i(x) = 1</math> if the i-th clause is true in x and 0 otherwise. The weights are learned by iteratively optimizing pseudo-likelihood on a training database/corpus. One way to think about MLNs is as an amalgamation of probabilistic graphical models and first-order logic.<br>
The authors defined rules for segmenting each citation into certain entities (the title of the publication, the authors, and the venue) and for identifying those entities, and represented these rules as an MLN. Three variants of inference were tried: isolated inference for segmentation; joint inference for segmentation, in which the inferred segmentation of one citation takes part in inferring that of another; and joint inference of segmentation and entity resolution (recognition). The results were compared with existing baselines and were found to outperform them, although by a meager margin.
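To make the distribution above concrete, here is a minimal sketch (not from the paper) that enumerates every possible world of a tiny MLN with two ground atoms and two weighted ground clauses and computes <math>P(x)</math> exactly by brute force; the atoms, clauses, and weights are invented purely for illustration.
<pre>
import itertools, math

# Toy MLN with two ground atoms and two weighted ground clauses.
# Atom names echo the predicates used in this article; weights are hypothetical.
atoms = ["InField_1_Title_c1", "SameCitation_c1_c2"]

def clause_values(world):
    """f_i(x): truth value (0/1) of each ground clause in the given world."""
    a, b = world  # truth values of the two atoms
    return [
        1 if (not a) or b else 0,  # clause 1: atom1 => atom2
        1 if a else 0,             # clause 2: atom1 (unit clause)
    ]

weights = [1.5, 0.8]  # w_i: one weight per clause (illustrative values)

# P(x) = exp(sum_i w_i * f_i(x)) / Z, where Z sums over all possible worlds.
worlds = list(itertools.product([False, True], repeat=len(atoms)))
scores = [math.exp(sum(w * f for w, f in zip(weights, clause_values(x)))) for x in worlds]
Z = sum(scores)
for x, s in zip(worlds, scores):
    print(dict(zip(atoms, x)), "P =", round(s / Z, 4))
</pre>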
 
  
 
==Dataset Used==
The datasets used were the [[UsesDataset::Standard Citation Datasets]] CiteSeer and Cora. These datasets contain a number of citations, with duplicate citations clustered together.
  
 
==MLN for Joint Citation Matching==
The main predicate used in the MLN is <i>Token(t, i, c)</i>, which is true iff token t appears in the i-th position of the c-th citation. A token can be a word, date, number, etc. Punctuation marks are not treated as separate tokens; rather, the predicate <i>HasPunc(c, i)</i> is true iff a punctuation mark appears immediately after the i-th position in the c-th citation. Two "query" predicates are used (query predicates are the ones whose truth values are to be inferred):<br>
<i>InField(i, f, c)</i>, which is true iff the i-th position of the c-th citation is in field f, where f <math>\in</math> {Title, Author, Venue}, and<br>
<i>SameCitation(c, c′)</i>, which is true iff citations c and c′ represent the same publication; inferring this predicate performs entity resolution.
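As a concrete (hypothetical) illustration of what the evidence predicates look like once grounded, the sketch below builds Token(t, i, c) and HasPunc(c, i) atoms from a raw citation string; the tokenization is a simplification assumed here for the example, not the paper's exact preprocessing.
<pre>
import re

def ground_evidence(citation_id, text):
    """Produce ground Token(t, i, c) and HasPunc(c, i) atoms for one citation.
    Punctuation is not a token itself; it only sets HasPunc for the preceding
    position, as described in the section above."""
    atoms = []
    position = 0
    # Each match is a token optionally followed by trailing punctuation characters.
    for match in re.finditer(r"(\w+)([^\w\s]*)", text):
        token, trailing_punc = match.group(1).lower(), match.group(2)
        atoms.append(f"Token({token}, {position}, {citation_id})")
        if trailing_punc:
            atoms.append(f"HasPunc({citation_id}, {position})")
        position += 1
    return atoms

for atom in ground_evidence("C1", "R. Schapire. On the strength of weak learnability."):
    print(atom)
</pre>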
  
====Isolated Segmentation Model====
The first model the authors tried was to segment the citations without any kind of joint inference. Their segmentation model is essentially an HMM, in which the observation matrix and the transition matrix are defined by logical formulas. The observation matrix is defined by the formula:<br>
<i>Token(+t, i, c) ⇒ InField(i, +f, c)</i><br>
where all free variables are implicitly universally quantified. The “+t, +f” notation signifies that the MLN contains an instance of this rule for each (token, field) pair. If this rule were learned in isolation, the weight of the (t, f)-th instance would be <math>\log\left(p_{tf}/(1-p_{tf})\right)</math>, where <math>p_{tf}</math> is the corresponding entry in the HMM observation matrix.
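A minimal sketch of that weight computation (with probabilities invented for illustration): converting an HMM observation probability into the corresponding isolated-rule weight is just the log-odds transform.
<pre>
import math

def rule_weight(p_tf):
    """Weight of the (t, f)-th observation rule if learned in isolation: log-odds of p_tf."""
    return math.log(p_tf / (1.0 - p_tf))

# Hypothetical observation probabilities P(token | field), for illustration only.
for token, field, p in [("learnability", "Title", 0.02),
                        ("schapire", "Author", 0.15),
                        ("proceedings", "Venue", 0.30)]:
    print(f"w(Token({token}) => InField({field})) = {rule_weight(p):+.3f}")
</pre>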
The transition matrix, on the other hand, is defined by the formula:<br>
<i>InField(i, +f, c) ⇒ InField(i + 1, +f′, c)</i><br>
Token boundaries are taken into account by only extending a field past a position that is not followed by punctuation:<br>
<i>InField(i, +f, c) ∧ ¬HasPunc(c, i) ⇒ InField(i + 1, +f, c)</i><br>
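To show how a formula with “+f” grounds out into per-field clauses (one rule instance, and one learned weight, per field), the sketch below enumerates the ground instances of the boundary rule above for a toy citation; the toy citation length is an assumption made for the example.
<pre>
FIELDS = ["Author", "Title", "Venue"]

def ground_boundary_rule(citation_id, num_positions):
    """Ground clauses of: InField(i, +f, c) ∧ ¬HasPunc(c, i) ⇒ InField(i+1, f, c).
    The '+f' means the MLN holds one instance (and one weight) per field f."""
    clauses = []
    for f in FIELDS:
        for i in range(num_positions - 1):
            clauses.append(
                f"InField({i}, {f}, {citation_id}) ∧ ¬HasPunc({citation_id}, {i}) "
                f"⇒ InField({i + 1}, {f}, {citation_id})"
            )
    return clauses

for clause in ground_boundary_rule("C1", 4):
    print(clause)
</pre>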
In addition to the above rules, the following rules were also used: the first two positions of a citation are usually in the author field, and the middle one is usually in the title; initials (e.g., “J.”) tend to appear in either the author or the venue field; positions preceding the last non-venue initial are usually not part of the title or venue; and positions after the first venue keyword (e.g., “Proceedings”, “Journal”) are usually not part of the author or title.
====Entity Resolution Model====
The entity resolution model contains rules of the form: if two fields contain many common tokens, they are the same; if the fields of two citations match, the citations also match; and vice-versa. Simply taking the output InField() predicates of the segmentation MLN as evidence to this MLN would constitute a standard pipeline model. Merging the two MLNs instead produces a joint model for segmentation and entity resolution.<br>
In the joint model, however, entity resolution also affects segmentation, which can be a problem: since only a small fraction of citation pairs (c, c′) match, in the absence of strong evidence to the contrary the MLN will conclude that SameCitation(c, c′) is false. If SameCitation(c, c′) is the consequent of a rule (or rule chain) with InField() in the antecedent, the MLN may infer that InField() is false, even if segmentation alone would correctly predict it to be true.<br>
Therefore, the authors defined additional rules that do not simply take InField() as an antecedent.<br>
A predicate <i>SimilarTitle(c, i, j, c′, i′, j′)</i> was defined, which is true if citations c and c′ contain similar title-like strings at positions i to j and i′ to j′, respectively. A string is title-like if it does not contain punctuation and does not match the “title exclusion” rules defined in the isolated segmentation model above.
The authors held that if two citations have similar titles and their venues are also similar, they represent the same publication. Hence, the rule:<br>
<i>SimilarTitle(c, i, j, c′, i′, j′) ∧ SimilarVenue(c, c′) ⇒ SameCitation(c, c′)</i><br>
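The summary above does not spell out the string-similarity test behind SimilarTitle and SimilarVenue, so the sketch below uses a simple token-overlap (Jaccard) score as a stand-in; the measure, the threshold, and the example strings are assumptions for illustration, not the paper's actual definition.
<pre>
def token_set(text):
    return {t.lower().strip(".,()") for t in text.split() if t.strip(".,()")}

def similar(a, b, threshold=0.5):
    """Stand-in similarity test: Jaccard overlap of token sets above a threshold."""
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / len(ta | tb) >= threshold if ta | tb else False

# Toy illustration of SimilarTitle ∧ SimilarVenue ⇒ SameCitation:
title_c, venue_c = "On the strength of weak learnability", "Machine Learning"
title_d, venue_d = "The strength of weak learnability", "Machine Learning 5(2)"
if similar(title_c, title_d) and similar(venue_c, venue_d):
    print("SameCitation(c, c') is supported")
</pre>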
  
====Joint Segmentation Model====
This model performs joint segmentation: citations are segmented collectively, as opposed to in isolation. A predicate<br>
<i>JointInferenceCandidate(c, i, c′)</i> was defined to be true if the trigram starting at position i in citation c also appears somewhere in citation c′, the trigrams do not match the “title exclusion” rules, and the trigram in c is not preceded by punctuation while in c′ it is. This predicate is used to extend the segmentation model through the following rule:<br>
<i>InField(i, +f, c) ∧ ¬HasPunc(c, i) ∧ (¬∃c′ JointInferenceCandidate(c, i, c′)) ⇒ InField(i + 1, +f, c)</i><br>
This rule says that the field inferred at position i is extended to position i + 1 only if position i is not followed by punctuation and no other citation provides evidence (via JointInferenceCandidate) that a new field, typically the title, begins at the next position.<br>
However, if this rule is used to segment the pair of citations given below, the titles will not match: the first citation's title begins with “On”, which will fail to match the title of the second citation, even though both citations are essentially the same.<br>
R. Schapire. On the strength of weak learnability. Proceedings of the 30th I.E.E.E. Symposium on the Foundations of Computer Science, 1989, pp. 28-33.<br>
Robert E. Schapire. 5(2) The strength of weak learnability. Machine Learning, 1990 197-227<br>
To take care of such cases, another rule was included that takes full advantage of joint inference, i.e., the segment is extended to the next position unless there is a candidate citation c′ that is also inferred to be the same publication:<br>
<i>InField(i, +f, c) ∧ ¬HasPunc(c, i) ∧ (¬∃c′ JointInferenceCandidate(c, i, c′) ∧ SameCitation(c, c′)) ⇒ InField(i + 1, +f, c)</i><br>
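As an illustration of the candidate test itself (not of MLN inference), the sketch below checks the trigram condition for a pair of tokenized citations; the tokenization, the placement of HasPunc, and the omission of the “title exclusion” checks are simplifying assumptions made here.
<pre>
def trigram(tokens, i):
    """Lower-cased trigram starting at position i, or None if out of range."""
    return tuple(t.lower() for t in tokens[i:i + 3]) if i + 3 <= len(tokens) else None

def joint_inference_candidate(c_tokens, c_punc_after, i, d_tokens, d_punc_after):
    """True if the trigram at position i of citation c also occurs in citation d,
    is not preceded by punctuation in c, but is preceded by punctuation in d.
    (The paper's additional "title exclusion" checks are omitted in this sketch.)"""
    tri = trigram(c_tokens, i)
    if tri is None or i == 0 or (i - 1) in c_punc_after:
        return False
    return any(trigram(d_tokens, j) == tri and (j - 1) in d_punc_after
               for j in range(1, len(d_tokens) - 2))

# Simplified tokenizations of the two Schapire citations above (assumed, for illustration).
c1 = ["R", "Schapire", "On", "the", "strength", "of", "weak", "learnability"]
c1_punc = {0, 1, 7}      # punctuation follows "R", "Schapire", "learnability"
c2 = ["Robert", "E", "Schapire", "5(2)", "The", "strength", "of", "weak", "learnability"]
c2_punc = {1, 2, 3, 8}   # punctuation follows "E", "Schapire", "5(2)", "learnability"

# The candidate test fires at the position of "the" in the first citation:
print(joint_inference_candidate(c1, c1_punc, 3, c2, c2_punc))  # True
</pre>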
  
==Experiments and Results==
For the CiteSeer data, four-fold cross-validation was done, while for the Cora data, three-fold cross-validation was done. The results, in F1, for identification of the individual entities (title, author, venue) and of all entities taken together on the CiteSeer and Cora datasets are given in Table-1 and Table-2 below, respectively. "All" means all citations, "Non-trivial" means citations that have at least one duplicate, and "Potential" means citations with a poorly marked author-title boundary (e.g., with punctuation missing after the author's last name and before the title's first word).
<b>Table-1</b><br>
[[File:poon_joint_inference_result_1.jpg]]<br>
<b>Table-2</b><br>
[[File:poon_joint_inference_result_2.jpg]]
 
 
 
