Ravichandran and Hovy, ACL 2002: Learning Surface Text Patterns for a Question Answering System

Citation

Ravichandran, D. and Hovy, E. 2002. Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 41-47.

Online version

An online version of this paper is available [1].

Summary

This paper introduces a method to learn surface patterns from text and use them to answer factoid-style questions.

Motivation

The authors consider the problem of answering factoid questions (e.g. When was X born?). Some possible approaches to this problem include using NER, WordNet, parsers, hand-tagged corpora, and ontology lists. It turns out that a very powerful approach is to use surface text patterns, e.g. “<NAME> was born in <BIRTHDATE>”, “<NAME> (<BIRTHDATE>–”. One could create such patterns manually, but this is time-consuming, and the patterns have to be created afresh for each new attribute. There is also a limit on how much data a person can look through, and manually created patterns are not very accurate. The authors therefore try to learn such patterns automatically and with high precision.

Big Picture

  • Use bootstrapping to build a large set of patterns starting with only a few examples of QA pairs.
  • Rank learned patterns in order of precision
  • Use patterns to answer unseen questions

Learning patterns

The authors first select an example QA pair, e.g. “Mozart 1756”, and submit it to a search engine; they download the top 1000 hits and retain only the sentences that contain both terms. They then pass each sentence through a suffix tree constructor, which finds all substrings, of all lengths, along with their counts (a small counting sketch follows the sentences below). For example, for the sentences:

  • “The great composer Mozart (1756–1791) achieved fame at a young age”
  • “Mozart (1756–1791) was a genius”
  • “The whole world would always be indebted to the great music of Mozart (1756–1791)”
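
A minimal sketch of the counting step, using brute-force word-n-gram counting in place of an actual suffix tree (punctuation is pre-separated by spaces here purely to keep tokenization simple):

<pre>
from collections import Counter

sentences = [
    "The great composer Mozart ( 1756 - 1791 ) achieved fame at a young age",
    "Mozart ( 1756 - 1791 ) was a genius",
    "The whole world would always be indebted to the great music of Mozart ( 1756 - 1791 )",
]

counts = Counter()
for s in sentences:
    tokens = s.split()
    spans = {" ".join(tokens[i:j])            # every contiguous word substring,
             for i in range(len(tokens))      # counted once per sentence
             for j in range(i + 1, len(tokens) + 1)}
    counts.update(spans)

# Longest substring that occurs in every sentence (score 3 in this example)
common = [p for p, c in counts.items() if c == len(sentences)]
print(max(common, key=len))   # -> "Mozart ( 1756 - 1791 )"
</pre>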

The longest matching substring is “Mozart (1756–1791)”, which the suffix tree would extract as one of its outputs, with score 3. Then, for each phrase in the suffix tree, they retain only those phrases that contain both terms, and replace the question term by “<NAME>” and the answer term by “<ANSWER>” (a short sketch of this step follows the list below). Some examples of learned patterns for the birth-year attribute include:

  • born in <ANSWER> , <NAME>
  • <NAME> was born on <ANSWER> ,
  • <NAME> ( <ANSWER> -
  • <NAME> ( <ANSWER> - )
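
A hedged sketch of this filtering-and-substitution step for the seed pair “Mozart 1756” (the function name is illustrative, not from the paper):

<pre>
def make_pattern(phrase, q_term, a_term):
    # Keep only phrases that contain both the question term and the answer term,
    # then substitute the generic tags.
    if q_term not in phrase or a_term not in phrase:
        return None
    return phrase.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")

print(make_pattern("Mozart ( 1756 -", "Mozart", "1756"))     # -> "<NAME> ( <ANSWER> -"
print(make_pattern("the great composer", "Mozart", "1756"))  # -> None (discarded)
</pre>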

Calculating precision of each pattern

The authors query the corpus with only the question term, download the top 1000 hits, and retain only the sentences that contain the question term. For each learned pattern, they check for its presence in each sentence under two conditions:

  • Presence with <ANSWER> matched by any word.
  • Presence with <ANSWER> matched by the correct answer term.

The precision of a pattern is then Ca / Co, where Ca is the number of times the pattern matched with the correct answer term and Co is the number of times it matched with any word in the <ANSWER> position. They retain only patterns that match for at least some minimum number k of the example QA pairs.
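
A hedged sketch of this precision computation (helper and variable names are illustrative; the second sentence is a made-up wrong match):

<pre>
import re

def answer_in(pattern, q_term, sentence):
    # Fill in the question term, split the pattern at <ANSWER>, and match the
    # single word that sits between the literal prefix and suffix.
    prefix, suffix = pattern.replace("<NAME>", q_term).split("<ANSWER>")
    m = re.search(re.escape(prefix) + r"(\S+)" + re.escape(suffix), sentence)
    return m.group(1) if m else None

def pattern_precision(pattern, q_term, a_term, sentences):
    c_a = c_o = 0
    for s in sentences:
        found = answer_in(pattern, q_term, s)
        if found is not None:
            c_o += 1                 # <ANSWER> matched by any word
            if found == a_term:
                c_a += 1             # <ANSWER> matched by the correct answer term
    return c_a / c_o if c_o else 0.0

hits = [
    "Mozart ( 1756 - 1791 ) was a genius",
    "the Mozart ( 1990 - 1994 ) recordings were remastered",   # hypothetical wrong hit
]
print(pattern_precision("<NAME> ( <ANSWER> -", "Mozart", "1756", hits))   # -> 0.5
</pre>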

Finding answers using learned patterns

The authors determine the question type of a new question, and the question term, using their existing QA system. They then query the target corpus with this question term, obtain a list of sentences, and use the pattern table for that question type to search for the presence of each pattern. They select the words matching the “<ANSWER>” tag as candidate answers, sort them by the precision scores of the patterns that generated them, discard duplicates, and return the top 5 answers.
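
A hedged sketch of this answering step (pattern precisions and sentences below are made up for illustration; answer_in is the same helper as in the precision sketch):

<pre>
import re

def answer_in(pattern, q_term, sentence):
    prefix, suffix = pattern.replace("<NAME>", q_term).split("<ANSWER>")
    m = re.search(re.escape(prefix) + r"(\S+)" + re.escape(suffix), sentence)
    return m.group(1) if m else None

def answer_question(q_term, pattern_table, sentences, top_k=5):
    # pattern_table: (pattern, precision) pairs learned for the question type.
    candidates = []
    for pattern, precision in pattern_table:
        for s in sentences:
            a = answer_in(pattern, q_term, s)
            if a is not None:
                candidates.append((precision, a))
    candidates.sort(reverse=True)            # most precise generating pattern first
    answers, seen = [], set()
    for _, a in candidates:
        if a not in seen:                    # discard duplicate answers
            seen.add(a)
            answers.append(a)
    return answers[:top_k]                   # return the top 5 answers

birthyear_patterns = [("<NAME> ( <ANSWER> -", 0.85), ("born in <ANSWER> , <NAME>", 0.70)]
print(answer_question("Mozart", birthyear_patterns,
                      ["Mozart ( 1756 - 1791 ) was a genius"]))   # -> ['1756']
</pre>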

Experiments

The authors considered 6 different question types: BIRTHDATE, LOCATION, INVENTOR, DISCOVERER, DEFINITION, WHY-FAMOUS. They took questions from the TREC-10 dataset and ran two experiments: one in which they used the TREC-10 corpus, with IR performed by their own QA system, and another in which they used the Web as the corpus, with IR performed by AltaVista.

[[File:Rhovy.png]]

They find that their approach performs better on the Web than on the TREC corpus. The abundance of data on the Web compared to TREC probably makes it easier to find candidate answers.

Drawbacks

The main drawbacks of the authors' approach include:

  • No external knowledge (e.g. POS tags) is used.
  • Works well for factoid questions, but not for definition or list questions.
  • Cannot handle long-distance dependencies.
  • Can handle only one anchor point (the question term) in a candidate answer sentence; it cannot handle questions that require multiple, possibly non-adjacent, words from the question to appear in the answer sentence.
  • The length of the answer is determined in an ad hoc way.

Conclusion

The authors present an elegant bootstrapping method to learn patterns that extract answers to factoid questions. This is essentially the same idea as extracting attributes of given entities (our IE term project). This is an early paper in the field of Attribute Extraction using patterns, and it sets the stage for much of the later work that addresses its shortcomings.

Related Papers, Methods and Datasets

Papers addressing Attribute Extraction, together with the methods and datasets they use:

{{#ask: [[AddressesProblem::Attribute Extraction]]
| ?UsesMethod
| ?UsesDataset
}}