Difference between revisions of "Automated Template Extraction"

From Cohen Courses
Line 8: Line 8:
 
Template-based information extraction methods have one glaring weakness: they rely on - you guessed it - templates. These templates are often hand-crafted, and thus either require a significant amount of time and painstaking tuning, or they are prone to errors. Neither of these alternatives is ideal, so it would be beneficial if we could automatically produce these templates from data.
 
  
The paper referenced below by Chambers and Jurafsky is what I plan to use as a "jumping-off" point, so to speak.  
+
The paper referenced below by Chambers and Jurafsky is what we plan to use as a "jumping-off" point, so to speak.  
  
I'd like to look more into the paper's methodology, apply it to a new domain, and potentially improve upon some methodology that is used.
+
We'd like to look more closely at the paper's methodology, apply it to a new domain, and potentially improve on parts of it.
  
== Baseline ==
+
== Baseline & Dataset ==
  
Given that this is a fairly novel approach, I'm not sure how easy it will be to find a baseline. I suppose it will depend on the final project methodology - if the focus is solely on the automated template extraction, it would be reasonable to attempt to compare a standard IE system and "hand-written" or some other "gold standard" templates with the automatically generated templates. It's something that will need to be given some thought.
+
(We're still a little bit unsure about this)
  
== Dataset ==
+
The Chambers and Jurafsky paper uses the [[Uses-Dataset::MUC|MUC 4]] dataset on terrorism. We could use any of the [http://www-nlpir.nist.gov/related_projects/muc/ MUC datasets]. [http://en.wikipedia.org/wiki/Message_Understanding_Conference General MUC dataset information]. Another possibility would be to show the power of automatic template extraction by applying it to a non-standard IE dataset.
  
I'm still hunting around for a good dataset to use for this problem.
+
In terms of a baseline, the methodology from the Chambers and Jurafsky paper is a good start, but the choice will depend on which dataset we use. If we use MUC 4 and decide to improve upon the methodology around that dataset, then the baseline from Chambers and Jurafsky will be sufficient. The other option is to use a different dataset, in which case we'll use some "standard" template-based IE methods (admittedly, we haven't yet narrowed down what those methods will be).
 
 
The Chambers and Jurafsky paper uses the [[Uses-Dataset::MUC|MUC 4]] data set on terrorism. We could use any of the [http://www-nlpir.nist.gov/related_projects/muc/ MUC datasets]. [http://en.wikipedia.org/wiki/Message_Understanding_Conference General MUC dataset information].
 
  
 
== Related Work ==
 
  
 
* [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Template-Based Information Extraction without the Templates] by Nathanael Chambers and Dan Jurafsky
 

Revision as of 00:14, 27 September 2011

Team Member(s)

Proposal

Template-based information extraction methods have one glaring weakness: they rely on - you guessed it - templates. These templates are often hand-crafted, and thus either require a significant amount of time and painstaking tuning, or they are prone to errors. Neither of these alternatives is ideal, so it would be beneficial if we could automatically produce these templates from data.

The paper referenced below by Chambers and Jurafsky is what we plan to use as a "jumping-off" point, so to speak.

We'd like to look more closely at the paper's methodology, apply it to a new domain, and potentially improve on parts of it.

Baseline & Dataset

(We're still a little bit unsure about this)

The Chambers and Jurafsky paper uses the MUC 4 dataset on terrorism. We could use any of the MUC datasets; general information about them is available on the Wikipedia page for the Message Understanding Conference. Another possibility would be to show the power of automatic template extraction by applying it to a non-standard IE dataset.

In terms of a baseline, the methodology from the Chambers and Jurafsky paper is a good start, but the choice will depend on which dataset we use. If we use MUC 4 and decide to improve upon the methodology around that dataset, then the baseline from Chambers and Jurafsky will be sufficient. The other option is to use a different dataset, in which case we'll use some "standard" template-based IE methods (admittedly, we haven't yet narrowed down what those methods will be).

Related Work