Difference between revisions of "Automated Template Extraction"
Line 2: | Line 2: | ||
* [[User:Fkeith|Francis Keith]] | * [[User:Fkeith|Francis Keith]] | ||
* [[User:Amr1|Andrew Rodriguez]] | * [[User:Amr1|Andrew Rodriguez]] | ||
− | |||
== Proposal == | == Proposal == | ||
Line 11: | Line 10: | ||
We'd like to look more into the paper's methodology, apply it to a new domain, and potentially improve upon some methodology that is used. | We'd like to look more into the paper's methodology, apply it to a new domain, and potentially improve upon some methodology that is used. | ||
+ | |||
+ | == Goal == | ||
+ | |||
+ | The goal we have is twofold: | ||
+ | |||
+ | * Develop an algorithm for automated template extraction, probably either unsupervised, or potentially semi-supervised | ||
+ | ** It will likely be similar to the Chambers and Jurafsky paper, but likely not exactly the same (as we will be combining a lot of out of the box components) | ||
+ | * Compare the results on MUC-4 to the results from Chambers and Jurafsky | ||
+ | * Apply the algorithm to a new dataset | ||
+ | ** This will not have a baseline | ||
+ | |||
+ | == Methodology == | ||
+ | |||
+ | The components we will need: | ||
+ | |||
+ | * Part of Speech Tagging | ||
+ | * Named Entity Recognition | ||
+ | * Semantic Role Labeling | ||
+ | |||
+ | Chambers and Jurafsky also use clustering algorithms for concluding that two templates are the same (i.e. ''detonate'' and ''destroy''). | ||
== Baseline & Dataset == | == Baseline & Dataset == | ||
− | + | The Chambers and Jurafsky paper uses the [[Uses-Dataset::MUC|MUC 4]] data set on terrorism. To give ourselves a good baseline, we will also use that set. | |
− | + | We will compare our results on MUC-4 with the results from the Chambers and Jurafsky paper. | |
− | + | == Second Dataset == | |
+ | |||
+ | One of the strengths of automatically generating templates is that it can be done in an unsupervised manner. In this way, we will show that this can be used to not only be expanded easily to new domains, but also it can be used to get significant information about domains. | ||
== Related Work == | == Related Work == | ||
* [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Template-Based Information Extraction without the Templates] by Nathanael Chambers and Dan Jurafsky | * [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Template-Based Information Extraction without the Templates] by Nathanael Chambers and Dan Jurafsky | ||
+ | |||
+ | == Other Links == | ||
+ | |||
+ | * [http://en.wikipedia.org/wiki/Message_Understanding_Conference General MUC information from Wikipedia] |
Revision as of 22:26, 5 October 2011
Team Member(s)
Proposal
Template-based information extraction methods have one glaring weakness: they rely on - you guessed it - templates. These templates are often hand-crafted, and thus either require a significant amount of time and painstaking tuning, or they are prone to errors. Neither of these alternatives is ideal, so it would be beneficial if we could automatically produce these templates from data.
The paper referenced below by Chambers and Jurafsky is what we plan to use as a "jumping-off" point, so to speak.
We'd like to examine the paper's methodology more closely, apply it to a new domain, and potentially improve on parts of it.
Goal
We have three goals:
- Develop an algorithm for automated template extraction, probably unsupervised or potentially semi-supervised
  - It will likely be similar to the Chambers and Jurafsky approach, but not identical (as we will be combining a lot of out-of-the-box components)
- Compare the results on MUC-4 to the results from Chambers and Jurafsky
- Apply the algorithm to a new dataset
  - This will not have a baseline
Methodology
The components we will need:
- Part of Speech Tagging
- Named Entity Recognition
- Semantic Role Labeling
Chambers and Jurafsky also use clustering algorithms to decide that two event words belong to the same template (e.g. detonate and destroy).
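As a rough illustration of the clustering step, the sketch below groups verbs that tend to appear in the same documents, using pointwise mutual information (PMI) as the similarity and average-link agglomerative merging. It is only a minimal sketch under our own assumptions: the function names, the `UNSEEN_PAIR_SCORE` floor, the merge threshold, and the toy documents are all hypothetical, and it is not meant to reproduce the exact procedure in the Chambers and Jurafsky paper.

```python
from collections import Counter
from itertools import combinations
from math import log

# Assumed penalty for verb pairs that never share a document (hypothetical value).
UNSEEN_PAIR_SCORE = -5.0


def pmi_scores(docs):
    """Return PMI for every pair of verbs that co-occur in at least one document."""
    n = len(docs)
    single = Counter()
    pair = Counter()
    for verbs in docs:
        seen = set(verbs)
        single.update(seen)
        pair.update(frozenset(p) for p in combinations(sorted(seen), 2))
    scores = {}
    for key, joint in pair.items():
        a, b = tuple(key)
        scores[key] = log((joint / n) / ((single[a] / n) * (single[b] / n)))
    return scores


def cluster_verbs(docs, threshold=0.0):
    """Average-link agglomerative clustering; stop when no pair of clusters scores above threshold."""
    scores = pmi_scores(docs)
    clusters = [{v} for v in sorted({v for d in docs for v in d})]

    def link(c1, c2):
        vals = [scores.get(frozenset((a, b)), UNSEEN_PAIR_SCORE) for a in c1 for b in c2]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        best, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best <= threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters


# Toy input: each document is reduced to the verbs found in it (made-up data).
toy_docs = [
    ["detonate", "destroy", "explode"],
    ["detonate", "destroy"],
    ["kidnap", "release"],
    ["kidnap", "release", "demand"],
    ["explode", "destroy"],
    ["demand", "kidnap"],
]
templates = cluster_verbs(toy_docs)
```

On this toy input the bombing-style verbs and the kidnapping-style verbs end up in separate clusters, which is the behaviour we would want from a template-induction step. In a real system the verbs would come from the tagging and role-labeling components listed above rather than from hand-written lists.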
Baseline & Dataset
The Chambers and Jurafsky paper uses the MUC-4 data set on terrorism. To give ourselves a good baseline, we will also use that set.
We will compare our results on MUC-4 with the results from the Chambers and Jurafsky paper.
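One simple way to make that comparison concrete is to score extracted slot fillers with precision, recall, and F1. The sketch below assumes each extraction is a `(document id, slot, filler)` tuple and counts only exact matches; the function name, the tuple layout, and the sample values are hypothetical, and official MUC-4 scoring is more involved than this.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring over sets of (doc_id, slot, filler) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up extractions, for illustration only.
system_output = {
    ("d1", "perpetrator", "guerrillas"),
    ("d1", "target", "embassy"),
    ("d2", "victim", "mayor"),
}
gold_standard = {
    ("d1", "perpetrator", "guerrillas"),
    ("d1", "target", "bridge"),
    ("d2", "victim", "mayor"),
    ("d2", "instrument", "bomb"),
}
p, r, f1 = precision_recall_f1(system_output, gold_standard)
```

Scoring over sets keeps the example short, but it means duplicate extractions of the same filler are counted once.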
Second Dataset
One of the strengths of automatically generating templates is that it can be done in an unsupervised manner. By applying the algorithm to a second dataset, we aim to show that the approach not only extends easily to new domains, but can also extract significant information about those domains.
Related Work
- Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky