Automated Template Extraction

== Team Member(s) ==

* [[User:Fkeith|Francis Keith]]
* [[User:Amr1|Andrew Rodriguez]]
* Anyone else who may be interested
 
  
 
== Proposal ==
 
 
Template-based information extraction methods have one glaring weakness: they rely on - you guessed it - templates. These templates are often hand-crafted, and thus either require a significant amount of time and painstaking tuning, or are prone to errors. Neither alternative is ideal, so it would be beneficial if we could automatically produce the templates from data.

The paper referenced below by Chambers and Jurafsky is what we plan to use as a "jumping-off" point.

We'd like to look more closely at the paper's methodology, apply it to a new domain, and potentially improve upon some of the methods used.
  
== Goal ==

Our goal is threefold:

* Develop an algorithm for automated template extraction, probably unsupervised or potentially semi-supervised.
** It will likely be similar to the Chambers and Jurafsky approach, but not identical, since we will be combining many off-the-shelf components.
* Compare the results on MUC-4 to the results from Chambers and Jurafsky.
* Apply the algorithm to a new dataset.
** This will not have a baseline.

== Intuition ==
Templates in an information extraction task generally represent important information to pull from a subset of the documents. The intuition we're following is that, generally, the information we're seeking is a specific semantic role within a specific action (e.g. who performed action ''X''). By this reasoning, finding the semantic relations within a given document should let us recover most of the important templates.
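
To make this concrete, here is a minimal sketch (in Python) of the mapping we have in mind: one semantic-role frame becomes one candidate template whose slots are the role labels. The frame format and all names here are our own placeholders, not the output of any particular SRL tool.

<pre>
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateTemplate:
    predicate: str   # the action, e.g. "detonate"
    roles: tuple     # the slots, e.g. ("A0", "A1")

def frame_to_template(frame):
    """Turn a hypothetical SRL frame such as
    {"predicate": "detonate", "args": {"A0": "terrorists", "A1": "a bomb"}}
    into a candidate template: the predicate plus its sorted role labels."""
    return CandidateTemplate(frame["predicate"], tuple(sorted(frame["args"])))

frame = {"predicate": "detonate", "args": {"A0": "terrorists", "A1": "a bomb"}}
print(frame_to_template(frame))
# CandidateTemplate(predicate='detonate', roles=('A0', 'A1'))
</pre>
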
We will also need to devise a way to filter out bad templates, or templates that are not indicative of the domain. There are certainly many ways to do this: something as simple as taking the ''N'' templates that occur most often in the data is one option, while we could also use more complex clustering, as Chambers and Jurafsky do.
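
A minimal sketch of that simplest filter, assuming <code>templates</code> is the flat list of candidates collected over the whole corpus (e.g. by the mapping sketched above):

<pre>
from collections import Counter

def top_n_templates(templates, n=50):
    """Keep only the n candidate templates that occur most often,
    returned as (template, count) pairs, most frequent first."""
    return Counter(templates).most_common(n)
</pre>
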
One of the nice things about this idea is that it is unsupervised. Assuming we have the tools for semantic role labeling, clustering, and template selection, we can apply the technique to any domain. In addition, it need not be limited to information extraction - using semantic role labeling in this unsupervised manner could help with document summarization, as well as with gaining domain knowledge.
== Methodology ==
The components we will need (a sketch of how they chain together follows the list):
* Part-of-Speech Tagging
* Named Entity Recognition
* Semantic Role Labeling ([http://cogcomp.cs.illinois.edu/page/software_view/12 Illinois Semantic Role Labeller] or [http://www.surdeanu.name/mihai/swirl/ SwiRL])
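
The skeleton below shows how we expect the pieces to chain together per document. All three tagger functions are hypothetical placeholders for whatever tools we end up wrapping (e.g. the Illinois labeller or SwiRL); none of this is a real tool's API.

<pre>
def pos_tag(sentence):
    # -> [(token, pos_tag), ...]; wrap the chosen POS tagger here
    raise NotImplementedError

def ner_tag(sentence):
    # -> [(token, entity_label), ...]; wrap the chosen NER system here
    raise NotImplementedError

def srl_frames(sentence):
    # -> [{"predicate": ..., "args": {...}}, ...]; wrap Illinois SRL or SwiRL here
    raise NotImplementedError

def document_candidates(sentences):
    """Run SRL over each sentence and collect candidate templates,
    reusing frame_to_template from the earlier sketch."""
    return [frame_to_template(f) for s in sentences for f in srl_frames(s)]
</pre>
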
Chambers and Jurafsky also use clustering algorithms to conclude that two templates are the same (e.g. ''detonate'' and ''destroy''). We will begin doing this in a very simple manner (likely using just WordNet to find alternate options and taking the ''N''-best).
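
A minimal sketch of that simple WordNet-based merging, using NLTK's WordNet interface; the Wu-Palmer similarity measure and the threshold are our assumptions, not details from the paper:

<pre>
from itertools import product
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def alternate_options(verb, n=5):
    """N-best alternate options: other verb lemmas sharing a synset
    with `verb`, in WordNet's most-common-sense-first order."""
    lemmas = [l.replace('_', ' ')
              for syn in wn.synsets(verb, pos=wn.VERB)
              for l in syn.lemma_names() if l != verb]
    return lemmas[:n]

def same_event(verb_a, verb_b, threshold=0.5):
    """Merge two template triggers (e.g. 'detonate'/'destroy') when any
    pair of their verb senses is similar enough under Wu-Palmer."""
    pairs = product(wn.synsets(verb_a, pos=wn.VERB),
                    wn.synsets(verb_b, pos=wn.VERB))
    return any((a.wup_similarity(b) or 0.0) >= threshold for a, b in pairs)
</pre>
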
From here, we plan to improve various components. This will be implementation-specific, depending on what tools we use: some off-the-shelf components will be difficult to improve upon, while for others it will be easier to plug in different algorithms. In particular, the clustering should be a natural point of focus for improvement. However, given that clustering is less within the scope of the class, finding ways to improve the other components (which are structured prediction components) will be more interesting. Again, this will depend on the specific components and how customizable they are.
== Baseline & Dataset ==
The Chambers and Jurafsky paper uses the [[UsesDataset::MUC|MUC-4]] dataset on terrorism. To give ourselves a good baseline, we will also use that set.

We will compare our results on MUC-4 with the results from the Chambers and Jurafsky paper.
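
For that comparison we will need a scorer. Full MUC scoring has subtleties (optional slots, alternative correct fills), so the exact-match precision/recall/F1 below is only our own simplified placeholder:

<pre>
def score(gold, predicted):
    """gold, predicted: sets of (doc_id, slot, fill) triples.
    Returns (precision, recall, F1) under exact matching."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
</pre>
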
== Second Dataset ==
One of the strengths of automatically generating templates is that it can be done in an unsupervised manner. We will use the second dataset to show not only that the approach extends easily to new domains, but also that it can surface significant information about a domain.

We still need to determine a specific second dataset.
  
 
== Related Work ==
 
* [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Template-Based Information Extraction without the Templates] by Nathanael Chambers and Dan Jurafsky
 
== Other Links ==
* [http://en.wikipedia.org/wiki/Message_Understanding_Conference General MUC information from Wikipedia]
