Difference between revisions of "Wu and Weld ACL 2010"

Revision as of 13:14, 30 September 2010

Citation

Wu, F. and Weld, D. S. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association For Computational Linguistics (Uppsala, Sweden, July 11 - 16, 2010). ACL Workshops. Association for Computational Linguistics, Morristown, NJ, 118-127.

Online version

ACM Digital Library

Summary

This is a recent paper that addresses the Open Information Extraction problem. The authors propose an extraction system, WOE. First, training data is extracted from Wikipedia using KYLIN (Wu_and_Weld_CIKM_2007); this data is then processed to train an unlexicalized extractor, as in TEXTRUNNER (Banko_et_al_IJCAI_2007). There are many similarities between WOE and the other two systems.

There are three components in the system:

  1. Processor
    • Wikipedia pages are parsed with OpenNLP tools and the Stanford parser.
    • Redirection and backward links are used to construct the synonym sets for entities.
  2. Matcher
    • For each attribute-value pair (relation) in an infobox, the matcher heuristically looks for a reference sentence in the article. DBpedia was used as the source of cleaned infoboxes.
  3. Extractor
    • The first option is to train a classifier that decides whether the shortest dependency path between two NPs expresses a relation. The second option is to train a CRF, as in TEXTRUNNER, that tags whether the words between two NPs are part of a relation.
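
The matcher and the path-based extractor can be illustrated with a minimal sketch. This is not WOE's actual implementation: the function names, the toy sentences, and the dependency labels below are all hypothetical, and the substring test stands in for the heuristics the paper actually uses. It only shows the two ideas: finding a sentence that mentions both the article subject (via its synonym set) and an infobox value, and computing the shortest dependency path between two NP heads.

```python
from collections import deque


def find_reference_sentence(sentences, subject_synonyms, value):
    """Return the first sentence that mentions the attribute value and
    some synonym of the article subject, or None if nothing matches.
    (Simplified stand-in for the matcher; plain substring matching.)"""
    value_lc = value.lower()
    for sentence in sentences:
        text = sentence.lower()
        if value_lc in text and any(s.lower() in text for s in subject_synonyms):
            return sentence
    return None


def shortest_dependency_path(edges, start, goal):
    """Breadth-first search over an undirected view of a dependency parse.
    `edges` is a list of (head, dependent, label) triples. The result
    alternates tokens and edge labels, or is None if no path exists."""
    graph = {}
    for head, dep, label in edges:
        graph.setdefault(head, []).append((dep, label))
        graph.setdefault(dep, []).append((head, label))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label, neighbor]))
    return None


# Hypothetical article text and synonym set for a made-up entity.
sentences = [
    "Acme Corp makes anvils.",
    "The company was founded by Jane Doe.",
]
synonyms = ["Acme Corp", "The company"]
reference = find_reference_sentence(sentences, synonyms, "Jane Doe")

# Hand-written dependency edges for "Acme Corp was founded by Jane Doe."
edges = [
    ("founded", "Corp", "nsubjpass"),
    ("founded", "was", "auxpass"),
    ("founded", "Doe", "agent"),
    ("Corp", "Acme", "nn"),
    ("Doe", "Jane", "nn"),
]
path = shortest_dependency_path(edges, "Corp", "Doe")
# path == ["Corp", "nsubjpass", "founded", "agent", "Doe"]
```

In a pipeline of this shape, the path between the two NP heads would be generalized (for example, by replacing words with part-of-speech tags) before being passed to the classifier, so that the learned patterns stay unlexicalized.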

Three corpora were used in the evaluation: 300 random sentences each from Penn Treebank WSJ, Wikipedia, and Web pages.

Related papers

KYLIN is described in more detail in Wu_and_Weld_CIKM_2007, in the context of refining Wikipedia. TEXTRUNNER is described in Banko_et_al_IJCAI_2007.