Wu and Weld ACL 2010

Citation

Wu, F. and Weld, D. S. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (Uppsala, Sweden, July 11-16, 2010). Association for Computational Linguistics, Morristown, NJ, 118-127.

Online version

ACM Digital Library

Summary

This paper addresses the Open Information Extraction problem. The authors propose an extraction system, WOE. Training data is first extracted from Wikipedia using heuristics from KYLIN (Wu_and_Weld_CIKM_2007), and then processed to train an unlexicalized extractor in the style of TEXTRUNNER (Banko_et_al_IJCAI_2007). There are many similarities between WOE and these two systems.

There are three components in the system:

  1. Processor
    • Wikipedia pages are parsed with OpenNLP tools and the Stanford parser.
    • Redirection pages and backward links are used to construct synonym sets for entities.
  2. Matcher
    • For each attribute-value pair (relation), the matcher heuristically looks for a sentence in the article that expresses it. DBpedia was used as a clean set of infoboxes. A sketch of this matching step appears after this list.
  3. Extractor
    • The first option is to train a classifier that decides whether the shortest dependency path between two NPs expresses a relation (see the dependency-path sketch below). The second option is to train a CRF, as in TEXTRUNNER, to tag whether the words between two NPs are part of a relation.
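
As a concrete illustration of the matching step referenced above, here is a minimal sketch in Python; the sentences, synonym set, and substring-based matching rule are toy assumptions, not the paper's actual heuristics:

    # Hypothetical sketch of WOE-style sentence matching: for each infobox
    # attribute-value pair, find a sentence that mentions both the article's
    # subject (or one of its synonyms) and the attribute value.

    def match_sentences(sentences, subject_synonyms, infobox):
        """Return {attribute: sentence} for pairs grounded in the article."""
        matches = {}
        for attribute, value in infobox.items():
            for sentence in sentences:
                lowered = sentence.lower()
                has_subject = any(s.lower() in lowered for s in subject_synonyms)
                if has_subject and value.lower() in lowered:
                    matches[attribute] = sentence  # keep the first match
                    break
        return matches

    sentences = [
        "Barack Obama was born in Honolulu, Hawaii.",
        "Obama married Michelle Robinson in 1992.",
    ]
    subject_synonyms = {"Barack Obama", "Obama"}  # from redirection/backward links
    infobox = {"birth_place": "Honolulu", "spouse": "Michelle Robinson"}

    print(match_sentences(sentences, subject_synonyms, infobox))
    # {'birth_place': 'Barack Obama was born in Honolulu, Hawaii.',
    #  'spouse': 'Obama married Michelle Robinson in 1992.'}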
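
The dependency-path option in the extractor can be sketched the same way: treat the parse as an undirected graph, take the shortest path between two NP heads, and hand that path to a binary classifier. The sentence, edge labels, and BFS helper below are illustrative assumptions, not the paper's implementation:

    from collections import deque

    # Toy dependency parse of "Obama married Michelle Robinson in 1992."
    # as (head, dependent, label) triples; a real system would obtain
    # these from the Stanford parser.
    edges = [
        ("married", "Obama", "nsubj"),
        ("married", "Robinson", "dobj"),
        ("Robinson", "Michelle", "compound"),
        ("married", "1992", "obl"),
    ]

    adjacency = {}
    for head, dep, label in edges:
        adjacency.setdefault(head, []).append((dep, label))
        adjacency.setdefault(dep, []).append((head, label))

    def shortest_dependency_path(source, target):
        """Breadth-first search over the undirected dependency graph."""
        queue = deque([[(source, None)]])
        seen = {source}
        while queue:
            path = queue.popleft()
            node = path[-1][0]
            if node == target:
                return path
            for neighbor, label in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [(neighbor, label)])
        return None

    # The resulting path is the unit the classifier judges: does
    # "Obama -nsubj- married -dobj- Robinson" express a relation?
    print(shortest_dependency_path("Obama", "Robinson"))
    # [('Obama', None), ('married', 'nsubj'), ('Robinson', 'dobj')]

In WOE, the sentences found by the matcher supply the training data for such a classifier.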

Three corpora were used in evaluation: 300 random sentences drawn from the Penn Treebank WSJ, Wikipedia, and general Web pages.

Related papers

More details of KYLIN, used for the task of completing infoboxes in Wikipedia pages, can be found in Wu_and_Weld_CIKM_2007. TEXTRUNNER is described in Banko_et_al_IJCAI_2007.