Wu and Weld ACL 2010


Citation

Wu, F. and Weld, D. S. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association For Computational Linguistics (Uppsala, Sweden, July 11 - 16, 2010). ACL Workshops. Association for Computational Linguistics, Morristown, NJ, 118-127.

Online version

ACM Digital Library

Summary

This paper addresses the Open Information Extraction problem. The authors propose an extraction system, WOE. Training data is first extracted from Wikipedia using KYLIN (Wu and Weld, CIKM 2007), and is then processed to train an unlexicalized extractor like that of TEXTRUNNER (Banko et al., IJCAI 2007). WOE shares many similarities with both of these systems.

There are three components in the system:

  1. Processor
    • Wikipedia pages are parsed with OpenNLP tools and the Stanford parser.
    • Redirects and backward links are used to construct synonym sets for entities.
  2. Matcher
    • For each attribute-value pair (relation), the matcher heuristically looks for a reference sentence for it in the article. DBpedia is used as a clean source of infobox data.
  3. Extractor
    • The first option is to train a classifier that decides whether the shortest dependency path between two NPs expresses a relation. The second option is to train a CRF, as in TEXTRUNNER, that tags whether the words between two NPs are part of a relation.
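
The matcher and the first extractor option can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the substring-based matching rule, and the hand-written dependency edges and labels are all assumptions made for illustration. A real pipeline would take the parse from a dependency parser and would apply stricter matching heuristics.

```python
from collections import deque


def match_sentence(sentences, subject_synonyms, value):
    """Hypothetical matcher: return the first sentence that mentions both
    a synonym of the article subject and the attribute value.
    Substring matching is a crude stand-in for the real heuristics."""
    value_lc = value.lower()
    for sentence in sentences:
        text = sentence.lower()
        has_subject = any(s.lower() in text for s in subject_synonyms)
        if has_subject and value_lc in text:
            return sentence
    return None


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over the dependency tree, treated as an
    undirected graph. `edges` holds (head, label, dependent) triples.
    Returns alternating tokens and edge labels, or None if no path exists."""
    graph = {}
    for head, label, dependent in edges:
        graph.setdefault(head, []).append((dependent, label))
        graph.setdefault(dependent, []).append((head, label))

    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label, neighbor]))
    return None


# Toy article text and a hand-written parse of the matched sentence
# (the edge labels are illustrative assumptions, not parser output).
sentences = ["Seattle is a city in Washington.", "Dan was born in Seattle."]
edges = [
    ("born", "nsubjpass", "Dan"),
    ("born", "auxpass", "was"),
    ("born", "prep_in", "Seattle"),
]

print(match_sentence(sentences, {"Dan", "Daniel"}, "Seattle"))
print(shortest_dependency_path(edges, "Dan", "Seattle"))
```

In this sketch, the path between the two NPs would be the input to the classifier in the first extractor option; the CRF option would instead label the words that lie between the two NPs in the sentence.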

Three corpora were used in the evaluation: 300 random sentences from each of Penn Treebank WSJ, Wikipedia, and Web pages.

Related papers

More details on KYLIN can be found in Wu and Weld, CIKM 2007, which addresses the task of refining Wikipedia. TEXTRUNNER is described in Banko et al., IJCAI 2007.