Yan et al., ACL-IJCNLP 2009
Citation
Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka. 2009. Unsupervised relation extraction by mining Wikipedia texts using information from the web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: (ACL-IJCNLP '09), Vol. 2. Association for Computational Linguistics, Morristown, NJ, USA, 1021-1029.
Online version
Summary
This paper presents an unsupervised, clustering-based method for relation extraction from Wikipedia pages. The authors combine dependency patterns with surface patterns.
Motivation
Existing semi-supervised and unsupervised methods extract relations by exploiting redundancy in a corpus or on the Web: co-occurrences of word pairs are collected from the corpus, and surface patterns are generated from them. The paper's hypothesis is that deeper linguistic features, such as the dependency structure of the text, can also be useful for extracting relations.
The main idea of the paper is to combine linguistic analysis of Wikipedia pages with frequency information from the Web to improve relation extraction performance. Concept pairs extracted from Wikipedia pages are grouped into clusters based on the similarity of their context patterns: dependency patterns from Wikipedia and surface patterns from the Web.
- Concept pairs are first extracted from Wikipedia pages. The anchor-text concepts in the article describing a concept are treated as related to that concept.
- The concept pairs are used as queries to Google, and the returned snippets are retrieved.
- Relational terms (verbs and nouns) are extracted from sentences that mention the concept pairs. These terms are ranked using an entropy-based method (Chen et al., IJCNLP 2005).
- Surface patterns are then generated from the strings between the two concepts of a pair in the snippets.
- Dependency patterns are generated from sub-paths of the shortest path between a concept pair in the dependency tree of the selected Wikipedia sentence.
- K-means is used to cluster the concept pairs into groups.
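The surface-pattern and clustering steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`surface_pattern`, `pattern_vectors`, `kmeans`), the `X`/`Y` placeholders, the made-up snippets, the binary pattern vectors, and the naive centroid seeding are all assumptions made for the example.

```python
import re

def surface_pattern(snippet, a, b):
    """Return the string between concepts a and b with the concepts
    replaced by X and Y, or None if they do not occur in that order."""
    m = re.search(re.escape(a) + r"(.*?)" + re.escape(b), snippet)
    if m is None:
        return None
    return "X" + m.group(1) + "Y"

def pattern_vectors(pair_snippets):
    """Map each concept pair to a binary vector over all observed patterns."""
    pair_patterns = {}
    for (a, b), snippets in pair_snippets.items():
        pats = {surface_pattern(s, a, b) for s in snippets}
        pats.discard(None)
        pair_patterns[(a, b)] = pats
    vocab = sorted(set().union(*pair_patterns.values()))
    vectors = {
        pair: [1.0 if p in pats else 0.0 for p in vocab]
        for pair, pats in pair_patterns.items()
    }
    return vocab, vectors

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. The first k points seed the centroids,
    which keeps the toy example deterministic but is a naive choice."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up concept pairs and snippets, purely for illustration.
pair_snippets = {
    ("Alice Smith", "Acme"): ["Alice Smith is the CEO of Acme."],
    ("Carol Lee", "Initech"): ["Carol Lee founded Initech."],
    ("Bob Jones", "Globex"): ["Bob Jones is the CEO of Globex."],
    ("Dan Wu", "Umbrella"): ["Dan Wu founded Umbrella years ago."],
}

vocab, vectors = pattern_vectors(pair_snippets)
pairs = list(vectors)
labels = kmeans([vectors[p] for p in pairs], k=2)
for pair, label in zip(pairs, labels):
    print(label, pair)
```

In this toy setup, pairs that share the same surface pattern end up in the same cluster. A fuller version would presumably also add dependency-pattern features from Wikipedia to each pair's vector before clustering.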
Evaluation
The method was evaluated on two Wikipedia categories: "American chief executives" and "Companies". The proposed method found more relation instances, with much higher precision.
Related papers
Instead of using the full shortest dependency path (Bunescu and Mooney, EMNLP 2005) as the dependency pattern, the authors use sub-paths of it, with the aim of improving the coverage of the patterns.
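To make the difference between the full path and its sub-paths concrete, here is a small sketch. It assumes a hand-written toy dependency tree given as (head, dependent) pairs with unique words, and it treats a "sub-path" as any contiguous fragment of the shortest path with at least two nodes; the paper's exact definition may differ, and the function names are hypothetical.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS over an undirected view of the dependency tree; returns the
    list of nodes from start to goal, or None if they are not connected."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def sub_paths(path, min_len=2):
    """All contiguous fragments of the path with at least min_len nodes
    (the full path itself is included as the longest fragment)."""
    n = len(path)
    return [tuple(path[i:j]) for i in range(n) for j in range(i + min_len, n + 1)]

# Toy parse of "X became the chairman of Y" as (head, dependent) pairs.
edges = [
    ("became", "X"),
    ("became", "chairman"),
    ("chairman", "the"),
    ("chairman", "of"),
    ("of", "Y"),
]

path = shortest_path(edges, "X", "Y")
subs = sub_paths(path)
print(path)
for sp in subs:
    print(sp)
```

A short fragment such as `("became", "chairman")` is likely to recur in more sentences than the full path between the two concepts, which is one way to read the coverage argument above.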