Mnduong writeup of Wang & Cohen 2007

From Cohen Courses

This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:mnduong.

  • This paper introduces SEAL, a method that expands a small set of seeds into a larger set of instances of the same type, as Google Sets does. The method is both language- and domain-independent. It consists of three components: the Fetcher, the Extractor, and the Ranker.
  • The Fetcher queries Google with the list of seeds and fetches the top-ranked result pages.
  • For each page, the Extractor learns several wrappers with an algorithm designed to describe the contexts in which the seeds appear, then uses these wrappers to extract candidate instances. Wrappers are specific to each page: within a page, similar instances are likely to appear in similar contexts, but different pages can use different styles or markup. This lets the method work across document styles. The algorithm also operates at the character level, which makes it language independent.
  • Finally, the Ranker ranks the candidate instances using a lazy random walk on a graph whose nodes are all the seeds, documents, wrappers, and candidate instances, and whose edges are the relations between them. At each step, the walker picks a relation uniformly at random from those available at the current node, then picks a destination uniformly from the nodes reachable through that relation. Nodes are ranked by their scores at the end of the walk. The motivation behind this similarity measure is "the more non-noisy entities extracted by a wrapper, the better quality the wrapper (and vice versa), and the more high-quality wrappers derived from a document, the better quality the document (and vice versa)."
  • In the evaluation, the system outperformed Google Sets, achieving a MAP (mean average precision) twice as high on English. It also performed well on Chinese and Japanese, but there is no other known method for these languages to compare against.
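
To make the Extractor and Ranker steps more concrete, here is a minimal Python sketch. It is my own simplification, not the authors' implementation. Assumptions that are not in the paper summary above: a wrapper is taken to be the longest left and right character context shared by the first occurrence of every seed on a page, each page yields a single wrapper, the walk starts from the seeds, the walker stays put with probability 0.5, and the relation names (find, derive, extract and their inverses), function names, and toy pages are all illustrative.

```python
import os
import re
from collections import defaultdict


def learn_wrapper(page, seeds):
    """Simplified wrapper: longest character context shared by the first
    occurrence of every seed on this page (an assumption of this sketch)."""
    lefts, rights = [], []
    for seed in seeds:
        i = page.find(seed)
        if i < 0:
            return None  # a seed is missing, so no wrapper for this page
        lefts.append(page[:i])
        rights.append(page[i + len(seed):])
    # Common suffix of the left contexts = common prefix of their reversals.
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    return (left, right) if left and right else None


def extract(page, wrapper):
    """Return every string bracketed by the wrapper's left and right context."""
    left, right = wrapper
    # The lookahead leaves the right context unconsumed, so that adjacent
    # items whose contexts overlap are still found.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)


def build_graph(seeds, pages):
    """Edges (source, relation, target) over seeds, documents, wrappers, mentions."""
    edges = []
    for doc, page in pages.items():
        wrapper = learn_wrapper(page, seeds)
        if wrapper is None:
            continue
        w = ("wrapper", doc)  # one wrapper node per page in this sketch
        for s in seeds:
            edges += [(s, "find", doc), (doc, "find_inv", s)]
        edges += [(doc, "derive", w), (w, "derive_inv", doc)]
        for m in set(extract(page, wrapper)):
            edges += [(w, "extract", m), (m, "extract_inv", w)]
    return edges


def lazy_random_walk(edges, start_nodes, steps=10, stay=0.5):
    """Propagate probability mass for a fixed number of steps.
    `stay` and `steps` are assumed values, not taken from the paper."""
    out = defaultdict(lambda: defaultdict(list))
    for src, rel, dst in edges:
        out[src][rel].append(dst)
    prob = {n: 1.0 / len(start_nodes) for n in start_nodes}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in prob.items():
            relations = out.get(node)
            if not relations:
                nxt[node] += p  # dead end: all mass stays
                continue
            nxt[node] += stay * p  # lazy step
            for targets in relations.values():
                # Uniform over relations, then uniform over reachable nodes.
                share = (1 - stay) * p / len(relations) / len(targets)
                for t in targets:
                    nxt[t] += share
        prob = dict(nxt)
    return prob


def rank_candidates(seeds, pages, steps=10):
    """Rank non-seed mentions by their probability mass after the walk."""
    edges = build_graph(seeds, pages)
    scores = lazy_random_walk(edges, seeds, steps=steps)
    mentions = {dst for _, rel, dst in edges if rel == "extract"}
    return sorted(mentions - set(seeds), key=lambda m: -scores.get(m, 0.0))


# Toy pages in two different markup styles (made up for illustration).
SEEDS = ["Ford", "Toyota"]
PAGE_A = ("<table><tr><td>Ford</td><td>USA</td></tr>"
          "<tr><td>Toyota</td><td>Japan</td></tr>"
          "<tr><td>Honda</td><td>Japan</td></tr>"
          "<tr><td>Nissan</td><td>Japan</td></tr></table>")
PAGE_B = ("<dl><dt>Ford</dt><dd>US</dd><dt>Toyota</dt><dd>JP</dd>"
          "<dt>Honda</dt><dd>JP</dd></dl>")
PAGES = {"doc:a": PAGE_A, "doc:b": PAGE_B}

print(rank_candidates(SEEDS, PAGES))
```

On these toy pages, Honda ranks above Nissan because it is extracted by the wrappers of both pages while Nissan is extracted by only one, which is the intuition in the quoted motivation. Because the wrapper is learned from raw characters rather than tokens or tags, nothing in the sketch depends on the language of the page.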

Comments/Questions:

  • The method is quite simple. I like the fact that the Extractor is language and domain independent. The graph-based similarity measure used to rank the candidate instances is also well motivated.
  • How fast did SEAL run? It would be interesting to see a comparison against KnowItAll's extraction module.