Liuliu writeup of Wang 2009

From Cohen Courses

This is a review of Wang_2009_automatic_set_instance_extraction_using_the_web by user:Liuliu.

This paper is an extension of the two previous SEAL papers. It extracts instances of a semantic class given only the name of that class.

The biggest difference is that it uses a noisy instance generator, which applies hyponym patterns to the class name to find a set of noisy candidate instances. Because the patterns are written for a particular language, this makes the system language dependent.
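To make the idea of a noisy instance generator concrete, here is a minimal sketch in Python. It assumes a single English hyponym pattern ("<class name> such as A, B, and C") and a list of text snippets already retrieved for the class name. The function name `noisy_instances`, the pattern, and the sample snippets are all made up for illustration and are not taken from the paper.

```python
import re

def noisy_instances(class_name, snippets):
    """Collect candidate instances that follow '<class_name> such as ...'.

    Hypothetical helper: one hard-coded English pattern, no ranking.
    The output is expected to be noisy.
    """
    pattern = re.compile(
        re.escape(class_name) + r"\s+such as\s+([^.]+)", re.IGNORECASE
    )
    # Split the captured list on commas and on "and" / "or".
    splitter = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")
    found = []
    for text in snippets:
        for match in pattern.finditer(text):
            for item in splitter.split(match.group(1)):
                item = item.strip()
                if item and item not in found:
                    found.append(item)
    return found

snippets = [
    "Popular car makers such as Toyota, Honda, and Ford dominate sales.",
    "We like car makers such as BMW and many others",
]
candidates = noisy_instances("car makers", snippets)
```

On these snippets the candidates include good instances such as "Toyota" next to junk such as "many others", which is why the later SEAL-style components need to be noise resistant. The pattern itself is English-specific, which illustrates the language dependence noted above.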

The other components of ASIE are based on the original and noise-resistant SEAL systems with minor changes, such as changing the page-seeding and wrapper-seeding strategies to extract web pages and longest common contexts that could contain only relevant seeds.
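The notion of a "longest common context" can be sketched as follows. This is a simplified illustration, not the SEAL algorithm itself: it assumes that only the first occurrence of each seed on a page is considered, and the function names `build_wrapper` and `apply_wrapper` and the sample page are hypothetical.

```python
import os
import re

def common_suffix(strings):
    # Longest common suffix, computed as a common prefix of reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def build_wrapper(page, seeds):
    """Return (left, right) contexts shared by the first occurrence of every seed."""
    lefts, rights = [], []
    for seed in seeds:
        i = page.find(seed)
        if i < 0:
            return None  # a seed is missing, so the page yields no wrapper
        lefts.append(page[:i])
        rights.append(page[i + len(seed):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        return None
    return left, right

def apply_wrapper(page, wrapper):
    """Extract every string bracketed by the wrapper's left and right contexts."""
    left, right = wrapper
    # The right context is a lookahead so adjacent list items can share markup.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)

page = ('<ul><li class="m">Toyota</li><li class="m">Honda</li>'
        '<li class="m">Ford</li></ul>')
wrapper = build_wrapper(page, ["Toyota", "Ford"])
extracted = apply_wrapper(page, wrapper)
```

Here the two seeds share the contexts `><li class="m">` and `</li><`, and applying that wrapper also picks up "Honda", which was not a seed. If a noisy seed does not share a context with the others, no wrapper is produced, which is one way irrelevant seeds get filtered out.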

What I like

  • Simple is good.

The system doesn't rely on deep linguistic features or machine learning theory. The idea is easy to understand and easy to implement, and it works well and efficiently. It could be generalized to other languages, so I am really interested in knowing whether the "back-off method" for hyponym extraction works. If it does, the system is totally language independent.

My questions

  • Although this system doesn't need a parser, POS tagger, or other linguistic tools, it works on semi-structured data, such as HTML files. I am curious whether it can work on data without any structure, that is, plain text.