Pasca, WWW 2007

From Cohen Courses

Citation

Pasca M. 2007. Organizing and Searching the World Wide Web of Facts Step Two: Harnessing the Wisdom of the Crowds. In Proceedings of the 16th World Wide Web Conference (WWW-07). pages 101-110, Banff, Canada.

Online version

WWW-07

Summary

The first step toward acquiring an extensive World Wide Web of facts is mining Web documents, as described in Pasca et al, AAAI 2006. To get the most out of that first step, the author suggests learning which types of facts and class attributes are of common interest to people, as reflected in Web search query logs. The paper therefore introduces a second step: mining query logs to find more attributes for a target class, using 5 seed attributes or 10 seed instances and no handcrafted extraction patterns or domain-specific knowledge.

Mining queries differs from mining documents in several ways:

  • Amount of text: On average a query contains only 2 words, whereas a document may contain thousands. In theory, more data means better results.
  • Ambiguity: Web documents usually have clear content, whereas many Web queries are ambiguous because they lack grammatical structure and contain typos and misspellings. However, since most search engines do not provide interactive search sessions, Web users try to write clear and unambiguous queries to get their information quickly.
  • Capturing human knowledge: People form queries using their common-sense knowledge, so queries are a good way of capturing this knowledge.

The algorithm used in the paper is given in the figure below. In outline, it creates a pool of candidate phrases by using the instances. These phrases, together with the instances, are then matched against the queries to find query templates. The frequencies of these query templates are stored in search signature vectors, one per candidate phrase. The vectors of the seed phrases are merged into a reference search signature vector, and each candidate's search signature vector is compared against it to find the reliable candidate phrases.

Pseudo-code.png
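
The steps described above can be illustrated with a small Python sketch. This is a loose approximation under stated assumptions, not the paper's exact procedure: a candidate phrase is taken to be whatever remains of a query after removing an instance, a query template is the query with the instance and phrase replaced by the placeholders [I] and [P], and cosine similarity stands in for the similarity functions tried in the paper. All function names are hypothetical.

```python
from collections import Counter
import math


def find_span(tokens, phrase_tokens):
    """Return the start index of phrase_tokens inside tokens, or -1."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def candidate_phrases(queries, instances):
    """Assumption: a candidate is the remainder of a query that contains an instance."""
    pool = set()
    for query in queries:
        tokens = query.lower().split()
        for instance in instances:
            inst = instance.lower().split()
            i = find_span(tokens, inst)
            if i >= 0:
                rest = tokens[:i] + tokens[i + len(inst):]
                if rest:
                    pool.add(" ".join(rest))
    return pool


def search_signature(phrase, queries, instances):
    """Count the query templates in which the phrase co-occurs with an instance."""
    signature = Counter()
    phrase_tokens = phrase.lower().split()
    for query in queries:
        tokens = query.lower().split()
        for instance in instances:
            inst = instance.lower().split()
            i = find_span(tokens, inst)
            if i < 0:
                continue
            masked = tokens[:i] + ["[I]"] + tokens[i + len(inst):]
            j = find_span(masked, phrase_tokens)
            if j < 0:
                continue
            template = masked[:j] + ["[P]"] + masked[j + len(phrase_tokens):]
            signature[" ".join(template)] += 1
    return signature


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors (an assumed choice)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(queries, instances, seed_attributes):
    """Rank non-seed candidates by similarity to the merged seed signature."""
    reference = Counter()
    for seed in seed_attributes:
        reference.update(search_signature(seed, queries, instances))
    seeds = {s.lower() for s in seed_attributes}
    scored = [
        (cosine(search_signature(p, queries, instances), reference), p)
        for p in candidate_phrases(queries, instances)
        if p not in seeds
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy query log, invented purely for illustration.
toy_queries = [
    "aspirin side effects", "side effects of aspirin",
    "dosage of aspirin", "aspirin dosage", "ibuprofen dosage",
    "buy aspirin", "price of ibuprofen",
]
ranking = rank_candidates(toy_queries, ["aspirin", "ibuprofen"], ["side effects"])
```

On this toy log, "dosage" shares both of the seed's templates ("[I] [P]" and "[P] of [I]") and so ranks first, while remainders such as "buy" or "price of" appear only in a template the seed never uses and score zero.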

The author used the Google Web Queries (Pasca) data set to extract class attributes from query logs. A large fraction of the queries in this data set are only 2-3 words long, which makes it less likely that a class attribute appears together with a class instance.

40 target classes from different domains are used. For each class, 5 independently chosen attributes are given. Several similarity functions were tried. Each extracted attribute is assessed with one of 3 labels: vital (1.0), okay (0.5), or wrong (0.0).
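
As a rough illustration of how the three labels can be turned into a score, the sketch below computes precision at rank N as the sum of the label values of the top N attributes divided by N, then averages it over classes. The exact formula is an assumption made for this example, as are the function names and the sample labels.

```python
# Numeric values of the three assessment labels, as given in the summary.
LABEL_VALUES = {"vital": 1.0, "okay": 0.5, "wrong": 0.0}


def precision_at(labels, n):
    """Assumed definition: summed label values of the top n attributes over n.

    Ranks beyond the end of the list contribute 0, as if labeled wrong.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(LABEL_VALUES[label] for label in labels[:n]) / n


def mean_precision(labels_by_class, n):
    """Average precision at rank n over all target classes."""
    scores = [precision_at(labels, n) for labels in labels_by_class.values()]
    return sum(scores) / len(scores)


# Hypothetical judgments for two classes, ordered by extraction rank.
sample = {
    "Drug": ["vital", "okay"],
    "City": ["wrong", "vital"],
}
overall = mean_precision(sample, 2)
```

With the half-credit okay label, a list of mostly acceptable attributes scores lower than a list of vital ones, which a plain correct/incorrect judgment would not distinguish.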

The evaluation shows that the quality of the attributes varies among classes, but the average precision over all target classes is high, both in absolute value and relative to attributes extracted from query logs with handcrafted rules.

This class attribute extraction paper is followed by Pasca, CIKM 2007, in which the author applies named entity finding to the same corpus to extract instances.