Pasca, WWW 2007

Citation

Pasca, M. 2007. Organizing and Searching the World Wide Web of Facts Step Two: Harnessing the Wisdom of the Crowds. In Proceedings of the 16th World Wide Web Conference (WWW-07), pages 101-110, Banff, Canada.

Online version

WWW-07

Summary

The first step towards the acquisition of an extensive World Wide Web of facts is mining facts from Web documents. That step is described in the paper Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge. To get the most out of step one, the authors suggest learning which types of facts and class attributes are of common interest to people from Web search query logs. The author therefore introduces step two: mining query logs to extract more attributes for a target class, starting from only 5 seed attributes or 10 seed instances, without any handcrafted extraction patterns or domain-specific knowledge.

This paper

Mining queries vs documents

  • Amount of text: On average, a query contains only two words, whereas a document may contain thousands. In theory, more data means better results.
  • Ambiguity: While web documents have relatively clear content, many web queries are ambiguous due to their lack of grammatical structure, typos, and misspellings. However, since most search engines do not provide interactive search sessions, web users try to formulate clear and unambiguous queries so that they get their information fast.
  • Capturing human knowledge: Since people form queries using their common-sense knowledge, queries are a good way of capturing this knowledge.

The algorithm is given in the figure below. It first creates a pool of candidate phrases from the queries that mention the seed instances. These phrases, together with the instances, are then matched against the queries to find query templates. The frequency of each query template is stored in a search-signature vector. The vectors of the seed attributes are merged into a reference search-signature vector, which is compared against each candidate's search-signature vector to identify the reliable candidate phrases.

[[File:Pseudo-code.png]]
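To make the flow concrete, here is a minimal Python sketch of the pipeline described above. The "X of Y" matching heuristic, the function names, and the use of cosine similarity are illustrative assumptions, not the paper's exact procedure; the paper builds its candidate pool and query templates from its own patterns and compares several similarity functions.

```python
from collections import Counter
import math

def extract_attributes(queries, seed_instances, seed_attributes, top_k=10):
    """Hypothetical sketch of query-log attribute extraction.

    queries:         iterable of raw query strings
    seed_instances:  instances of the target class (e.g. "germany")
    seed_attributes: known attributes of the class (e.g. "capital")
    """
    # Step 1: build a pool of candidate phrases -- here, any prefix
    # that precedes "of <instance>" in a query (an assumed heuristic).
    candidates = Counter()
    for q in queries:
        words = q.lower().split()
        for i in range(1, len(words) - 1):
            if words[i] == "of" and " ".join(words[i + 1:]) in seed_instances:
                candidates[" ".join(words[:i])] += 1

    # Step 2: for each phrase, record the frequency of every query
    # template it occurs in; a template is the query with the matched
    # phrase and instance replaced by placeholders.
    def signature(phrase):
        sig = Counter()
        for q in queries:
            ql = q.lower()
            for inst in seed_instances:
                pattern = f"{phrase} of {inst}"
                if pattern in ql:
                    sig[ql.replace(pattern, "X of Y")] += 1
        return sig

    # Step 3: merge the seed attributes' vectors into a single
    # reference search-signature vector.
    reference = Counter()
    for attr in seed_attributes:
        reference.update(signature(attr))

    # Step 4: rank candidates by similarity to the reference vector
    # (cosine here; the paper tries several similarity functions).
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    scored = [(cosine(signature(c), reference), c) for c in candidates]
    return [c for s, c in sorted(scored, reverse=True)[:top_k] if s > 0]
```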


The author used a random sample of 50 million unique, fully-anonymized queries submitted to Google. A large fraction of these queries contain only 2-3 words, which makes it less likely to see a class attribute together with a class instance in the same query.

40 target classes from different domains are used. For each class, 5 independently chosen seed attributes are given. Several similarity functions were tried. Three labels are used when assessing the extracted attributes: vital (1.0), okay (0.5), or wrong (0.0).
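As a rough illustration of how the graded labels can be turned into a precision score, a sketch follows; averaging the label values over the top-n ranked attributes is an assumption consistent with the weights above, not necessarily the paper's exact formula.

```python
# Hypothetical example: precision over a ranked list of extracted
# attributes, each judged vital (1.0), okay (0.5) or wrong (0.0).
LABEL_VALUES = {"vital": 1.0, "okay": 0.5, "wrong": 0.0}

def precision_at(labels, n):
    """Mean label value of the top-n attributes in the ranked list."""
    top = labels[:n]
    return sum(LABEL_VALUES[label] for label in top) / len(top)

# e.g. judged top-5 attributes for some target class:
judgments = ["vital", "vital", "okay", "wrong", "vital"]
print(precision_at(judgments, 5))  # 0.7
```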

The evaluations show that the quality of the extracted attributes varies among classes, but the average precision over all target classes is high, both in absolute value and relative to attributes extracted from query logs with handcrafted rules.