Revision as of 00:46, 27 October 2010
Citation
Pasca, M. 2007. Organizing and Searching the World Wide Web of Facts, Step Two: Harnessing the Wisdom of the Crowds. In Proceedings of the 16th World Wide Web Conference (WWW 2007), pages 101-110, Banff, Canada.
Online version
Summary
The first step towards acquiring an extensive World Wide Web of facts is mining facts from Web documents; that step was described in earlier work. To get the most out of step 1, the author suggests obtaining the types of facts and class attributes of common interest directly from people, in the form of Web search query logs. This paper therefore introduces step 2: mining the query logs to extract more attributes for a target class, starting from either 5 seed attributes or 10 seed instances.
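The seed-based extraction from query logs can be illustrated with a minimal sketch. This is not the paper's exact method: the pattern, the toy query log, and the seed instances below are hypothetical, and the real system works over 50 million queries with more patterns and a ranking step based on similarity to seed attributes.

```python
import re
from collections import Counter

# Hypothetical seed instances for a target class such as "Country";
# the paper uses 10 seed instances (or 5 seed attributes) per class.
seed_instances = {"france", "japan", "brazil"}

# Toy stand-in for the anonymized query log.
queries = [
    "population of france",
    "capital of japan",
    "president of brazil",
    "population of japan",
    "map of france",
    "weather in paris",
]

# Simplified extraction pattern: "<candidate attribute> of <instance>".
pattern = re.compile(r"^(.+) of (.+)$")

candidate_attributes = Counter()
for q in queries:
    m = pattern.match(q)
    if m and m.group(2) in seed_instances:
        candidate_attributes[m.group(1)] += 1

# Candidates ranked by how many queries pair them with a seed instance;
# "population" matches two seed instances here, so it ranks first.
print(candidate_attributes.most_common())
```

Counting how often a candidate attribute co-occurs with different seed instances is a crude stand-in for the similarity functions the paper experiments with, but it shows why queries like "population of france" are such a direct signal of class attributes.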
Mining queries vs documents
- Amount of text: On average, a query contains only 2 words, whereas a document may contain thousands. In theory, more data means better results.
- Ambiguity: While web documents have clear content, many web queries are ambiguous due to their lack of grammatical structure, typos, and misspellings. However, since most search engines do not provide interactive search sessions, web users try to formulate clear and unambiguous queries so that they get their information quickly.
- Capturing human knowledge: Since people form queries using their common-sense knowledge, queries are a good way of capturing this knowledge.
The author used a random sample of 50 million unique, fully-anonymized queries submitted to Google. A large fraction of these queries are only 2-3 words long, which makes seeing a class attribute together with a class instance in the same query less likely.
40 target classes from different domains are used. For each class, 5 independently chosen seed attributes are given. Several similarity functions were tried. While assessing the extracted attributes, 3 labels are used: vital (1.0), okay (0.5), or wrong (0.0).
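Given those label values, the quality of a ranked attribute list can be summarized as the average label value of the top-k extracted attributes. The following is a minimal sketch with a hypothetical assessed ranking; only the label scheme (vital = 1.0, okay = 0.5, wrong = 0.0) comes from the paper.

```python
# Label values follow the paper's assessment scheme.
LABELS = {"vital": 1.0, "okay": 0.5, "wrong": 0.0}

# Hypothetical assessments for one class's top-5 extracted attributes,
# in rank order.
ranked_assessments = ["vital", "vital", "okay", "wrong", "okay"]

def precision_at(k, assessments):
    """Average label value over the top-k extracted attributes."""
    top_k = assessments[:k]
    return sum(LABELS[a] for a in top_k) / len(top_k)

print(precision_at(5, ranked_assessments))  # (1 + 1 + 0.5 + 0 + 0.5) / 5 = 0.6
```

Averaging graded labels rather than binary correct/incorrect judgments lets "okay" attributes contribute partial credit, which fits the fuzzy nature of class attributes.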