Dong et al WWW 2010


Citation

Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Jing Bai, Fernando Diaz, Yi Chang, Zhaohui Zheng: Time is of the Essence: Improving Recency Ranking Using Twitter Data. Proceedings of the 19th International Conference on World Wide Web, 2010.

author = {Anlei Dong and
         Ruiqiang Zhang and
         Pranam Kolari and
         Jing Bai and
         Fernando Diaz and
         Yi Chang and
         Zhaohui Zheng},
title  = {Time is of the Essence: Improving Recency Ranking Using Twitter Data},
conference = {international conference on World wide web},
year   = {2010},
keywords = {Twitter, recency ranking, recency modeling},
url    = {http://portal.acm.org/citation.cfm?id=1772725 },

Online version

[1]

Summary

This paper discusses using URLs found in tweets to improve ranking in terms of both relevance and recency. Many queries received by a web portal can be classified as "recency sensitive." Results for these queries must not only be "fresh" (meaning newly created and about recent topics) but also relevant. These queries pose a particular problem for search engines because very recent documents may not be indexed yet, and even if they are indexed, the current index may contain too little link structure to determine the page's rank. Similarly, ranking by click-tracking may suffer from a lack of data. The authors propose the following three points:

  1. Twitter is likely to contain URLs of uncrawled documents likely to be relevant to recency sensitive queries.
  2. The text of Twitter posts can be used to expand document text representations.
  3. The social network of Twitter users can be used to improve ranking.

Using Twitter data, the authors intend to overcome shortcomings of previous strategies for presenting recency sensitive query results. For example, many web portals show a news vertical using newsfeed (RSS) results related to the query. The authors argue that many relevant and recent results might not be news items, and thus would not be present in the RSS feeds tracked by the portal.


Brief description of the method

Crawling

The authors wish to avoid the overhead of crawling all URLs found in a Twitter stream, and also want to keep some types of pages, such as spam and porn, out of the recency results. They use heuristics to filter the set of URLs to be crawled. Any URL referred to by the same user more than twice is filtered, as are URLs referred to by only one user. Together, these filters can reduce the number of URLs to be crawled 20-fold.
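
The two filtering heuristics above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the input format (a list of `(user_id, url)` pairs, one per tweet), the function name `filter_urls`, and its parameter names are hypothetical. It also assumes one reading of the first rule, namely that a URL is dropped as soon as any single user has posted it more than twice.

```python
from collections import Counter, defaultdict


def filter_urls(postings, max_posts_per_user=2, min_distinct_users=2):
    """Return the set of URLs that survive both heuristics.

    postings: iterable of (user_id, url) pairs, one per tweet containing a URL
    (hypothetical input format, assumed for this sketch).
    """
    postings = list(postings)
    # How many times each user posted each URL.
    posts_per_user = Counter(postings)
    # Which distinct users posted each URL.
    users_by_url = defaultdict(set)
    for user, url in postings:
        users_by_url[url].add(user)

    kept = set()
    for url, users in users_by_url.items():
        # Drop URLs that only one user ever referred to.
        if len(users) < min_distinct_users:
            continue
        # Drop URLs that some user referred to more than twice.
        if any(posts_per_user[(user, url)] > max_posts_per_user for user in users):
            continue
        kept.add(url)
    return kept
```

Only the URLs returned by such a filter would be passed to the crawler; the thresholds are exposed as parameters because the summary gives the rules but not how they were tuned.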

Ranking

Features

In a typical search setting, results are represented by content features (attributes of the page in isolation) and aggregate features (information external to the page, such as PageRank and click rates). As mentioned in the summary, aggregate features are expected to be less significant for ranking very recent documents. To address this, the authors introduce the following Twitter features:

  • cosine similarity between query and sum of all term occurrences across all tweets containing the URL
  • term overlap between query and sum of all term occurrences across all tweets containing the URL
  • average number of followers for the users who issued the tiny URL
  • average post number for the users who issued the tiny URL
  • average number of users who retweeted the tweets containing the tiny URL
  • average number of users who replied to those users that issued the tiny URL
  • average number of followings for the users who issued the tiny URL
  • average Twitter score of all the users who issued the tiny URL
  • number of followers for the user who first issued the tiny URL
  • number of posts by the user who first issued the tiny URL
  • number of users who retweeted the user who first issued the tiny URL
  • number of users who replied to the user who first issued the tiny URL
  • number of followings for the user who first issued the tiny URL
  • Twitter score of the user who first issued the tiny URL
  • number of followers for the user who issued the tiny URL with the highest Twitter score
  • number of posts by the user who issued the tiny URL with the highest Twitter score
  • number of users who retweeted the user who issued the tiny URL and has the highest Twitter score
  • number of users who replied to the user who issued the tiny URL and has the highest Twitter score
  • number of followings for the user who has the highest Twitter score among the users that issued the tiny URL
  • Twitter score of the user who issued the tiny URL and has the highest Twitter score
  • number of different users who sent the tiny URL.
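
As a rough illustration of how a few of the features listed above might be computed, the sketch below derives the two textual features and a handful of the user features for a single URL. Everything named here is an assumption made for the example: whitespace tokenization, term overlap defined as the fraction of distinct query terms that occur in the tweets, and user records given as dictionaries with hypothetical `id`, `followers`, and `score` fields (the Twitter score is treated as a precomputed number, since this summary does not define it).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()


def tweet_text_features(query, tweets):
    """Cosine similarity and term overlap between a query and the summed
    term counts of all tweets that contain the URL."""
    q = Counter(tokenize(query))
    d = Counter()
    for tweet in tweets:
        d.update(tokenize(tweet))
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    cosine = dot / norm if norm else 0.0
    # Assumed definition: share of distinct query terms seen in the tweets.
    overlap = sum(1 for w in q if w in d) / len(q) if q else 0.0
    return cosine, overlap


def user_features(users):
    """users: non-empty list of user records in posting order
    (hypothetical dicts with 'id', 'followers', and 'score')."""
    first = users[0]
    top = max(users, key=lambda u: u["score"])
    return {
        "avg_followers": sum(u["followers"] for u in users) / len(users),
        "first_followers": first["followers"],
        "top_score_followers": top["followers"],
        "num_distinct_users": len({u["id"] for u in users}),
    }
```

The remaining features in the list follow the same three patterns (average over all users, value for the first user, value for the highest-scoring user) applied to posts, retweets, replies, followings, and the Twitter score.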

Models

The authors trained three ranking models: one based on only content features, one based on content and aggregate features, and one based on content and Twitter features. These models are trained using the Gradient Boosted Decision Tree algorithm.
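
For readers unfamiliar with gradient boosted decision trees, the following is a minimal sketch of the general technique, not the authors' system. It assumes a pointwise setup with squared loss (each query-document feature vector is regressed onto a relevance grade) and uses depth-1 trees (stumps) to stay short; the function names and parameters are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error.
    Returns (feature, threshold, left_value, right_value), or None if no
    split separates the rows."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            if not left or not right:
                continue
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lv, rv)
    return best[1:] if best else None


def fit_gbdt(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each round fits a stump to the
    current residuals (the negative gradient) and adds a shrunken copy."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        f, thr, lv, rv = stump
        stumps.append(stump)
        preds = [
            p + learning_rate * (lv if row[f] <= thr else rv)
            for row, p in zip(X, preds)
        ]
    return base, stumps, learning_rate


def predict(model, row):
    """Score one feature vector; higher scores rank earlier."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (lv if row[f] <= thr else rv) for f, thr, lv, rv in stumps
    )
```

In this framing, the three models differ only in which columns appear in `X`: content features alone, content plus aggregate features, or content plus Twitter features. Documents for a query are then sorted by their predicted score.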

Evaluation

Queries were collected from the Yahoo! portal for one hour each day over several days. Queries automatically classified as being time-sensitive were kept. The authors generated ranked results for each query with each of the models above, then evaluated these results using normalized discounted cumulative gain (NDCG), which conflates recency and relevance. To measure recency in isolation, they introduced their own metric, discounted cumulative freshness (DCF), based on binary freshness labels and documents' positions in the results list.
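
The two metrics can be sketched as below. This is an illustration under assumptions rather than the paper's exact formulas: it uses a linear gain with the common log2(position + 1) discount (some NDCG variants use an exponential gain instead), and it assumes DCF applies the same discount to binary freshness labels (1 = fresh, 0 = not fresh). The function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(position + 1) discount,
    where position is 1-based."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(grades, k=None):
    """DCG of the ranked grades divided by the DCG of the ideal ordering."""
    k = len(grades) if k is None else k
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0


def dcf(fresh_labels, k=None):
    """Discounted cumulative freshness: the same positional discount applied
    to binary freshness labels instead of relevance grades (assumed form)."""
    k = len(fresh_labels) if k is None else k
    return dcg(fresh_labels[:k])
```

For example, with these definitions a result list whose first and third documents are fresh gets `dcf([1, 0, 1]) == 1.5`, while moving the second fresh document up to position two would raise the score, which is the position sensitivity the metric is meant to capture.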

This evaluation showed that the model based on content-only features underperformed the baseline of content and aggregate features, even for fresh documents that were impoverished in terms of the aggregate features. The model using Twitter features outperformed both.

Datasets

The authors use the Twitter API in addition to their own (Yahoo's) document indexes and query logs.