Document representation and query expansion models for blog recommendation

This is a summary of a research paper as part of Social Media Analysis 10-802, Fall 2012.

Citation

J. Arguello, J. L. Elsas, J. Callan, and J. G. Carbonell. Document representation and query expansion models for blog recommendation. In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM), 2008.

Online Version

Direct PDF link

Abstract from the paper

We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.

Summary

Overview

This paper explores different document representation models and query expansion techniques for the blog recommendation task, as compared to the traditional ad-hoc information retrieval task. The authors aim to provide a ranked list of blogs that satisfies the user's information need, which is represented solely through his/her query. Some of the differences between blog retrieval and ad-hoc retrieval pointed out by the authors are:

  • The relevance of a blog to a query depends on all the posts in it, rather than on a single document as in ad-hoc retrieval
  • The relevance of just one post may not make the entire blog good for recommendation
  • A short query may not accurately represent the user's information need and his/her interests in the various aspects of discussion usually present on blogs
  • Unlike traditional documents, blogs contain a lot of noise in the form of reader comments and spam

To address these differences from ad-hoc retrieval, the authors explore two aspects of blog retrieval: blog (document) representation and query expansion.

Blog Representation and Retrieval Models

They propose two different blog representation models: the large document model and the small document model.

Large Document Representation Model

In this model, all the posts within a blog are treated as one big document for indexing, which makes it easy to apply traditional information retrieval techniques. They use Indri's language modelling for retrieval and a Markov random field retrieval model for ranking the blogs. This retrieval model extends the query with ordered and unordered window constraints to obtain the final relevance scores. The model is likely to have some drawbacks: larger posts in a blog can dominate its language model, and document length normalization may need to be more robust to handle skewed document lengths, since blogs contain varying numbers of posts depending on how frequently they are updated.
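As a rough illustration of the large document idea, the sketch below concatenates a blog's posts into one bag of words and scores it with a unigram query likelihood. It is a minimal sketch under stated assumptions, not the authors' Indri setup: it uses Dirichlet smoothing with an arbitrary `mu`, omits the Markov random field window features, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def large_document_score(query_terms, blog_posts, collection_counts,
                         collection_len, mu=2500.0):
    """Log query likelihood of a blog treated as one concatenated document.

    Assumption: unigram model with Dirichlet smoothing (mu is illustrative).
    """
    # Merge every post of the blog into a single bag of words.
    blog_counts = Counter(term for post in blog_posts for term in post)
    blog_len = sum(blog_counts.values())
    score = 0.0
    for term in query_terms:
        # Floor the collection count so unseen terms never produce log(0).
        p_coll = max(collection_counts.get(term, 0), 0.5) / collection_len
        score += math.log((blog_counts[term] + mu * p_coll) / (blog_len + mu))
    return score


# Hypothetical toy collection: each blog is a list of tokenized posts.
blogs = {
    "cooking_blog": [["pasta", "recipe", "tomato"], ["bread", "recipe", "oven"]],
    "travel_blog": [["flight", "hotel", "rome"], ["pasta", "rome", "dinner"]],
}
collection_counts = Counter(t for posts in blogs.values() for p in posts for t in p)
collection_len = sum(collection_counts.values())

query = ["pasta", "recipe"]
ranking = sorted(
    blogs,
    key=lambda b: large_document_score(query, blogs[b], collection_counts, collection_len),
    reverse=True,
)
```

Because every post contributes to the same bag of words, a single very long post can shift `blog_counts` on its own, which is the drawback noted above.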

Small Document Representation Model

In this model, the authors represent each post in a blog as a retrieval unit and aggregate the post rankings to obtain the rank of the entire blog. Using cues from distributed information retrieval, they rank higher the blogs that are likely to contain more relevant posts. They use the query likelihood model from the large document approach to obtain blog relevance scores.
[Image: Small document model 1.png]
Entry normalization helps to handle varying blog lengths. For the query likelihood, they smooth each post's language model with the blog and global language models, since the posts in a blog are not entirely independent of each other.
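The following is a minimal sketch of one way to read the small document idea: score each post with a query likelihood that interpolates post, blog, and collection language models, then average over the posts of the blog. The interpolation weights, the uniform averaging over posts, and all names are assumptions for illustration; the paper's exact formula and normalization may differ.

```python
from collections import Counter


def ml_prob(term, counts, length):
    """Maximum-likelihood probability of a term in a bag of words."""
    return counts.get(term, 0) / length if length else 0.0


def small_document_score(query_terms, blog_posts, collection_counts,
                         collection_len, weights=(0.5, 0.3, 0.2)):
    """Average per-post query likelihood for one blog.

    Assumptions: linear interpolation of post, blog, and collection models
    with illustrative weights, and every post weighted equally.
    """
    if not blog_posts:
        return 0.0
    w_post, w_blog, w_coll = weights
    blog_counts = Counter(t for post in blog_posts for t in post)
    blog_len = sum(blog_counts.values())
    total = 0.0
    for post in blog_posts:
        post_counts = Counter(post)
        likelihood = 1.0
        for term in query_terms:
            likelihood *= (
                w_post * ml_prob(term, post_counts, len(post))
                + w_blog * ml_prob(term, blog_counts, blog_len)
                + w_coll * ml_prob(term, collection_counts, collection_len)
            )
        # Uniform weight per post, so blogs with many posts are not favoured.
        total += likelihood / len(blog_posts)
    return total


# Hypothetical toy data: one focused blog, one with a single on-topic post.
focused = [["pasta", "sauce"], ["pasta", "recipe"], ["pasta", "oven"]]
mixed = [["pasta", "sauce"], ["flight", "hotel"], ["train", "ticket"]]
coll_counts = Counter(t for blog in (focused, mixed) for post in blog for t in post)
coll_len = sum(coll_counts.values())

focused_score = small_document_score(["pasta"], focused, coll_counts, coll_len)
mixed_score = small_document_score(["pasta"], mixed, coll_counts, coll_len)
```

In this toy setup the blog whose posts are consistently on topic scores higher than the blog with a single relevant post, which matches the intuition that one relevant post should not make a whole blog a good recommendation.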

Query Expansion

The authors suggest that blog retrieval may require modified automatic query expansion techniques compared to ad-hoc retrieval, both because of spam that specifically targets such methods and because blog search must cater to more specific information needs. They experiment with two methods for query expansion, discussed next.

Target Corpus Pseudo-Relevance Feedback

In this method, they use Indri's built-in pseudo-relevance feedback model: retrieve the top documents for a given query, expand the query with the most distinctive terms from these documents, and re-run the weighted, expanded query.
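The general pseudo-relevance feedback loop can be sketched as follows. This is not Indri's implementation: the term scoring here (feedback term frequency times a simple inverse document frequency) is a stand-in chosen for brevity, and the function name, parameters, and toy corpus are hypothetical.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, doc_freq, num_docs,
                 top_docs=2, num_terms=2):
    """Add the most distinctive terms from the top-ranked documents.

    Assumption: terms are scored by tf in the feedback set times log(N / df).
    """
    feedback = ranked_docs[:top_docs]
    tf = Counter(term for doc in feedback for term in doc)
    scores = {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
        if term not in query_terms
    }
    # Sort by score, breaking ties alphabetically for a deterministic result.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:num_terms]
    return list(query_terms) + best


# Hypothetical corpus, already ranked for the query ["pasta"].
docs = [
    ["pasta", "tomato", "tomato", "basil", "the"],
    ["pasta", "tomato", "the", "oven"],
    ["flight", "the", "hotel"],
    ["train", "the", "ticket"],
]
doc_freq = Counter(term for doc in docs for term in set(doc))
expanded = expand_query(["pasta"], docs, doc_freq, len(docs))
```

If the top-ranked documents are spam or off topic, the same procedure pulls in misleading terms, which is why feedback from a noisy target corpus can hurt rather than help.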

Wikipedia-based Query Expansion

Their Wikipedia-based expansion algorithm consists of the following steps:

  1. Run the base query against the Wikipedia corpus as a dependence model query
  2. Define top ranked R documents as relevant set and top ranked W documents as working set, where
  3. Rank the anchor phrases that occur in working set documents and link to documents in the relevant set
  4. Use the high-scoring anchor phrases to expand the base query

The size of R controls the variance in the topical aspects covered by the anchor phrases, and the size of W controls the search space for anchor phrases. They experiment with a range of sizes for R and W to examine how the results depend on them.
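The anchor phrase ranking step can be sketched as below. It assumes the Wikipedia articles have already been ranked for the base query and that each article's outgoing links are available as (anchor text, target title) pairs. The scoring rule, which gives more credit to links pointing at higher-ranked relevant articles, is an illustrative assumption, and all names and data are hypothetical.

```python
from collections import defaultdict


def rank_anchor_phrases(ranked_titles, links, r_size, w_size):
    """Rank anchor phrases found in the working set that link into the relevant set.

    Assumption: each qualifying link adds (r_size - rank of its target),
    so links to higher-ranked relevant articles count for more.
    """
    relevant = {title: rank for rank, title in enumerate(ranked_titles[:r_size])}
    working = ranked_titles[:w_size]
    scores = defaultdict(float)
    for source in working:
        for anchor, target in links.get(source, []):
            if target in relevant:
                scores[anchor] += r_size - relevant[target]
    # Highest score first, ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ranked Wikipedia titles and their outgoing links.
ranked_titles = ["Pasta", "Italian cuisine", "Tomato sauce", "Rome"]
links = {
    "Pasta": [("italian food", "Italian cuisine"), ("marinara", "Tomato sauce")],
    "Italian cuisine": [("pasta", "Pasta"), ("rome", "Rome")],
    "Tomato sauce": [("pasta", "Pasta"), ("italian cooking", "Italian cuisine")],
    "Rome": [("italian food", "Italian cuisine")],
}
ranked_anchors = rank_anchor_phrases(ranked_titles, links, r_size=2, w_size=4)
expansion_phrases = [anchor for anchor, _ in ranked_anchors[:2]]
```

Raising `r_size` admits anchors that point at lower-ranked, possibly off-topic articles, while raising `w_size` only widens where anchors are searched for; this mirrors the roles of R and W described above.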

Evaluation

The experiments were conducted on the TREC BLOG06 collection, using the same 45 queries as the Blog Distillation Task at TREC 2007.
[Image: Small document model eval 1.png]
The results show that, in terms of P@10, the large document model does as well as the small document model with an interpolated language model and smoothed probability estimates. On MAP, the large document model scores slightly better than the more complex small document model, while the combination of the large and small document models scores the highest.
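For readers unfamiliar with the two metrics, the sketch below computes P@k and (mean) average precision using their standard definitions. The rankings and relevance judgments are made-up toy data, not results from the paper.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings of blogs for two queries.
run_a = (["b1", "b2", "b3", "b4"], {"b1", "b3"})
run_b = (["b5", "b6", "b7", "b8"], {"b6"})

p_at_2 = precision_at_k(run_a[0], run_a[1], k=2)
ap_a = average_precision(*run_a)
map_score = mean_average_precision([run_a, run_b])
```

P@10 only looks at the first ten results, whereas MAP rewards placing every relevant blog early, so the two metrics can rank systems differently.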

[Image: Query expansion eval 1.png]
The results for query expansion show that Indri's pseudo-relevance feedback model did not improve performance, possibly because of the large amount of noise in the corpus. The Wikipedia-based query expansion model showed significant improvement in both MAP and P@10. This method is also independent of the document representation model, as it extracts its expansion phrases from an external corpus.

Discussion

This paper shows that both the large and the smoothed small document representation models do well for blog recommendation and that the two are complementary: combining them gives an improved result. The query expansion technique based on anchor text from Wikipedia also shows significantly improved results, which suggests that it captures the topical variety required by the information needs of blog search. It would be interesting to see whether the pseudo-relevance feedback model would show any improvement if spam detection methods were used to filter out spam blogs. Apart from anchor text, it might also be a good idea to experiment with the other metadata available on Wikipedia pages that could be used for query expansion.

Related Papers

Study Plan

Resources useful for understanding this paper