Difference between revisions of "Document representation and query expansion models for blog recommendation"

Revision as of 18:45, 5 November 2012

This is a summary of a research paper, written as part of Social Media Analysis 10-802, Fall 2012.

Citation

J. Arguello, J. L. Elsas, J. Callan, and J. G. Carbonell. Document representation and query expansion models for blog recommendation. In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM), 2008.

Online Version

Direct PDF link

Abstract from the paper

We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.

Summary

Overview

This paper explores document representation models and query expansion techniques for the blog recommendation task, contrasting it with the traditional ad-hoc information retrieval task. The goal is to return a ranked list of blogs that satisfies the user's information need, which is expressed solely through a query. The authors point out the following differences between blog retrieval and ad-hoc retrieval:

The relevance of a blog to a query depends on all of its posts, rather than on a single document as in ad-hoc retrieval.
The relevance of just one post may not make the entire blog a good recommendation.
A short query may not accurately represent the user's information need or their interest in the various aspects of a topic that blogs typically discuss.
Blogs contain a lot of noise, such as reader comments and spam, unlike traditional documents.

To address these differences from ad-hoc retrieval, the authors explore two aspects of blog retrieval: blog (document) representation and query expansion.

Blog Representation and Retrieval Models

They propose two blog representation models: the large document model and the small document model.

Large Document Representation Model

In this model, all the posts within a blog are treated as one large document for indexing, which makes it straightforward to apply traditional information retrieval techniques. The authors use Indri's language modelling for retrieval and a Markov random field retrieval model to rank the blogs. The Markov random field model extends the query with ordered and unordered window constraints, which are used in computing the final relevance scores. The model is likely to have some drawbacks: longer posts within a blog can dominate the blog's language model, and document length normalization may need to be more robust, because document lengths are skewed by the number of posts each blog contains, which in turn depends on how frequently the blog is updated.
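The large document idea can be illustrated with a minimal sketch. It is not the authors' Indri setup: it swaps in a plain unigram query likelihood with Dirichlet smoothing and omits the Markov random field window features. The function names, the whitespace tokenizer, and the smoothing value `mu=2000` are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption; a real system would use Indri's parsing).
    return text.lower().split()


def query_log_likelihood(query_terms, doc_tokens, collection_counts, collection_len, mu):
    """Dirichlet-smoothed unigram log P(query | document)."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_bg = collection_counts.get(term, 0) / collection_len
        if p_bg == 0.0:
            p_bg = 1e-9  # small floor for terms unseen in the collection
        score += math.log((tf[term] + mu * p_bg) / (doc_len + mu))
    return score


def rank_blogs_large_document(query, blogs, mu=2000.0):
    """blogs maps blog_id -> list of post strings; returns (blog_id, score) pairs, best first."""
    # Concatenate every post of a blog into one token stream: the "large document".
    blog_tokens = {
        blog_id: [tok for post in posts for tok in tokenize(post)]
        for blog_id, posts in blogs.items()
    }
    collection_counts = Counter(tok for toks in blog_tokens.values() for tok in toks)
    collection_len = sum(collection_counts.values())
    if collection_len == 0:
        return []
    query_terms = tokenize(query)
    scores = {
        blog_id: query_log_likelihood(query_terms, toks, collection_counts, collection_len, mu)
        for blog_id, toks in blog_tokens.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the posts are concatenated before scoring, a single very long post contributes most of the term counts for its blog, which is the domination effect noted above.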

Small Document Representation Model

In this model, the authors represent each post in a blog as a retrieval unit and aggregate the post rankings to obtain a rank for the entire blog. Using cues from distributed information retrieval, they rank higher the blogs that are likely to contain more relevant posts. They use the large document query likelihood model to obtain blog relevance scores.
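A minimal sketch of the small document idea follows. Each post is scored on its own and the post scores are then combined per blog. The aggregation rule used here, an unnormalized sum of post query likelihoods, is an assumption chosen for illustration and is not necessarily the formula used in the paper; the function names, the tokenizer, and `mu=2000` are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption).
    return text.lower().split()


def post_query_likelihood(query_terms, post_tokens, collection_counts, collection_len, mu):
    """Dirichlet-smoothed unigram P(query | post), returned as a probability."""
    tf = Counter(post_tokens)
    log_p = 0.0
    for term in query_terms:
        p_bg = collection_counts.get(term, 0) / collection_len
        if p_bg == 0.0:
            p_bg = 1e-9  # small floor for terms unseen in the collection
        log_p += math.log((tf[term] + mu * p_bg) / (len(post_tokens) + mu))
    return math.exp(log_p)


def rank_blogs_small_document(query, blogs, mu=2000.0):
    """blogs maps blog_id -> list of post strings; returns (blog_id, score) pairs, best first."""
    # Every post is its own retrieval unit.
    posts = [(blog_id, tokenize(post)) for blog_id, plist in blogs.items() for post in plist]
    collection_counts = Counter(tok for _, toks in posts for tok in toks)
    collection_len = sum(collection_counts.values())
    if collection_len == 0:
        return []
    query_terms = tokenize(query)
    blog_scores = defaultdict(float)
    for blog_id, toks in posts:
        # Assumed aggregation: add up post-level likelihoods, so a blog with
        # more matching posts accumulates a higher score.
        blog_scores[blog_id] += post_query_likelihood(
            query_terms, toks, collection_counts, collection_len, mu
        )
    return sorted(blog_scores.items(), key=lambda kv: kv[1], reverse=True)
```

With this choice, a blog whose posts repeatedly match the query outranks a blog with a single matching post. An unnormalized sum also favors blogs that simply have many posts, so a practical system would need some form of normalization or prior.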

[[File:small_document_model_1.png]]
