Topic Model Approach to Authority Identification

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

Citation

 author    = {Alexandre Passos and 
              Jacques Wainer and
              Aria Haghighi},
 title     = {What do you know? A topic-model approach to authority identification},
 journal   = {NIPS 2010 Workshop on Computational Social Science and the Wisdom of the Crowds},
 year      = {2010}

Online version

What do you know? A topic-model approach to authority identification

Summary

In this paper the authors present a preliminary study of basic approaches to identifying authoritative documents in a given domain using textual content, and report that their best-performing approach uses hierarchical topic models [Blei et al., 2004]. Authoritative documents are ones that exhibit novel and relevant information relative to a document collection while demonstrating domain knowledge. The authors cast authority identification as a ranking problem and focus on product reviews (books from GoodReads and restaurants from Yelp), using user "helpful" votes as a proxy for helpfulness and authoritativeness.

Dataset Description

The authors report results on two datasets:

 * Book reviews (GoodReads dataset)
   * First 326 books in the "Best Books Ever" category
   * First 60-odd reviews from each book
 * Restaurant reviews (Yelp academic dataset)
   * 283 most reviewed restaurants in the Boston/Cambridge area

The number of "helpful" user votes a review received was used as a proxy for ranking reviews by authoritativeness.

Task Description and Evaluation

Models (illustrative sketches of the scoring approaches are given after this list):

  • Heuristic Approaches:
 * random : Sort reviews randomly.
 * nwords : Sort reviews by number of words (longer reviews are treated as more authoritative).
 * unique : For each word w, let c_w be its count across all documents for all products, and let p_w be its count
   among the documents of a given product. Rank a review d of this product by the number of its words that are unique
   to this product within the document collection.
   Specifically, the score associated with a document is score(d) = |{w in d : c_w = p_w}|.
  • Summarization-Based Approaches:
 * sumbasic: Rank documents by the sum-basic criterion [Nenkova and Vanderwende, 2005], ordering reviews of the same product
   by how many high-frequency words they contain relative to the product document collection.
   The score of a document D is score(D) = (1/|D|) * sum_{w in D} P_C(w), where P_C is the unigram distribution over all
   reviews of the same product.
 * klsum: Rank by the kl-sum criterion [Haghighi and Vanderwende, 2009], i.e., by the unigram KL divergence KL(P_C || P_D),
   where P_C is a smoothed distribution over all reviews of the same product and P_D is a smoothed distribution for each
   review. Both distributions are drawn from a symmetric Dirichlet with hyper-parameter 0.01.
  • Discriminative Approach:
 * logreg: A regularized logistic regression classifier, trained to distinguish the best review of each product from the
   bottom 30%. The authors use L2 regularization and the L-BFGS optimizer [Byrd et al., 1995].
  • Topic Model
 A variation of hierarchical LDA [Blei et al., 2004] with a fixed tree structure. The model assumes each word in each review
 comes either from a background distribution common to all reviews or from a product-specific content distribution common to
 all reviews of the same product. Reviews are ranked by the number of rare content words they contain,
 score(d) = sum_{w in content(d)} 1/n_w, where n_w is the number of reviews of the product that used word w.
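A minimal sketch of the heuristic baselines (random, nwords, unique), assuming reviews are plain whitespace-tokenizable strings; the function names and tokenization are illustrative assumptions, not the authors' implementation.

 import random
 from collections import Counter

 def rank_random(reviews):
     """random: sort reviews in an arbitrary order."""
     order = list(range(len(reviews)))
     random.shuffle(order)
     return order

 def rank_nwords(reviews):
     """nwords: longer reviews are ranked as more authoritative."""
     return sorted(range(len(reviews)), key=lambda i: -len(reviews[i].split()))

 def rank_unique(product_reviews, all_reviews):
     """unique: score a review by the number of its words whose count within this
     product's reviews equals their count in the whole collection (c_w == p_w)."""
     c = Counter(w for doc in all_reviews for w in doc.split())      # counts over all products
     p = Counter(w for doc in product_reviews for w in doc.split())  # counts within this product
     score = lambda doc: sum(1 for w in set(doc.split()) if c[w] == p[w])
     return sorted(range(len(product_reviews)),
                   key=lambda i: -score(product_reviews[i]))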
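A minimal sketch of the summarization-based rankers (sumbasic, klsum), again over whitespace tokens; the score definitions follow the reconstructions above and the Dirichlet hyper-parameter 0.01 from the description, while everything else is an assumption.

 import math
 from collections import Counter

 def unigram_dist(docs, vocab, alpha=0.0):
     """Unigram distribution over vocab with symmetric Dirichlet (add-alpha) smoothing."""
     counts = Counter(w for doc in docs for w in doc.split())
     total = sum(counts[w] for w in vocab) + alpha * len(vocab)
     return {w: (counts[w] + alpha) / total for w in vocab}

 def rank_sumbasic(product_reviews):
     """sumbasic: average collection probability of the words in each review (higher is better)."""
     vocab = {w for doc in product_reviews for w in doc.split()}
     p_c = unigram_dist(product_reviews, vocab)
     def score(doc):
         toks = doc.split()
         return sum(p_c[w] for w in toks) / max(len(toks), 1)
     return sorted(range(len(product_reviews)), key=lambda i: -score(product_reviews[i]))

 def rank_klsum(product_reviews, alpha=0.01):
     """klsum: rank by KL(P_C || P_D); a lower divergence means a better review."""
     vocab = {w for doc in product_reviews for w in doc.split()}
     p_c = unigram_dist(product_reviews, vocab, alpha)
     def kl(doc):
         p_d = unigram_dist([doc], vocab, alpha)
         return sum(p_c[w] * math.log(p_c[w] / p_d[w]) for w in vocab)
     return sorted(range(len(product_reviews)), key=lambda i: kl(product_reviews[i]))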
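A minimal sketch of the discriminative baseline using scikit-learn's L2-regularized logistic regression with the L-BFGS solver; the bag-of-words features and the default regularization strength are assumptions rather than the paper's exact setup.

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.linear_model import LogisticRegression

 def train_logreg(train_texts, train_labels):
     """train_texts: review strings; train_labels: 1 = best review of a product, 0 = bottom 30%."""
     vec = CountVectorizer()                      # assumed bag-of-words features
     X = vec.fit_transform(train_texts)
     clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
     clf.fit(X, train_labels)
     return vec, clf

 def rank_by_logreg(vec, clf, product_reviews):
     """Rank a product's reviews by the classifier's probability of being the best review."""
     probs = clf.predict_proba(vec.transform(product_reviews))[:, 1]
     return sorted(range(len(product_reviews)), key=lambda i: -probs[i])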
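Finally, a sketch of the topic-model ranking step, assuming the hierarchical-LDA inference (not shown) has already assigned, for each review, a set of words to the product-specific content distribution; the 1/n_w weighting follows the reconstruction above.

 from collections import Counter

 def rank_by_content_rarity(product_reviews, content_words_per_review):
     """content_words_per_review[i]: set of words of review i that the topic model
     assigned to the product-specific content distribution."""
     # n_w: number of this product's reviews whose text contains word w
     n = Counter()
     for doc in product_reviews:
         for w in set(doc.split()):
             n[w] += 1
     def score(i):
         return sum(1.0 / n[w] for w in content_words_per_review[i] if n[w] > 0)
     return sorted(range(len(product_reviews)), key=score, reverse=True)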

Figure: Passos et al fig1.jpg

  • Results:

Tables: Passos et al tab1.jpg, Passos et al tab2.jpg

Findings

  • The unique heuristic and the topic model give good results.
  • Concerns:
 - A review that has nothing in common with the topic at hand (such as spam) will tend to be ranked very highly, since its
   unusual words will be assigned to the product distribution and will have low product counts. The models assume that all
   documents are indeed relevant.
 - The topic model, while not without motivation, is still relatively heuristic: the mix of a topic model to select words and
   a tf-idf-like heuristic to rank documents is unusual.

Related papers

  • Understanding Hierarchical Topic Models

[Blei et al. 2004] D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.

Study plan

  • Hierarchical Topic Models
  • KL Divergence
  • Regularized Logistic Regression