Topic Model Approach to Authority Identification

From Cohen Courses

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

== Citation ==

 author    = {Alexandre Passos and
              Jacques Wainer and
              Aria Haghighi},
 title     = {What do you know? A topic-model approach to authority identification},
 journal   = {NIPS 2010 Workshop on Computational Social Science and the Wisdom of the Crowds},
 year      = {2010}

== Online version ==

What do you know? A topic-model approach to authority identification
== Summary ==

In this paper, the authors present a preliminary study of basic approaches to the problem of [[AddressesProblem::Authority_Identification|identifying authoritative documents]] in a given domain using textual content, and report that their best-performing approach uses hierarchical [[UsesMethod::Topic Model|Topic Models]] [Blei et al., 2004].
 
Authoritative documents are those that exhibit novel and relevant information relative to a document collection while demonstrating domain knowledge. The authors define the authoritativeness-identification task as a ranking problem and focus on product reviews (book reviews from [http://www.goodreads.com/ GoodReads] and restaurant reviews from [http://www.yelp.com/ Yelp]), using user votes as a proxy for helpfulness and authoritativeness.

== Dataset Description ==

The authors report results on two datasets.
 
* Book Reviews [[UsesDataset::GoodReads Dataset]]
** First 326 books in the "Best Books Ever" category
** First 60 or so reviews for each book
* Restaurant Reviews [[UsesDataset::Yelp_academic_dataset|Yelp Dataset]]
** 283 most-reviewed restaurants in the Boston/Cambridge area
  
 
The number of "helpful" user votes for each review was used as a proxy for ranking reviews by authoritativeness.
 
== Task Description and Evaluation ==

=== Models ===

==== Heuristic Approaches ====
* random: Reviews are sorted randomly.
* nwords: Reviews are sorted by number of words (longer reviews are ranked as more authoritative).
* unique: For each word <math>w</math>, let <math>g_{w}</math> be its count across all documents for all products, and let <math>p_{w}</math> be its count among documents of a given product. A review <math>d</math> of a product is ranked by the number of its words that are unique within the product's document collection. Specifically, the score of a document is <math>\sum_{w \in d\ \text{s.t.}\ p_{w}=1} \log(g_{w} + 1)</math> (a sketch of this scorer appears after the list).
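A minimal Python sketch of the unique scorer described above; the input format, tokenization, and all names are illustrative assumptions, not from the paper:

<pre>
import math
from collections import Counter

def unique_scores(products):
    """Score reviews by the 'unique' heuristic.

    `products` maps a product id to a list of reviews,
    each review being a list of word tokens (assumed format).
    """
    # g_w: count of each word across all documents of all products
    g = Counter(w for reviews in products.values()
                  for doc in reviews for w in doc)
    scores = {}
    for product, reviews in products.items():
        # p_w: count of each word among this product's documents
        p = Counter(w for doc in reviews for w in doc)
        for i, doc in enumerate(reviews):
            # sum log(g_w + 1) over words of d occurring once in the product
            scores[(product, i)] = sum(math.log(g[w] + 1)
                                       for w in set(doc) if p[w] == 1)
    return scores
</pre>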
 
  
==== Summarization-Based Approaches ====

* sumbasic: Rank documents by the SumBasic criterion [Nenkova and Vanderwende, 2005], ordering reviews of the same product by how many high-frequency words they contain relative to the product's document collection. The score of a document <math>d</math> is <math>\sum_{w \in d} P(w)</math>.
* klsum: Rank by the KLSum criterion [Haghighi and Vanderwende, 2009], i.e., by the unigram KL divergence <math>KL(P_{p}||P_{r})</math>, where <math>P_{p}</math> is a smoothed distribution over all reviews of the same product and <math>P_{r}</math> is a smoothed distribution for each review. Both distributions are drawn from a symmetric Dirichlet with hyperparameter 0.01 (a sketch of both scorers appears after the list).
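A minimal Python sketch of both summarization-based scorers as defined above; only the Dirichlet hyperparameter 0.01 comes from the paper, while the vocabulary handling and names are assumptions:

<pre>
import math
from collections import Counter

ALPHA = 0.01  # symmetric Dirichlet hyperparameter (from the paper)

def unigram_dist(docs, vocab):
    """Dirichlet-smoothed unigram distribution over `vocab`."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values()) + ALPHA * len(vocab)
    return {w: (counts[w] + ALPHA) / total for w in vocab}

def sumbasic_score(doc, product_dist):
    # sum_{w in d} P(w): higher = more high-frequency product words
    return sum(product_dist[w] for w in doc)

def klsum_score(doc, product_dist, vocab):
    # KL(P_p || P_r): lower = review closer to the product distribution
    review_dist = unigram_dist([doc], vocab)
    return sum(p * math.log(p / review_dist[w])
               for w, p in product_dist.items())
</pre>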
==== Discriminative Approach ====

* logreg: A regularized [[UsesMethod::Logistic regression]] classifier, trained to pick the best review of each product versus the bottom 30%. The authors used L2 regularization with <math>\sigma^2 = 5</math> and the [[L-BFGS]] optimizer [Byrd et al., 1995]. A hedged sketch follows.
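A scikit-learn sketch of the logreg baseline; the paper does not describe features or an implementation, so the bag-of-words features, the toy data, and the mapping of <math>\sigma^2 = 5</math> onto scikit-learn's inverse regularization strength C are all assumptions:

<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed toy data: 1 = best review of its product, 0 = bottom 30%
texts = ["thorough, knowledgeable review ...", "meh", "buy my stuff"]
labels = [1, 0, 0]

X = CountVectorizer().fit_transform(texts)

# L2 penalty with the lbfgs solver mirrors the paper's setup;
# C = 5.0 is an assumed stand-in for the Gaussian-prior
# variance sigma^2 = 5, up to the library's scaling conventions
clf = LogisticRegression(penalty="l2", C=5.0, solver="lbfgs")
clf.fit(X, labels)
</pre>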
 
==== Topic Model ====

A variation of hierarchical LDA [Blei et al., 2004] with a fixed tree structure. The model assumes each word in each review comes either from a background distribution common to all reviews or from a product-specific content distribution common to all reviews of the same product. Ranking is then performed by the number of rare content words, <math>\sum_{w \in p} \frac{1}{df_{w}}</math>, where <math>df_{w}</math> is the number of reviews of the product that use word <math>w</math> (a sketch of the ranking step follows).
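A minimal Python sketch of the ranking step only; it assumes the set of product-specific content words has already been inferred by the topic model (the inference itself is not reproduced here, and all names are illustrative):

<pre>
from collections import Counter

def rank_reviews(reviews, content_words):
    """Rank one product's reviews by the sum of 1/df_w over content words.

    `reviews`: list of token lists; `content_words`: words the topic
    model assigned to the product-specific distribution (assumed given).
    """
    # df_w: number of this product's reviews containing word w
    df = Counter(w for doc in reviews for w in set(doc))
    def score(doc):
        return sum(1.0 / df[w] for w in set(doc) if w in content_words)
    return sorted(reviews, key=score, reverse=True)
</pre>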
  
[[File:Passos et al fig1.jpg]]
  
=== Results ===

Metrics used (a short sketch of these metrics appears after the list):

* Precision@K = <math>\frac{K}{c}</math>, where <math>c</math> is the number of reviews one has to look at to find the K best reviews.
* NDCG@K: normalized discounted cumulative gain of the first K elements of the ranked list.
* Average rank of the best review.
* Normalized average rank of the best review.
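A short Python sketch of Precision@K and the per-product rank of the best review, as defined above; NDCG@K is omitted, and the identifiers are illustrative:

<pre>
def precision_at_k(ranked_ids, gold_best_ids, k):
    """K / c, where c is how many reviews must be read, in ranked
    order, before all K best reviews have been found."""
    found = 0
    for c, rid in enumerate(ranked_ids, start=1):
        if rid in gold_best_ids:
            found += 1
            if found == k:
                return k / c
    return 0.0  # fewer than k gold-best reviews in the list

def best_review_rank(ranked_ids, best_id):
    # 1-indexed rank; averaged over products to get "average rank"
    return ranked_ids.index(best_id) + 1
</pre>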
 +
 
[[File:Passos et al tab1.jpg]]

[[File:Passos et al tab2.jpg]]
  
 
== Findings ==

* Though the topic-modeling approach performed best compared to the other baseline methods (with the unique heuristic also giving good results), the authors report two major issues with it. First, many irrelevant reviews (such as spam) were ranked very highly: a review that has nothing in common with the topic at hand tends to introduce new but unusual words, which are assigned to the product distribution with low product counts, leading to a high score. The models assume all documents are relevant. Second, the topic model, while not without motivation, is still relatively heuristic: the mix of a topic model to select words and a tf-idf-like heuristic to rank documents is unusual.
* The paper appears to be preliminary work by the authors in this domain, and there seems to have been no follow-up work, which suggests the approach was not pursued further.
* I would not recommend the paper for future reading. It lacks detail and fails to reach a satisfactory conclusion. As the authors themselves report, the current model ranks reviews by the number of rare content words, which causes irrelevant or spam reviews to be ranked highly. Other sources of information, such as author identity and star ratings, could be used to improve these results.
 
  
 
== Related papers ==

* Understanding hierarchical topic models: D. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. 2004. [http://books.nips.cc/papers/files/nips16/NIPS2003_AA03.pdf Hierarchical topic models and the nested Chinese restaurant process]. Advances in Neural Information Processing Systems, 16:106.

* Understanding the SumBasic criterion: A. Nenkova and L. Vanderwende. 2005. [http://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf The impact of frequency on summarization]. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.

* Understanding the KLSum criterion: A. Haghighi and L. Vanderwende. 2009. [http://acl.eldoc.ub.rug.nl/mirror/N/N09/N09-1041.pdf Exploring content models for multi-document summarization]. In Proceedings of HLT-NAACL 2009, pages 362–370. Association for Computational Linguistics.
  
 
== Study plan ==

* [http://books.nips.cc/papers/files/nips16/NIPS2003_AA03.pdf Hierarchical Topic Models]
** [http://en.wikipedia.org/wiki/Chinese_restaurant_process Chinese Restaurant Process]
** [[Gibbs_sampling|Gibbs Sampling]]
* [http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence KL Divergence]
* [[Logistic Regression]]
** [http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 L2 Regularization]
