Identifying influential bloggers: WSDM 2008

From Cohen Courses

Revision as of 15:55, 31 March 2011

Citation

Nitin Agarwal, Huan Liu, Lei Tang, Philip S. Yu, "Identifying the Influential Bloggers in a Community", Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008.

Online version

Available at CiteSeer

Summary

This paper aims to identify the most influential bloggers in a blogging community. It first proposes metrics for assessing how influential a blog post is. The authors then run experiments on blogs from a few blog sites and evaluate the results qualitatively.


What makes a Blog influential

Recognition: An influential blog post is recognized by many, which can be judged by its number of in-links (ι), i.e. the number of other posts referencing it.
Activity Generation: A blog post that generates more activity is presumed to be more influential. This is measured by the number of comments made on the post (γ).
Novelty: Novel ideas are presumed to be more influential [1]. A post that references many other posts (i.e. has many out-links) is presumed to contain fewer novel ideas, so novelty is taken to be negatively correlated with the number of out-links (θ).
Eloquence: More eloquent posts are more influential [1]. The authors use the length of the blog post (λ) as a measure of eloquence.
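
As a rough illustration of how the four statistics could feed a single score, the sketch below combines them linearly. The function name, the weights and the functional form are assumptions made for illustration only; the summary above does not state the paper's actual scoring formula.

```python
def influence_score(in_links, comments, out_links, length,
                    w_in=1.0, w_comments=1.0, w_out=1.0, w_length=1.0):
    """Toy influence score for one blog post (illustrative, not the paper's formula).

    in_links  (iota):   recognition, raises the score
    comments  (gamma):  activity generation, raises the score
    out_links (theta):  proxy for lack of novelty, lowers the score
    length    (lambda): proxy for eloquence, scales the score
    All weights are hypothetical tuning parameters.
    """
    flow = w_in * in_links - w_out * out_links
    return w_length * length * (w_comments * comments + flow)


# Two hypothetical posts of equal (normalized) length: the one with more
# in-links and comments and fewer out-links receives the higher score.
post_a = influence_score(in_links=12, comments=30, out_links=2, length=1.0)
post_b = influence_score(in_links=3, comments=5, out_links=9, length=1.0)
print(post_a, post_b)
```

Subtracting the out-link term is one simple way to encode the negative correlation between novelty and out-links; other monotone penalties would serve equally well.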

The Math behind this

Reader Measure

Given the full set of comments to a blog, the authors construct a directed reader graph G_R = (V_R, E_R). Each node r_a in V_R is a reader, and an edge e_R(r_b, r_a) exists if r_b mentions r_a in one of r_b's comments. The weight on an edge, W_R(r_b, r_a), is the ratio between the number of times r_b mentions r_a and the total number of times r_b mentions other readers (including r_a). The authors compute the reader authority A(r_a) using a ranking algorithm, shown in Equation 1, where |R| denotes the total number of readers of the blog and d is the damping factor.
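
Since Equation 1 is not reproduced here, the following is a minimal sketch of a PageRank-style authority computation over the reader graph. It assumes the standard PageRank convention, in which d weights the term that follows mention edges; the original equation may place d differently. The function names and the sample readers are hypothetical.

```python
from collections import defaultdict


def edge_weights(mentions):
    """mentions: iterable of (mentioner, mentioned) pairs, one per mention.

    Returns {mentioner: {mentioned: weight}}, where the weight is the share
    of the mentioner's mentions that go to that reader.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in mentions:
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(targets.values()) for dst, n in targets.items()}
        for src, targets in counts.items()
    }


def reader_authority(readers, mentions, d=0.85, iterations=50):
    """PageRank-style authority; d weights the mention-following term (assumed)."""
    weights = edge_weights(mentions)
    n = len(readers)
    authority = {r: 1.0 / n for r in readers}
    for _ in range(iterations):
        updated = {r: (1.0 - d) / n for r in readers}
        for src, targets in weights.items():
            for dst, w in targets.items():
                updated[dst] += d * w * authority[src]
        authority = updated
    return authority


# Hypothetical blog: bob mentions ann once; cat mentions ann twice and bob once.
readers = ["ann", "bob", "cat"]
mentions = [("bob", "ann"), ("cat", "ann"), ("cat", "bob"), ("cat", "ann")]
authority = reader_authority(readers, mentions)
print(max(authority, key=authority.get))
```

A fixed iteration count keeps the sketch short; a convergence check on the change in scores would be the usual refinement.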



The reader measure of a word w_k, denoted by RM(w_k), is given in Equation 2, where tf(w_k, c_i) is the term frequency of word w_k in comment c_i.
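
One plausible reading of Equation 2, which is not reproduced here, is a sum of term frequencies weighted by the authority of each comment's author. The sketch below assumes that form, whitespace tokenization, and a hypothetical list of (author, text) pairs.

```python
from collections import Counter, defaultdict


def reader_measure(comments, authority):
    """comments: list of (author, text); authority: {author: score}.

    Returns {word: sum over comments of tf(word, comment) * authority[author]}.
    This weighting is an assumed form, not copied from the source.
    """
    measure = defaultdict(float)
    for author, text in comments:
        term_freq = Counter(text.lower().split())
        for word, tf in term_freq.items():
            measure[word] += tf * authority[author]
    return dict(measure)


# Hypothetical comments and authority scores.
comments = [("ann", "great post great idea"), ("bob", "great point")]
authority = {"ann": 0.6, "bob": 0.4}
rm = reader_measure(comments, authority)
print(rm["great"], rm["point"])
```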

Quotation Measure

For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph G_Q = (V_Q, E_Q). Each node c_i in V_Q is a comment, and an edge (c_j, c_i) indicates that c_j quoted sentences from c_i. The weight on an edge, W_Q(c_j, c_i), is 1 over the number of comments that c_j ever quoted. The authors derive the quotation degree D(c_i) of a comment c_i using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of 1/|C|, where |C| is the number of comments associated with the given post.

The quotation measure of a word w_k, denoted by QM(w_k), is given in Equation 4, where word w_k appears in comment c_i.
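
Equations 3 and 4 are not reproduced here, so the sketch below assumes a form consistent with the description: every comment starts from a base degree of 1 over the number of comments, each quoting comment passes its own degree on to the comments it quoted, split evenly among them, and a word's quotation measure sums the degrees of the comments containing it. The function names and the sample data are hypothetical.

```python
def quotation_degree(num_comments, quotes):
    """quotes: {j: set of earlier comment indices that comment j quoted}.

    Assumed form: D(i) = 1/N + sum over quoters j of D(j) / (number of
    comments j quoted), where N is the number of comments on the post.
    """
    degree = [1.0 / num_comments] * num_comments
    # Comments only quote earlier ones, so walking from newest to oldest
    # guarantees a quoter's degree is final before it is passed on.
    for j in range(num_comments - 1, -1, -1):
        quoted = quotes.get(j, ())
        for i in quoted:
            if i >= j:
                raise ValueError("a comment can only quote earlier comments")
            degree[i] += degree[j] / len(quoted)
    return degree


def quotation_measure(comment_words, degree):
    """comment_words: list of word sets, one per comment, aligned with degree."""
    measure = {}
    for words, d in zip(comment_words, degree):
        for word in words:
            measure[word] = measure.get(word, 0.0) + d
    return measure


# Comment 2 quotes comments 0 and 1; comment 3 quotes comment 2.
quotes = {2: {0, 1}, 3: {2}}
degree = quotation_degree(4, quotes)
qm = quotation_measure([{"a"}, {"a", "b"}, {"c"}, {"b"}], degree)
print(degree, qm)
```

Processing comments in reverse posting order works because the quotation graph is acyclic; a general graph would need a topological sort instead.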

Topic Measure

Given the set of comments associated with each blog post, the authors group these comments into topic clusters using the Single-Pass Incremental Clustering algorithm presented in [1]. The authors conjecture that a hotly discussed topic has a large number of comments, all close to the topic cluster centroid. They therefore propose Equation 5 to compute the importance of a topic cluster t_u, where |c_i| is the length of comment c_i in number of words, C is the set of comments, and sim(c_i, t_u) is the cosine similarity between comment c_i and the centroid of topic cluster t_u.



Equation 6 defines the topic measure of a word w_k, denoted by TM(w_k), where comment c_i is clustered into topic cluster t_u.
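
Because Equations 5 and 6 are not reproduced here, the following sketch assumes one form that matches the description: a cluster's importance is the length-weighted sum of its comments' cosine similarities to the cluster centroid, and a word's topic measure adds up the importance of the cluster of every comment containing the word. Bag-of-words vectors, the function names and the sample comments are assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def topic_measure(comments, clusters):
    """comments: list of token lists; clusters: list of lists of comment indices.

    Assumed forms: importance(t) = sum over comments c in t of
    (len(c) / total length of all comments) * cosine(c, centroid(t)), and
    TM(w) = sum of importance(t) over every clustered comment containing w.
    """
    total_len = sum(len(c) for c in comments)
    vectors = [Counter(c) for c in comments]
    measure = {}
    for members in clusters:
        # Summed counts stand in for the centroid; cosine ignores the scale.
        centroid = Counter()
        for i in members:
            centroid.update(vectors[i])
        importance = sum(
            len(comments[i]) / total_len * cosine(vectors[i], centroid)
            for i in members
        )
        for i in members:
            for word in vectors[i]:
                measure[word] = measure.get(word, 0.0) + importance
    return measure


# Hypothetical comments, already grouped into two topic clusters.
comments = [["good", "camera"], ["camera", "lens"], ["slow", "shipping"]]
tm = topic_measure(comments, clusters=[[0, 1], [2]])
print(tm)
```

The clustering step itself is taken as given here; the sketch only scores clusters that some other procedure has already produced.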

Overall Word Representativeness or Importance Score

The representativeness score of a word w_k, denoted by Rep(w_k), is the combination of the reader, quotation and topic measures in the ReQuT model. The three measures are first normalized independently by their respective maximum values and then combined linearly to derive Rep(w_k) using Equation 7. In this equation α, β and γ are the coefficients (0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0).
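
The normalization and linear combination described above can be sketched as follows. The function name, the dictionary-based inputs and the sample coefficient values are assumptions; only the max-normalization, the linear combination and the constraint on the three coefficients come from the text.

```python
def representativeness(rm, qm, tm, *, alpha, beta, gamma):
    """Combine reader, quotation and topic measures into one score per word.

    rm, qm, tm: {word: raw measure}. Each measure is divided by its own
    maximum, then the three are mixed with weights alpha, beta and gamma.
    """
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")

    def normalize(measure):
        top = max(measure.values(), default=0.0)
        return {w: (v / top if top else 0.0) for w, v in measure.items()}

    rm, qm, tm = normalize(rm), normalize(qm), normalize(tm)
    words = set(rm) | set(qm) | set(tm)
    return {
        w: alpha * rm.get(w, 0.0) + beta * qm.get(w, 0.0) + gamma * tm.get(w, 0.0)
        for w in words
    }


# Hypothetical raw measures for two words; a word missing from a measure
# contributes zero for that measure.
rep = representativeness(
    {"camera": 2.0, "lens": 1.0},
    {"camera": 0.5, "lens": 0.5},
    {"camera": 1.0},
    alpha=0.5, beta=0.25, gamma=0.25,
)
print(rep["camera"], rep["lens"])
```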

Sentence Selection Criteria

Density Based Selection: Based on the representativeness scores of keywords and the distance between adjacent keywords in a sentence. In Equation 8, K is the total number of keywords contained in the i-th sentence s_i, Score(w_j) is the representativeness score of keyword w_j, and distance(w_j, w_j+1) is the number of non-keywords (including stopwords) between the two adjacent keywords w_j and w_j+1 in s_i.

Summation Based Selection: Based on the number of keywords contained in a sentence. In Equation 9, |s_i| is the length of sentence s_i in number of words (including stopwords), and τ (τ > 0) is a parameter to flexibly control the contribution of a word's representativeness score.
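
Equations 8 and 9 are not reproduced here, so the sketch below shows one possible form of each criterion under stated assumptions. The density score sums the product of adjacent keyword scores divided by the squared gap between them and normalizes by K(K + 1); the gap is taken as the number of non-keywords in between plus one, an assumption made to avoid division by zero. The summation score raises each keyword score to the power τ, sums, takes the 1/τ root and divides by the sentence length. All function names and sample values are hypothetical.

```python
def density_score(tokens, scores):
    """tokens: words of one sentence; scores: {keyword: representativeness}."""
    positions = [i for i, t in enumerate(tokens) if t in scores]
    k = len(positions)
    if k < 2:
        return 0.0
    total = 0.0
    for left, right in zip(positions, positions[1:]):
        gap = right - left  # non-keywords in between, plus 1 (assumption)
        total += scores[tokens[left]] * scores[tokens[right]] / gap ** 2
    return total / (k * (k + 1))


def summation_score(tokens, scores, tau=1.0):
    """Length-normalized sum of keyword scores; tau > 0 shapes the contribution."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not tokens:
        return 0.0
    total = sum(scores[t] ** tau for t in tokens if t in scores)
    return total ** (1.0 / tau) / len(tokens)


# Hypothetical keyword scores and two sentences with the same keywords.
scores = {"camera": 1.0, "lens": 0.5}
sparse = ["the", "camera", "has", "a", "sharp", "lens"]
dense = ["camera", "lens"]
print(density_score(sparse, scores), density_score(dense, scores))
print(summation_score(sparse, scores), summation_score(dense, scores))
```

Under both criteria the shorter sentence, in which the keywords sit closer together, scores higher, which is the behavior the two descriptions suggest.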

Results

Two metrics were used: R-Precision and NDCG. NDCG is described in [2].
[Figure: Results.jpg]
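
For readers unfamiliar with the two metrics, here is a minimal sketch of each. R-Precision is the fraction of relevant items among the top R results, where R is the number of relevant items. The NDCG code uses the common log2-discount variant, which may differ in detail from the formulation in [2]. The function names and sample data are hypothetical.

```python
import math


def r_precision(ranked, relevant):
    """Fraction of the top-R ranked items that are relevant, R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """gains: graded relevance of the ranked items, in ranked order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical ranking of four sentences, two of which are relevant.
ranked = ["s1", "s2", "s3", "s4"]
relevant = {"s1", "s3"}
print(r_precision(ranked, relevant))
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```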

References

[1] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic text message streams. In Proc. of SIGIR ’06, pages 35–42, Seattle, Washington, 2006.
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR ’00, pages 41–48, Athens, Greece, 2000.