Identifying influential bloggers: WSDM 2008
Citation
Nitin Agarwal, Huan Liu, Lei Tang, Philip S. Yu, "Identifying the Influential Bloggers in a Community", Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008.
Summary
This paper aims to identify the most influential bloggers in a blogging community. It first proposes metrics for assessing how influential a blog post is. The authors then run experiments on blogs from a few blog sites and evaluate the results qualitatively.
What makes a Blog influential
Recognition: An influential blog post is recognized by many. This is judged by the number of in-links ($\iota$), i.e. the number of other posts that reference the post.
Activity Generation: A blog post that generates more activity is assumed to be more influential. This is measured by the number of comments made on the post ($\gamma$).
Novelty: Novel ideas are assumed to be more influential [1]. A post that references many other posts (i.e. has many out-links) is assumed to contain fewer novel ideas, so novelty is taken to be negatively correlated with the number of out-links ($\theta$).
Eloquence: More eloquent posts are assumed to be more influential [1]. The authors use the length of the blog post ($\lambda$) as a measure of eloquence.
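The four properties can be combined into a single score in many ways. The sketch below is a deliberately simple illustration and not the paper's scoring model: it assumes a linear combination with hypothetical weights (`w_in`, `w_com`, `w_out`) scaled by a length-based factor, and the `Post` fields are names invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    in_links: int   # recognition: posts that reference this one
    comments: int   # activity generation
    out_links: int  # novelty proxy: more out-links means less novel
    length: int     # eloquence proxy: post length in words


def toy_influence(post, w_in=1.0, w_com=1.0, w_out=1.0):
    """Hypothetical score: reward in-links and comments, penalize out-links,
    then scale by a slowly growing function of the post length."""
    raw = w_in * post.in_links + w_com * post.comments - w_out * post.out_links
    return math.log1p(post.length) * raw
```

With equal weights, a post with many in-links and comments and few out-links outranks a post of the same length that mostly links to others.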
The Math behind this
Reader Measure
Given the full set of comments to a blog, the authors construct a directed reader graph $G_R = (V_R, E_R)$. Each node $r_a \in V_R$ is a reader, and an edge $e_R(r_b, r_a) \in E_R$ exists if $r_b$ mentions $r_a$ in one of $r_b$'s comments. The weight on an edge, $W_R(r_b, r_a)$, is the ratio of the number of times $r_b$ mentions $r_a$ to the number of times $r_b$ mentions any reader (including $r_a$). The authors compute the reader authority $A(r_a)$ using a ranking algorithm, shown in Equation 1, where $|R|$ denotes the total number of readers of the blog and $d$ is the damping factor.
The reader measure of a word $w_k$, denoted by $RM(w_k)$, is given in Equation 2, where $tf(w_k, c_i)$ is the term frequency of word $w_k$ in comment $c_i$.
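A minimal sketch of the reader measure follows. It assumes Equation 1 has the PageRank-like form $A(r_a) = d \cdot \frac{1}{|R|} + (1 - d) \sum_{r_b} W_R(r_b, r_a) \cdot A(r_b)$ and Equation 2 the form $RM(w_k) = \sum_{c_i \ni w_k} tf(w_k, c_i) \cdot A(r_a)$, with $r_a$ the author of $c_i$. Both are reconstructions from the description above, not quotations. The function names, the input layout, the default `d=0.15`, and the fixed iteration count are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def reader_authority(mentions, readers, d=0.15, iters=50):
    """mentions: list of (r_b, r_a) pairs, one pair per time r_b mentions r_a.
    Every reader appearing in mentions must also appear in readers."""
    pair_counts = Counter(mentions)
    out_total = Counter(r_b for r_b, _ in mentions)
    # Edge weight: share of r_b's mentions that point at r_a.
    weight = {(b, a): n / out_total[b] for (b, a), n in pair_counts.items()}
    auth = {r: 1.0 / len(readers) for r in readers}
    for _ in range(iters):
        incoming = defaultdict(float)
        for (b, a), w in weight.items():
            incoming[a] += w * auth[b]
        auth = {r: d / len(readers) + (1 - d) * incoming[r] for r in readers}
    return auth


def reader_measure(comments, auth):
    """comments: list of (reader, word_list) pairs. Returns word -> RM."""
    rm = defaultdict(float)
    for reader, words in comments:
        for word, tf in Counter(words).items():
            rm[word] += tf * auth[reader]
    return dict(rm)
```

Words used by frequently mentioned readers end up with a higher reader measure than the same words used by readers nobody mentions.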
Quotation Measure
For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph $G_Q = (V_Q, E_Q)$. Each node $c_i \in V_Q$ is a comment, and an edge $(c_j, c_i) \in E_Q$ indicates that $c_j$ quoted sentences from $c_i$. The weight on an edge, $W_Q(c_j, c_i)$, is 1 over the number of comments that $c_j$ ever quoted. The authors derive the quotation degree $D(c_i)$ of a comment $c_i$ using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of $1/|C|$, where $|C|$ is the number of comments associated with the given post.
The quotation measure of a word $w_k$, denoted by $QM(w_k)$, is given in Equation 4, where word $w_k$ appears in comment $c_i$.
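The quotation measure can be sketched in a few lines. The code assumes Equation 3 reads $D(c_i) = \frac{1}{|C|} + \sum_{c_j} W_Q(c_j, c_i) \cdot D(c_j)$, summing over the comments $c_j$ that quote $c_i$, and Equation 4 reads $QM(w_k) = \sum_{c_i \ni w_k} tf(w_k, c_i) \cdot D(c_i)$. Both are reconstructions consistent with the description above. The sketch further assumes comments are indexed chronologically and a comment only quotes earlier ones, which is what makes the graph acyclic; the function names and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def quotation_degree(n_comments, quotes):
    """quotes: iterable of (j, i) pairs meaning comment j quoted comment i,
    with i < j (a comment can only quote an earlier one)."""
    quoted_by = defaultdict(list)
    n_quoted = Counter()  # how many distinct comments each comment quoted
    for j, i in set(quotes):
        quoted_by[i].append(j)
        n_quoted[j] += 1
    degree = [0.0] * n_comments
    # Walk from the latest comment back: everything quoting i has index > i,
    # so its degree is already final when comment i is processed.
    for i in reversed(range(n_comments)):
        degree[i] = 1.0 / n_comments + sum(
            degree[j] / n_quoted[j] for j in quoted_by[i]
        )
    return degree


def quotation_measure(comment_words, degree):
    """comment_words[i] is the word list of comment i. Returns word -> QM."""
    qm = defaultdict(float)
    for i, words in enumerate(comment_words):
        for word, tf in Counter(words).items():
            qm[word] += tf * degree[i]
    return dict(qm)
```

A comment that nobody quotes keeps the base degree of one over the number of comments, matching the statement above.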
Topic Measure
Given the set of comments associated with each blog post, the authors group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm presented in [1]. The authors conjecture that a hotly discussed topic has a large number of comments, all close to the topic cluster centroid. Thus they propose Equation 5 to compute the importance $T(t_u)$ of a topic cluster $t_u$, where $|c_i|$ is the length of comment $c_i$ in number of words, $C$ is the set of comments, and $sim(c_i, t_u)$ is the cosine similarity between comment $c_i$ and the centroid of topic cluster $t_u$.
Equation 6 defines the topic measure of a word $w_k$, denoted by $TM(w_k)$, where comment $c_i$ is clustered into topic cluster $t_u$.
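The following sketch illustrates the topic measure once cluster labels are available. It assumes Equation 5 has the form $T(t_u) = \frac{1}{\sum_{c_j \in C} |c_j|} \sum_{c_i \in t_u} |c_i| \cdot sim(c_i, t_u)$ and Equation 6 the form $TM(w_k) = \sum_{c_i \ni w_k,\, c_i \in t_u} tf(w_k, c_i) \cdot T(t_u)$, both reconstructed from the description above. The clustering step itself is not implemented: `labels` is taken as given, raw term counts stand in for whatever term weighting the authors used, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def topic_measure(comment_words, labels):
    """comment_words[i]: word list of comment i; labels[i]: its cluster id.
    Returns (cluster -> importance, word -> TM)."""
    vectors = [Counter(words) for words in comment_words]
    centroids = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        centroids[label].update(vec)  # summed counts; cosine ignores scale
    total_len = sum(len(words) for words in comment_words)
    importance = defaultdict(float)
    for words, vec, label in zip(comment_words, vectors, labels):
        importance[label] += len(words) * cosine(vec, centroids[label]) / total_len
    tm = defaultdict(float)
    for vec, label in zip(vectors, labels):
        for word, tf in vec.items():
            tm[word] += tf * importance[label]
    return dict(importance), dict(tm)
```

A cluster made of long comments that sit close to their centroid receives a large importance, which is then passed on to every word in those comments.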
Overall Word Representativeness or Importance Score
The representativeness score of a word $w_k$, denoted by $Rep(w_k)$, is the combination of the reader, quotation and topic measures in the ReQuT model. The three measures are first normalized independently by their respective maximum values and then combined linearly to derive $Rep(w_k)$ using Equation 7. In this equation, $\alpha$, $\beta$ and $\gamma$ are the coefficients ($0 \le \alpha, \beta, \gamma \le 1.0$ and $\alpha + \beta + \gamma = 1.0$).
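The combination step can be illustrated as follows, assuming Equation 7 is the weighted sum $Rep(w_k) = \alpha \cdot RM(w_k) + \beta \cdot QM(w_k) + \gamma \cdot TM(w_k)$ over max-normalized measures. The function name, the dictionary inputs, and the equal default coefficients are choices made for this sketch only.

```python
def representativeness(rm, qm, tm, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """rm, qm, tm: word -> raw measure. Returns word -> combined score."""
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")

    def normalize(measure):
        # Divide by the maximum so each measure lies in [0, 1].
        top = max(measure.values(), default=0.0)
        return {w: (v / top if top else 0.0) for w, v in measure.items()}

    rm, qm, tm = normalize(rm), normalize(qm), normalize(tm)
    words = set(rm) | set(qm) | set(tm)
    return {
        w: alpha * rm.get(w, 0.0) + beta * qm.get(w, 0.0) + gamma * tm.get(w, 0.0)
        for w in words
    }
```

A word missing from one of the measures is treated as contributing zero for that measure, which is a choice of this sketch.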
Sentence Selection Criteria
Density Based Selection: Based on the representativeness scores of keywords and the distance between adjacent keywords in a sentence. In Equation 8, $K$ is the total number of keywords contained in the $i$-th sentence $s_i$, $Rep(w_j)$ is the representativeness score of keyword $w_j$, and $dis(w_j, w_{j+1})$ is the number of non-keywords (including stopwords) between the two adjacent keywords $w_j$ and $w_{j+1}$ in $s_i$.
Summation Based Selection: Based on the number of keywords contained in a sentence. In Equation 9, $|s_i|$ is the length of sentence $s_i$ in number of words (including stopwords), and $\tau$ ($\tau > 0$) is a parameter to flexibly control the contribution of a word's representativeness score.
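Both selection criteria can be sketched as sentence-scoring functions. The code assumes Equation 8 has the form $Rep(s_i) = \frac{1}{K(K+1)} \sum_{j=1}^{K-1} \frac{Rep(w_j) \cdot Rep(w_{j+1})}{dis(w_j, w_{j+1})^2}$ and Equation 9 the form $Rep(s_i) = \frac{1}{|s_i|} \left( \sum_{w_k \in s_i} Rep(w_k) \right)^{\tau}$. These are reconstructions from the description above and may differ in detail from the original equations. One deliberate deviation: the sketch adds 1 to the non-keyword count so that directly adjacent keywords do not cause a division by zero. Function names and the keyword dictionary are hypothetical.

```python
def density_score(words, rep):
    """words: tokenized sentence; rep: keyword -> representativeness score.
    Any word present in rep counts as a keyword."""
    positions = [i for i, w in enumerate(words) if w in rep]
    k = len(positions)
    if k < 2:
        return 0.0
    total = 0.0
    for p, q in zip(positions, positions[1:]):
        gap = q - p - 1  # non-keywords between two adjacent keywords
        # Assumption of this sketch: use gap + 1 to avoid dividing by zero.
        total += rep[words[p]] * rep[words[q]] / (gap + 1) ** 2
    return total / (k * (k + 1))


def summation_score(words, rep, tau=1.0):
    """Sum of keyword scores raised to tau, divided by sentence length."""
    if not words:
        return 0.0
    return sum(rep.get(w, 0.0) for w in words) ** tau / len(words)
```

Under the density criterion, keywords packed closely together score higher than the same keywords spread apart; under the summation criterion, larger values of `tau` increase the influence of the keyword scores relative to the length penalty.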
Results
Two metrics were used: R-Precision and NDCG. NDCG is described in [2].
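For readers unfamiliar with the two metrics, a small sketch follows. R-Precision is the precision within the top R results, where R is the number of relevant items. The NDCG code uses the widely used $\log_2(rank + 1)$ discount, which is an assumption here and may not match the exact formulation in [2]; function names and the input layout are hypothetical.

```python
import math


def r_precision(ranked, relevant):
    """Fraction of the top-R ranked items that are relevant, R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked, gain):
    """gain: item -> graded relevance; items missing from gain count as 0."""
    actual = dcg([gain.get(item, 0.0) for item in ranked])
    ideal = dcg(sorted(gain.values(), reverse=True)[: len(ranked)])
    return actual / ideal if ideal else 0.0
```

A ranking that lists items in decreasing order of graded relevance gets an NDCG of 1, and any other ordering of the same items scores lower.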
References
[1] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic text message streams. In Proc. of SIGIR ’06, pages 35–42, Seattle, Washington, 2006.
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR ’00, pages 41–48, Athens, Greece, 2000.