Difference between revisions of "Identifying influential bloggers: WSDM 2008"

 
Eloquence: More eloquent posts are more influential [1]. Authors use the length of the blog post (<math>\lambda</math>) as a measure of eloquence.
 
  
==The Math behind this==
====Reader Measure====
Given the full set of comments to a blog, the authors construct a directed reader graph <math>G_R := (V_R, E_R)</math>. Each node <math>r_a \in V_R</math> is a reader, and an edge <math>e_R(r_b, r_a) \in E_R</math> exists if <math>r_b</math> mentions <math>r_a</math> in one of <math>r_b</math>’s comments. The weight on an edge, <math>W_R(r_b, r_a)</math>, is the ratio of the number of times <math>r_b</math> mentions <math>r_a</math> to the number of times <math>r_b</math> mentions any reader (including <math>r_a</math>). The authors compute reader authority using a ranking algorithm, shown in Equation 1, where <math>|R|</math> denotes the total number of readers of the blog, and <math>d</math> is the damping factor.

<math>A(r_a) = d \cdot \frac{1}{|R|} + (1-d) \sum_{r_b} W_R(r_b, r_a) \cdot A(r_b) \qquad (1)</math><br>
<math>RM(w_k) = \sum_{c_i} tf(w_k, c_i) \cdot A(r_a) \qquad (2)</math><br>
The reader measure of a word <math>w_k</math>, denoted by <math>RM(w_k)</math>, is given in Equation 2, where <math>tf(w_k, c_i)</math> is the term frequency of word <math>w_k</math> in comment <math>c_i</math>.
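The reader authority in Equation 1 can be approximated by repeated substitution, and Equation 2 then weights each word occurrence by an authority value. The following is a minimal sketch rather than the authors' implementation: the function names, the default <math>d = 0.15</math>, the fixed iteration count, and the reading of <math>r_a</math> in Equation 2 as the author of comment <math>c_i</math> are all assumptions made for illustration.

```python
from collections import defaultdict


def reader_authority(mentions, d=0.15, iterations=50):
    """Iterate Equation 1. mentions[b][a] = times reader b mentions reader a.

    The default d and the fixed iteration count are assumptions.
    """
    readers = set(mentions)
    for targets in mentions.values():
        readers.update(targets)
    n = len(readers)

    # W_R(b, a): share of b's mentions that are directed at a.
    weights = {}
    for b, targets in mentions.items():
        total = sum(targets.values())
        if total == 0:
            continue
        for a, count in targets.items():
            weights[(b, a)] = count / total

    authority = {r: 1.0 / n for r in readers}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for (b, a), w in weights.items():
            incoming[a] += w * authority[b]
        authority = {r: d / n + (1 - d) * incoming[r] for r in readers}
    return authority


def reader_measure(comments, authority):
    """Equation 2, assuming r_a is the author of comment c_i.

    comments is a list of (reader, tokens) pairs; each token occurrence
    adds the author's authority, which equals tf * A summed over comments.
    """
    rm = defaultdict(float)
    for reader, tokens in comments:
        for token in tokens:
            rm[token] += authority.get(reader, 0.0)
    return dict(rm)
```

A reader who is never mentioned keeps only the uniform share <math>d/|R|</math>, so in this sketch authority accumulates with readers who are mentioned by others.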
====Quotation Measure====
For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph <math>G_Q := (V_Q, E_Q)</math>. Each node <math>c_i \in V_Q</math> is a comment, and an edge <math>(c_j, c_i) \in E_Q</math> indicates that <math>c_j</math> quoted sentences from <math>c_i</math>. The weight on an edge, <math>W_Q(c_j, c_i)</math>, is the reciprocal of the number of comments that <math>c_j</math> quoted. The authors derive the quotation degree <math>D(c_i)</math> of a comment <math>c_i</math> using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of <math>1/|C|</math>, where <math>|C|</math> is the number of comments associated with the given post.

<math>D(c_i) = \frac{1}{|C|} + \sum_{c_j} W_Q(c_j, c_i) \cdot D(c_j) \qquad (3)</math><br>

<math>QM(w_k) = \sum_{c_i} tf(w_k, c_i) \cdot D(c_i) \qquad (4)</math><br>

The quotation measure of a word <math>w_k</math>, denoted by <math>QM(w_k)</math>, is given in Equation 4, where word <math>w_k</math> appears in comment <math>c_i</math>.
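Because the quotation graph is acyclic, Equation 3 can be evaluated by a memoized recursion over the comments that quote a given comment. The sketch below is illustrative only; the function name and the input layout (a dictionary from a comment id to the set of comment ids it quoted) are assumptions, and the recursion would not terminate on a cyclic graph.

```python
def quotation_degrees(comment_ids, quotes):
    """Evaluate Equation 3 on an acyclic quotation graph.

    quotes[j] is the set of comment ids that comment j quoted
    (a hypothetical input layout chosen for this sketch).
    """
    n = len(comment_ids)

    # Invert the graph: for each comment, which comments quote it?
    quoted_by = {c: [] for c in comment_ids}
    for j, targets in quotes.items():
        for i in targets:
            quoted_by[i].append(j)

    degree = {}

    def visit(i):
        if i not in degree:
            # W_Q(c_j, c_i) = 1 / (number of comments c_j quoted)
            degree[i] = 1.0 / n + sum(
                visit(j) / len(quotes[j]) for j in quoted_by[i]
            )
        return degree[i]

    for c in comment_ids:
        visit(c)
    return degree
```

For example, with three comments where comment 2 quotes comment 1 and comment 3 quotes both 1 and 2, the unquoted comment 3 gets <math>1/3</math>, comment 2 gets <math>1/3 + \tfrac{1}{2} \cdot \tfrac{1}{3} = 1/2</math>, and comment 1 gets <math>1/3 + 1/2 + 1/6 = 1</math>.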
 
 
 
====Topic Measure====
 
Given the set of comments associated with each blog post, the authors group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm presented in [1]. The authors conjecture that a hotly discussed topic has a large number of comments all close to the topic cluster centroid. Thus they propose Equation 5 to compute the importance of a topic cluster, where <math>|c_i|</math> is the length of comment <math>c_i</math> in number of words, <math>C</math> is the set of comments, and <math>sim(c_i, t_u)</math> is the cosine similarity between comment <math>c_i</math> and the centroid of topic cluster <math>t_u</math>.
 
 
 
<math>T(t_u) = \frac{1}{\sum_{c_j \in C} |c_j|} \sum_{c_i \in t_u} |c_i| \cdot sim(c_i, t_u) \qquad (5)</math><br>

<math>TM(w_k) = \sum_{c_i} tf(w_k, c_i) \cdot T(t_u) \qquad (6)</math><br>
 
 
 
Equation 6 defines the topic measure of a word <math>w_k</math>, denoted by <math>TM(w_k)</math>. Comment <math>c_i</math> is clustered into topic cluster <math>t_u</math>.
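Equations 5 and 6 can be sketched as follows once the comments have already been grouped into clusters. The clustering step itself is not shown. Representing comments as raw term-frequency vectors, taking the centroid as the sum of the member vectors (cosine similarity ignores scale, so this matches the mean), and restricting the second sum in Equation 5 to the comments in the cluster are assumptions of this sketch, as are the function names.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_importance(clusters):
    """Equation 5. clusters maps a topic id to a list of tokenized comments."""
    total_length = sum(len(c) for cs in clusters.values() for c in cs)
    importance = {}
    for topic, comments in clusters.items():
        vectors = [Counter(c) for c in comments]
        centroid = Counter()
        for vec in vectors:
            centroid.update(vec)  # sum of vectors; same direction as the mean
        weighted = sum(
            len(c) * cosine(vec, centroid) for c, vec in zip(comments, vectors)
        )
        importance[topic] = weighted / total_length
    return importance


def topic_measure(clusters, importance):
    """Equation 6: each word occurrence adds the importance of its cluster."""
    tm = defaultdict(float)
    for topic, comments in clusters.items():
        for comment in comments:
            for token in comment:
                tm[token] += importance[topic]
    return dict(tm)
```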
 
 
 
====Overall Word Representativeness or Importance Score====
 
The representativeness score of a word, <math>Rep(w_k)</math>, combines the reader, quotation and topic measures in the ReQuT model. The three measures are first normalized independently by their corresponding maximum values and then combined linearly to derive <math>Rep(w_k)</math> using Equation 7. In this equation <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> are the coefficients (<math>0 \le \alpha, \beta, \gamma \le 1.0</math> and <math>\alpha + \beta + \gamma = 1.0</math>).
 
 
 
<math>Rep(w_k) = \alpha \cdot RM(w_k) + \beta \cdot QM(w_k) + \gamma \cdot TM(w_k) \qquad (7)</math>
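The normalization and linear combination in Equation 7 can be illustrated as below. This is a sketch under assumptions: the function name and the default coefficient values are hypothetical, and a word missing from one measure is treated as scoring zero on it.

```python
def representativeness(rm, qm, tm, alpha=0.4, beta=0.3, gamma=0.3):
    """Equation 7 with max-normalization of each measure.

    The default coefficients are placeholders, not values from the paper.
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1.0")

    def normalize(measure):
        top = max(measure.values(), default=0.0)
        return {w: (v / top if top else 0.0) for w, v in measure.items()}

    rm, qm, tm = normalize(rm), normalize(qm), normalize(tm)
    words = set(rm) | set(qm) | set(tm)
    return {
        w: alpha * rm.get(w, 0.0) + beta * qm.get(w, 0.0) + gamma * tm.get(w, 0.0)
        for w in words
    }
```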
 
 
 
==Sentence Selection Criteria==
 
Density Based Selection: Based on the representativeness scores of keywords and the distance between adjacent keywords in a sentence. In Equation 8, <math>K</math> is the total number of keywords contained in the <math>i</math>-th sentence <math>s_i</math>, <math>Score(w_j)</math> is the representativeness score of keyword <math>w_j</math>, and <math>distance(w_j, w_{j+1})</math> is the number of non-keywords (including stopwords) between the two adjacent keywords <math>w_j</math> and <math>w_{j+1}</math> in <math>s_i</math>.
 
 
 
<math>Score(s_i) = \frac{1}{K (K + 1)} \sum_{j=1}^{K-1} \frac{Score(w_j) \cdot Score(w_{j+1})}{distance(w_j, w_{j+1})^2} \qquad (8)</math>
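A small sketch of the density-based score in Equation 8 follows. It reads the leading factor as <math>1/(K(K+1))</math>, treats every token that has a score as a keyword, and clamps the distance to 1 when two keywords are directly adjacent so that the division is defined. All three choices, and the function name, are assumptions of this sketch rather than details confirmed by the text above.

```python
def density_score(tokens, score):
    """Density-based sentence score (Equation 8).

    tokens: the sentence as a list of words.
    score:  keyword -> representativeness score; any token present in
            this mapping is treated as a keyword (an assumption).
    """
    positions = [i for i, t in enumerate(tokens) if t in score]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to score

    total = 0.0
    for p, q in zip(positions, positions[1:]):
        gap = q - p - 1         # non-keywords between the two keywords
        distance = max(gap, 1)  # assumption: avoid dividing by zero
        total += score[tokens[p]] * score[tokens[q]] / distance ** 2
    return total / (k * (k + 1))
```

Sentences whose keywords sit close together score higher than sentences with the same keywords spread far apart, because each pair is divided by the squared distance.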
 
 
 
Summation Based Selection: Based on the number of keywords contained in a sentence. In Equation 9, <math>|s_i|</math> is the length of sentence <math>s_i</math> in number of words (including stopwords), and <math>\tau</math> (<math>\tau > 0</math>) is a parameter that controls the contribution of a word’s representativeness score.
 
 
 
<math>Rep(s_i) = \frac{1}{|s_i|} \left( \sum_{w_k \in s_i} Rep(w_k)^{\tau} \right)^{1/\tau} \qquad (9)</math>
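Equation 9 can be sketched in a few lines. The function name is hypothetical, and words without a representativeness score are assumed to contribute nothing to the sum.

```python
def summation_score(tokens, rep, tau=1.0):
    """Summation-based sentence score (Equation 9).

    tokens: the sentence as a list of words (stopwords included).
    rep:    word -> representativeness score; unscored words add nothing.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not tokens:
        return 0.0
    total = sum(rep[t] ** tau for t in tokens if t in rep)
    return total ** (1.0 / tau) / len(tokens)
```

With <math>\tau = 1</math> the score is the sum of the word scores divided by the sentence length. Larger values of <math>\tau</math> let the highest-scoring words dominate the sum.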
 
  
 
==Results==
 

Revision as of 14:58, 31 March 2011

Citation

Nitin Agarwal, Huan Liu, Lei Tang, Philip S. Yu, "Identifying the Influential Bloggers in a Community", Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008.

Online version

Available at CiteSeer

Summary

This paper aims to identify the most influential bloggers in a blogging community. The paper first proposes metrics for assessing how influential a blog post is. The authors then run experiments on blogs from a few blog sites and evaluate the results qualitatively.


What makes a Blog influential

Recognition: An influential blog post is recognized by many, which can be judged by the number of in-links (<math>\iota</math>), i.e. the number of other posts referencing the particular post.
Activity Generation: A blog post that generates more activity is supposedly more influential. This is measured by the number of comments made on the blog post (<math>\gamma</math>).
Novelty: Novel ideas are supposed to be more influential [1]. A post that references many other posts (i.e. has more out-links) is supposed to contain fewer novel ideas, so novelty is taken to be negatively correlated with the number of out-links (<math>\theta</math>).
Eloquence: More eloquent posts are more influential [1]. The authors use the length of the blog post (<math>\lambda</math>) as a measure of eloquence.

Measuring Influence

The authors define a concept called InfluenceFlow. They conjecture that the flow of influence among blog posts can be modeled as a graph. For a post <math>p</math> with <math>\iota</math> in-links and <math>\theta</math> out-links, InfluenceFlow is defined as:

<math>InfluenceFlow(p) = w_{in} \sum_{m=1}^{\iota} I(p_m) - w_{out} \sum_{n=1}^{\theta} I(p_n)</math>

where <math>w_{in}</math> and <math>w_{out}</math> are weights that can be adjusted for incoming and outgoing influence, <math>p_m</math> denotes a blog post that links to the post <math>p</math>, <math>p_n</math> denotes a post to which <math>p</math> links, and <math>I(p_x)</math> is the influence score of the post <math>p_x</math>. Unfortunately, the paper does not say how the I score is computed from the four parameters discussed above.

The authors further define the influence I of a post in terms of InfluenceFlow, which makes the definition circular, since the I score is already used in defining InfluenceFlow:

<math>I(p) \propto w_{com} \gamma_p + InfluenceFlow(p)</math>

where <math>\gamma_p</math> is the number of comments made on the post <math>p</math>, and <math>w_{com}</math> is a regulating coefficient.

For the constant of proportionality, the authors use a measure of the quality of the blog post. This measure is quite naive: it is simply a function of the length of the post, <math>w(\lambda)</math>. So

<math>I(p) = w(\lambda) \times (w_{com} \gamma_p + InfluenceFlow(p))</math>

The authors further define iIndex(B) for a blogger B as <math>\max(I(p_i))</math>, where <math>I(p_i)</math> is the influence score of a post <math>p_i</math> made by blogger B. The higher a blogger's iIndex, the more influential the blogger is considered.
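Since I(p) and InfluenceFlow(p) are defined in terms of each other, one possible reading is to treat the two equations as a fixed-point problem and iterate from an initial score of zero. The sketch below does this. It is not the procedure from the paper: the function names, the default weights, the fixed iteration count and the input layout (a per-post comment count and a precomputed quality value standing in for <math>w(\lambda)</math>) are all assumptions, and the iteration is not guaranteed to converge for every choice of weights.

```python
def influence_scores(posts, links, w_in=0.5, w_out=0.5, w_com=1.0,
                     iterations=100):
    """Iterate I(p) = w(lambda) * (w_com * gamma_p + InfluenceFlow(p)).

    posts: post id -> {"comments": gamma_p, "quality": w(lambda)}
    links: iterable of (source, target) pairs; source links to target.
    All defaults are placeholders chosen for this sketch.
    """
    in_links = {p: [] for p in posts}
    out_links = {p: [] for p in posts}
    for src, dst in links:
        out_links[src].append(dst)
        in_links[dst].append(src)

    score = {p: 0.0 for p in posts}
    for _ in range(iterations):
        updated = {}
        for p, info in posts.items():
            flow = (w_in * sum(score[m] for m in in_links[p])
                    - w_out * sum(score[n] for n in out_links[p]))
            updated[p] = info["quality"] * (w_com * info["comments"] + flow)
        score = updated
    return score


def i_index(score, author_of):
    """iIndex(B): the maximum influence score over blogger B's posts."""
    best = {}
    for post, value in score.items():
        blogger = author_of[post]
        best[blogger] = max(best.get(blogger, value), value)
    return best
```

In a two-post example where p2 links to p1, p1 has two comments, p2 has one, and both quality values are 1, the iteration with the default weights settles at I(p1) = 2 and I(p2) = 0: p1 gains from its in-link while p2 is penalized for its out-link.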

Results

Two metrics were used: R-Precision and NDCG. NDCG is described in [2].
Results.jpg

References

[1] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic text message streams. In Proc. of SIGIR ’06, pages 35–42, Seattle, Washington, 2006.
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR ’00, pages 41–48, Athens, Greece, 2000.