Difference between revisions of "Identifying influential bloggers: WSDM 2008"

Revision as of 15:58, 31 March 2011

Citation

Nitin Agarwal, Huan Liu, Lei Tang, Philip S. Yu, "Identifying the Influential Bloggers in a Community", Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008.

Online version

Available at CiteSeer

Summary

This paper aims to identify the most influential bloggers in a blogging community. It first proposes a set of metrics for assessing how influential a blog post is. The authors then run experiments on blogs from a few blog sites and evaluate the results qualitatively.

What makes a Blog influential

Recognition: An influential blog post is recognized by many, which can be judged by the number of in-links ( $\iota$ ), i.e. the number of other posts referencing the particular post.
Activity Generation: A blog post that generates more activity is supposedly more influential. This is measured by the number of comments made on the blog post ( $\gamma$ ).
Novelty: Novel ideas are supposed to be more influential [1]. A post that references many other posts (i.e. has many out-links) is assumed to contain fewer novel ideas, so novelty is taken to be negatively correlated with the number of out-links ( $\theta$ ).
Eloquence: More eloquent posts are more influential [1]. The authors use the length of the blog post ( $\lambda$ ) as a measure of eloquence.
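
The four statistics above can be gathered from a collection of posts in a single pass. The following is a minimal sketch, not the authors' code: the `Post` class, its field names, and the choice to measure length in whitespace-separated words are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    # Hypothetical record for one blog post; field names are illustrative.
    post_id: str
    author: str
    text: str
    out_links: list = field(default_factory=list)  # ids of posts this post references
    num_comments: int = 0


def post_statistics(posts):
    """Return {post_id: (iota, gamma, theta, lam)} for a list of Post objects.

    iota  = number of in-links, gamma = number of comments,
    theta = number of out-links, lam  = length in words (an assumption).
    """
    in_counts = {p.post_id: 0 for p in posts}
    for p in posts:
        for target in p.out_links:
            # Only count links that point at posts inside the collection.
            if target in in_counts:
                in_counts[target] += 1
    return {
        p.post_id: (
            in_counts[p.post_id],
            p.num_comments,
            len(p.out_links),
            len(p.text.split()),
        )
        for p in posts
    }
```

In-links are derived from the out-link lists, so only links between posts in the collection are counted.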

Measuring Influence

The authors define a concept called InfluenceFlow, conjecturing that the flow of influence between blog posts can be modeled as a graph. For a post $p$ with $\iota$ in-links and $\theta$ out-links, the InfluenceFlow is defined as:

$$InfluenceFlow(p) = w_{in} \sum_{m=1}^{\iota} I(p_m) - w_{out} \sum_{n=1}^{\theta} I(p_n)$$

where $w_{in}$ and $w_{out}$ are adjustable weights for incoming and outgoing influence; $p_m$ denotes a blog post that links to the post $p$, and $p_n$ denotes a post to which $p$ links; $I(p_x)$ is the influence score of the post $p_x$. Unfortunately, the paper does not say how the $I$ score is computed from the four parameters discussed above.

The authors then define the influence $I$ of a post in terms of its InfluenceFlow, which looks circular, since the $I$ score was already used to define InfluenceFlow:

$$I(p) \propto w_{com} \gamma_p + InfluenceFlow(p)$$

where $\gamma_p$ is the number of comments made on the post $p$, and $w_{com}$ is a regulating coefficient.

For the constant of proportionality, the authors use a measure of the quality of the blog post. This measure is quite naive, however: it is simply a function of the length of the blog post, $w(\lambda)$. So

$$I(p) = w(\lambda) \times (w_{com} \gamma_p + InfluenceFlow(p))$$

The authors further define $iIndex(B)$ for a blogger $B$ as $\max_i I(p_i)$, where $I(p_i)$ is the influence score of a post $p_i$ made by blogger $B$. The higher a blogger's iIndex, the more influential that blogger is considered.
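
Because I(p) depends on the I scores of linked posts, one way to make the definitions above computable is to treat them as a fixed-point problem and iterate from zero. The sketch below does this. It is an illustration under stated assumptions, not the paper's procedure: the iteration scheme, the default weights, the constant `length_weight` (standing in for w(λ), whose form this summary does not give), and the dictionary layout of `posts` are all hypothetical. The iteration is not guaranteed to converge for arbitrary weights and link graphs.

```python
def influence_scores(posts, w_in=0.5, w_out=0.5, w_com=1.0,
                     length_weight=lambda lam: 1.0, iterations=50):
    """Iteratively estimate I(p) for every post.

    posts maps post_id -> {"author", "out_links", "comments", "length"}.
    length_weight stands in for w(lambda); the constant default is an assumption.
    """
    # Derive in-links from the out-link lists.
    in_links = {pid: [] for pid in posts}
    for pid, post in posts.items():
        for target in post["out_links"]:
            if target in in_links:
                in_links[target].append(pid)

    scores = {pid: 0.0 for pid in posts}
    for _ in range(iterations):
        updated = {}
        for pid, post in posts.items():
            # InfluenceFlow(p) = w_in * sum I(p_m) - w_out * sum I(p_n)
            flow = (
                w_in * sum(scores[m] for m in in_links[pid])
                - w_out * sum(scores[n] for n in post["out_links"] if n in scores)
            )
            # I(p) = w(lambda) * (w_com * gamma_p + InfluenceFlow(p))
            updated[pid] = length_weight(post["length"]) * (
                w_com * post["comments"] + flow
            )
        scores = updated
    return scores


def i_index(posts, scores):
    """iIndex(B): the maximum influence score over each blogger's posts."""
    best = {}
    for pid, post in posts.items():
        author = post["author"]
        best[author] = max(best.get(author, float("-inf")), scores[pid])
    return best
```

For two posts where post b links to post a, the update is a contraction at these weights, so the scores settle quickly; larger weights or denser link graphs may need damping or normalization, which this sketch leaves out.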

Results

Two metrics were used: R-Precision and NDCG. NDCG is described in [2].
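
As a rough illustration of the two metrics, the sketch below computes R-Precision (precision within the top R results, where R is the number of relevant items) and NDCG. The NDCG here uses the common log2(rank + 1) discount, which is an assumption: the exact discounting in [2] and in the paper's experiments may differ. Function and parameter names are hypothetical.

```python
import math


def r_precision(ranked, relevant):
    """Fraction of the top-R ranked items that are relevant, with R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked, gain_by_item, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering, over the top k."""
    k = len(ranked) if k is None else k
    gains = [gain_by_item.get(item, 0.0) for item in ranked[:k]]
    ideal = sorted(gain_by_item.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places items in decreasing order of gain scores an NDCG of 1.0; any other ordering of items with distinct gains scores lower.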

References

[1] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic text message streams. In Proc. of SIGIR ’06, pages 35–42, Seattle, Washington, 2006.
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR ’00, pages 41–48, Athens, Greece, 2000.

@@ Line 17: / Line 17: @@
 Eloquence: More eloquent posts are more influential [1]. Authors use the length of the blog post (<math>\lambda</math>) as a measure of eloquence.
-==The Math behind this==
+==Measuring Influence==
-====Reader Measure====
+The authors define a concept called InfluenceFlow. They conjecture that blog-influence flow can be thought of as a graph. For a post p having no. of in-links <math>\iota</math> and no. of out-links <math>\theta</math>, the InfluenceFlow is defined as:
-Given the full set of comments to a blog, the authors construct a directed reader graph <math>G_R :=(V_R, E_R)</math>. Each node <math>r_a \epsilon V_R</math> is a reader, and an edge <math>e_R(r_b, r_a) \epsilon E_R</math> exists if <math>r_b</math> mentions <math>r_a</math> in one of <math>r_b</math>’s comments. The weight on an edge, <math>W_R(r_b, r_a)</math>, is the ratio between the number of times <math>r_b</math> mentions <math>r_a</math> against all times <math>r_b</math> mentions other readers (including <math>r_a</math>). The authors compute reader authority using a ranking algorithm, shown in Equation 1, where <math>|R|</math> denotes the total number of readers of the blog, and d is the damping factor.
+<math>InfluenceFlow(p) = w_{in} \sum_{m=1}^{\iota} I(p_m) - w_{out} \sum_{n=1}^{\theta} I(p_n)</math><br>
+Where win and wout are the weights that can be adjusted for incoming and outgoing influences; pm denotes the blog post that links to the post p, and pn denotes the post to which the post p links; I(px) is the influence score of the post px. Note that unfortunately the paper doesn’t mention how I score is computed from the four parameters discussed above.
-<math>
+Authors further define the influence I for a post in terms of the InfluenceFlow, which looks weird, since they’ve already used I score in defining InfluenceFlow.
-A(r_a) = d*1/|R| + (1-d) \Sigma W_R(r_b, r_a) * A(r_b)............(1)</math><br>
+I(p) ∝ wcomγp + InfluenceFlow(p)
-<math>RM(w_k) = \Sigma tf(w_k, c_i) * A(r_a)...............................(2)</math><br>
+Where γp is the no. of comments made to the post p, and wcom is a regulating coefficient.
-The reader measure of a word <math>w_k</math>, denoted by <math>RM(w_k)</math>, is given in Equation 2, where <math>tf(w_k, c_i)</math>  is the term frequency of word <math>w_k</math> in comment <math>c_i</math>.
+For the constant of proportionality, authors use a measure of the quality of the blog. However, the measure used by authors is quite naive and is actually a function of the length of the blog post w(λ). So
+I(p) = w(λ)x (wcomγp + InfluenceFlow(p))
-====Quotation Measure====
+Authors further define iIndex(B) for a blogger B as max(I(pi)) where I(pi) is the influence score of a post made by blogger B. The higher the value of iIndex for any blogger, more influential they are considered.
-For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph <math>G_Q := (V_Q,E_Q)</math>. Each node <math>c_i \epsilon V_Q</math> is a comment, and an edge <math>(c_j, c_i) \epsilon E_Q</math> indicates <math>c_j</math> quoted sentences from <math>c_i</math>. The weight on an edge, <math>W_Q(c_j, c_i)</math>, is 1 over the number of comments that c_j ever quoted. The authors derive the quotation degree <math>D(c_i)</math> of a comment <math>c_i</math> using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of <math>1/|C|</math> where <math>|C|</math> is the number of comments associated with the given post.
-<math>D(c_i) = 1/|C| + \Sigma W_Q(c_j, c_i) * D(c_j)...........(3)</math><br>
-<math>Q_M(w_k) = \Sigma tf(w_k, c_i) * D(c_i)..................(4)</math><br>
-The quotation measure of a word <math>w_k</math>, denoted by <math>QM(w_k)</math>, is given in Equation 4. Word <math>w_k</math>
-appears in comment <math>c_i</math>.
-====Topic Measure====
-Given the set of comments associated with each blog post, the authors group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm presented in [1]. The authors conjecture that a hotly discussed topic has a large number of comments all close to the topic cluster centroid. Thus they propose Equation 5 to compute the importance of a topic cluster, where <math>|c_i|</math> is the length of comment <math>c_i</math> in number of words, <math>C</math> is the set of comments, and <math>sim(c_i, t_u)</math> is the cosine similarity between comment <math>c_i</math> and the centroid of topic cluster <math>t_u</math>.
-<math>T(t_u) = 1/ \Sigma |c_j|* \Sigma |c_i|*sim(c_i,t_u)......................(5)</math><br>
-<math>TM(w_k) = \Sigma tf(w_k, c_i)*T(t_u)......................................(6)</math><br>
-Equation 6 defines the topic measure of a word <math>w_k</math>, denoted by <math>TM(w_k)</math>. Comment <math>c_i</math> is clustered into topic cluster <math>t_u</math>.
-====Overall Word Representativeness or Importance Score====
-The representativeness score of a word <math>Rep(w_k)</math> is the combination of reader-, quotation- and topic- measures in
-ReQuT model. The three measures are first normalized independently based on their corresponding maximum values and then combined linearly to derive <math>Rep(w_k)</math> using Equation 7. In this equation <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> are the coefficients (0 ≤ <math>\alpha</math>, <math>\beta</math>, <math>\gamma</math> ≤ 1.0 and <math>\alpha</math> + <math>\beta</math> + <math>\gamma</math> = 1.0).
-<math>Rep(w_k) = \alpha * RM(w_k) + \beta * QM(w_k) + \gamma * TM(w_k).......................(7)</math>
-==Sentence Selection Criteria==
-Density Based Selection: Based on representativeness score of keywords and the distance between two keywords in a sentence. In equation 8, K is the total number of keywords contained in i^th sentence <math>s_i</math>, <math>Score(w_j)</math> is the representativeness score of keyword <math>w_j</math>, and <math>distance(w_j, w_j+1)</math> is the number of non-keywords (including stopwords) between the two adjacent keywords <math>w_j</math> and <math>w_j+1</math> in <math>s_i</math>.
-<math>Score(s_i) = 1/K * (K + 1) * \Sigma Score(w_j) * Score(w_{j+1})/distance(w_j,w_{j+1})^2............................(8)</math>
-Summation Based Selection: Based on the number of keywords contained in a sentence. In equation 9, <math>|s_i|</math> is the length of sentence <math>s_i</math> in number of words (including stopwords), and <math>tau</math> (<math>tau</math> > 0) is a parameter to flexibly control the contribution of a word’s representativeness score.
-<math>Rep(s_i) = 1/|s_i| * (\Sigma Rep(w_k)^\tau)^{1/\tau}................................(9)</math>
 ==Results==
