Blog summarization: CIKM 2007

From Cohen Courses

== Citation ==

Meishan Hu, Aixin Sun and Ee-Peng Lim, "Comments-oriented blog summarization by sentence extraction", Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), 2007.
 
== Online version ==

[http://www.cais.ntu.edu.sg/~meishan/publications/cikm307s-hu.pdf Download from NTU website]
== Summary ==

This [[Category::paper]] addresses [[AddressesProblem::blog summarization]]: it identifies important sentences of a blog post by analyzing the comments made on it, and then extracts those sentences to form a summary of the post. The authors use a Reader-Quotation-Topic (ReQuT) model to assign a representativeness score to each word appearing in the comments. "Significant" sentences are then selected from the post by one of two methods: density-based selection and summation-based selection. The authors employed human annotators to write reference summaries against which their method is evaluated; see the [[UsesDataset::Blog Summarization CIKM 2007 dataset]] page for details on the dataset. Below is a closer look at the ReQuT model and the two sentence-selection methods, along with the results of the experiments.
  
 
==Reader-Quotation-Topic (ReQuT) model==

Each word is given a reader measure, a quotation measure, and a topic measure. The motivation is that words written by "authoritative" readers, words appearing in comments that are quoted by other comments, and words relating to heavily discussed topics are more important than others. ReQuT scores are computed for each word, and the word's overall importance is a weighted sum of the three scores.

===The Math behind this===

====Reader Measure====

Given the full set of comments on a blog, the authors construct a directed reader graph <math>G_R := (V_R, E_R)</math>. Each node <math>r_a \in V_R</math> is a reader, and an edge <math>(r_b, r_a) \in E_R</math> exists if <math>r_b</math> mentions <math>r_a</math> in one of <math>r_b</math>'s comments. The weight on an edge, <math>W_R(r_b, r_a)</math>, is the ratio of the number of times <math>r_b</math> mentions <math>r_a</math> to the total number of times <math>r_b</math> mentions other readers (including <math>r_a</math>). The authors compute reader authority with an iterative ranking algorithm, shown in Equation 1, where <math>|R|</math> denotes the total number of readers of the blog and <math>d</math> is the damping factor.

<math>A(r_a) = d \cdot \frac{1}{|R|} + (1 - d) \sum_{(r_b, r_a) \in E_R} W_R(r_b, r_a) \cdot A(r_b) \quad\quad (1)</math><br>
<math>RM(w_k) = \sum_{w_k \in c_i} tf(w_k, c_i) \cdot A(r_a) \quad\quad (2)</math><br>

The reader measure of a word <math>w_k</math>, denoted by <math>RM(w_k)</math>, is given in Equation 2, where <math>tf(w_k, c_i)</math> is the term frequency of word <math>w_k</math> in comment <math>c_i</math> and <math>r_a</math> is the author of comment <math>c_i</math>.
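
To make this concrete, here is a minimal Python sketch of the authority iteration in Equation 1 and the word scoring in Equation 2. The data structures (a <code>mentions</code> map from each reader to the readers they mention, comments as dicts with <code>text</code> and <code>author</code> fields), the damping value, and the fixed iteration count are illustrative assumptions, not details taken from the paper.

<pre>
from collections import Counter

def reader_authority(readers, mentions, d=0.85, iters=50):
    """Fixed-point iteration of Equation 1:
    A(r_a) = d * 1/|R| + (1 - d) * sum_b W_R(r_b, r_a) * A(r_b)."""
    R = len(readers)
    # Edge weights W_R(r_b, r_a): fraction of r_b's mentions that point at r_a.
    W = {r: {} for r in readers}
    for r_b, mentioned in mentions.items():
        counts = Counter(mentioned)
        total = sum(counts.values())
        for r_a, n in counts.items():
            W[r_b][r_a] = n / total
    A = {r: 1.0 / R for r in readers}  # uniform initialization
    for _ in range(iters):
        A = {r_a: d / R + (1 - d) * sum(W[r_b].get(r_a, 0.0) * A[r_b]
                                        for r_b in readers)
             for r_a in readers}
    return A

def reader_measure(word, comments, A):
    """Equation 2: RM(w_k) = sum of tf(w_k, c_i) * A(author of c_i)
    over the comments c_i that contain the word."""
    return sum(c["text"].split().count(word) * A[c["author"]]
               for c in comments if word in c["text"].split())
</pre>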
 
====Quotation Measure====

For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph <math>G_Q := (V_Q, E_Q)</math>. Each node <math>c_i \in V_Q</math> is a comment, and an edge <math>(c_j, c_i) \in E_Q</math> indicates that <math>c_j</math> quoted sentences from <math>c_i</math>. The weight on an edge, <math>W_Q(c_j, c_i)</math>, is one over the number of comments that <math>c_j</math> ever quoted. The authors derive the quotation degree <math>D(c_i)</math> of a comment <math>c_i</math> using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of <math>1/|C|</math>, where <math>|C|</math> is the number of comments associated with the given post.

<math>D(c_i) = \frac{1}{|C|} + \sum_{(c_j, c_i) \in E_Q} W_Q(c_j, c_i) \cdot D(c_j) \quad\quad (3)</math><br>
<math>QM(w_k) = \sum_{w_k \in c_i} tf(w_k, c_i) \cdot D(c_i) \quad\quad (4)</math><br>

The quotation measure of a word <math>w_k</math>, denoted by <math>QM(w_k)</math>, is given in Equation 4, where word <math>w_k</math> appears in comment <math>c_i</math>.
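
Because the quotation graph is acyclic (comments only quote earlier comments), the quotation degree can be computed by memoized recursion. Below is a sketch under that assumption; the <code>quotes</code> map from each comment id to the list of comment ids it quoted is an assumed input format.

<pre>
from collections import defaultdict

def quotation_degrees(comment_ids, quotes):
    """Equation 3 on the quotation DAG. Edge (c_j, c_i) means c_j quoted
    c_i and carries weight 1 / (number of comments c_j quoted)."""
    C = len(comment_ids)
    quoted_by = defaultdict(list)  # c_i -> [(c_j, W_Q(c_j, c_i)), ...]
    for c_j, quoted in quotes.items():
        if not quoted:
            continue
        w = 1.0 / len(quoted)
        for c_i in quoted:
            quoted_by[c_i].append((c_j, w))
    memo = {}
    def D(c_i):
        # A comment quoted by nobody gets the base degree 1/|C|, as in the text.
        if c_i not in memo:
            memo[c_i] = 1.0 / C + sum(w * D(c_j) for c_j, w in quoted_by[c_i])
        return memo[c_i]
    return {c: D(c) for c in comment_ids}
</pre>

Equation 4 then scores a word exactly as Equation 2 does, with the quotation degree <math>D(c_i)</math> in place of the author authority <math>A(r_a)</math>.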

====Topic Measure====

Given the set of comments associated with each blog post, the authors group the comments into topic clusters using the single-pass incremental clustering algorithm presented in [1]. They conjecture that a hotly discussed topic has a large number of comments, all close to the topic cluster's centroid, and therefore propose Equation 5 to compute the importance of a topic cluster, where <math>|c_i|</math> is the length of comment <math>c_i</math> in words, <math>C</math> is the set of comments, and <math>sim(c_i, t_u)</math> is the cosine similarity between comment <math>c_i</math> and the centroid of topic cluster <math>t_u</math>.

<math>T(t_u) = \frac{1}{\sum_{c_j \in C} |c_j|} \sum_{c_i \in t_u} |c_i| \cdot sim(c_i, t_u) \quad\quad (5)</math><br>
<math>TM(w_k) = \sum_{w_k \in c_i} tf(w_k, c_i) \cdot T(t_u) \quad\quad (6)</math><br>

Equation 6 defines the topic measure of a word <math>w_k</math>, denoted by <math>TM(w_k)</math>, where comment <math>c_i</math> is clustered into topic cluster <math>t_u</math>.
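
A sketch of the clustering and the cluster weighting follows. The similarity threshold and bag-of-words tokenization are illustrative assumptions; the paper defers the clustering details to [1]. Since cosine similarity is scale-invariant, keeping the centroid as a running term-frequency sum behaves the same as the mean vector.

<pre>
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(n * v.get(t, 0) for t, n in u.items())
    nu = math.sqrt(sum(n * n for n in u.values()))
    nv = math.sqrt(sum(n * n for n in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(comments, threshold=0.3):
    """Single-pass incremental clustering in the spirit of [1]: assign each
    comment to the most similar existing cluster, or open a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [index]}
    for idx, text in enumerate(comments):
        vec = Counter(text.split())
        best = max(clusters, key=lambda cl: cosine(vec, cl["centroid"]),
                   default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(idx)
            best["centroid"] += vec  # running sum serves as the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
    return clusters

def topic_importance(clusters, comments):
    """Equation 5: length-weighted centroid similarity, normalized by the
    total length of all comments."""
    total_len = sum(len(c.split()) for c in comments)
    return [sum(len(comments[i].split()) *
                cosine(Counter(comments[i].split()), cl["centroid"])
                for i in cl["members"]) / total_len
            for cl in clusters]
</pre>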

====Overall Word Representativeness or Importance Score====

The representativeness score of a word, <math>Rep(w_k)</math>, combines the reader, quotation, and topic measures of the ReQuT model. The three measures are first normalized independently by their corresponding maximum values and then combined linearly using Equation 7, where <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> are coefficients with <math>0 \le \alpha, \beta, \gamma \le 1.0</math> and <math>\alpha + \beta + \gamma = 1.0</math>.

<math>Rep(w_k) = \alpha \cdot RM(w_k) + \beta \cdot QM(w_k) + \gamma \cdot TM(w_k) \quad\quad (7)</math>
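
A direct transcription of the normalization and combination step; the coefficient values shown are placeholders, as the paper sets <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> empirically.

<pre>
def representativeness(RM, QM, TM, alpha=0.4, beta=0.3, gamma=0.3):
    """Equation 7: normalize each measure by its maximum value, then
    combine linearly. RM, QM, TM map each word to its raw measure."""
    def norm(M):
        m = max(M.values())
        return {w: (v / m if m else 0.0) for w, v in M.items()}
    RMn, QMn, TMn = norm(RM), norm(QM), norm(TM)
    return {w: alpha * RMn[w] + beta * QMn[w] + gamma * TMn[w] for w in RM}
</pre>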

==Sentence Selection Criteria==

Density-based selection: based on the representativeness scores of keywords and the distance between adjacent keywords in a sentence. In Equation 8, <math>K</math> is the total number of keywords contained in the <math>i</math>-th sentence <math>s_i</math>, <math>Score(w_j)</math> is the representativeness score of keyword <math>w_j</math>, and <math>distance(w_j, w_{j+1})</math> is the number of non-keywords (including stopwords) between the two adjacent keywords <math>w_j</math> and <math>w_{j+1}</math> in <math>s_i</math>.

<math>Score(s_i) = \frac{1}{K(K+1)} \sum_{j=1}^{K-1} \frac{Score(w_j) \cdot Score(w_{j+1})}{distance(w_j, w_{j+1})^2} \quad\quad (8)</math>
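
A sketch of density-based scoring, reading the normalization in Equation 8 as <math>\frac{1}{K(K+1)}</math>. Clamping the keyword distance to at least 1 (for immediately adjacent keywords) is an assumption made here to avoid division by zero.

<pre>
def density_score(sentence_tokens, rep_scores, keywords):
    """Equation 8: reward sentences whose high-scoring keywords occur
    close together. rep_scores maps each keyword to its Rep score."""
    positions = [(pos, tok) for pos, tok in enumerate(sentence_tokens)
                 if tok in keywords]
    K = len(positions)
    if K < 2:
        return 0.0
    total = 0.0
    for (p1, w1), (p2, w2) in zip(positions, positions[1:]):
        dist = max(p2 - p1 - 1, 1)  # non-keywords between the adjacent pair
        total += rep_scores[w1] * rep_scores[w2] / dist ** 2
    return total / (K * (K + 1))
</pre>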

Summation-based selection: based on the number of keywords contained in a sentence. In Equation 9, <math>|s_i|</math> is the length of sentence <math>s_i</math> in words (including stopwords), and <math>\tau</math> (<math>\tau > 0</math>) is a parameter that flexibly controls the contribution of a word's representativeness score.

<math>Rep(s_i) = \frac{1}{|s_i|} \left( \sum_{w_k \in s_i} Rep(w_k)^\tau \right)^{1/\tau} \quad\quad (9)</math>
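
A sketch of summation-based scoring; the value of <math>\tau</math> is a placeholder, with larger <math>\tau</math> letting the highest-scoring keywords dominate the sum.

<pre>
def summation_score(sentence_tokens, rep_scores, tau=2.0):
    """Equation 9: length-normalized tau-norm of the word scores.
    Words without a score (e.g. stopwords) contribute zero."""
    if not sentence_tokens:
        return 0.0
    total = sum(rep_scores.get(tok, 0.0) ** tau for tok in sentence_tokens)
    return total ** (1.0 / tau) / len(sentence_tokens)
</pre>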

==Results==

Two metrics were used: R-Precision and NDCG; NDCG is described in [2].<br>
[[File:Results.jpg]]
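
For reference, minimal implementations of the two metrics under common definitions; the paper follows [2] for NDCG, so the exact gain and discount variants used here are assumptions.

<pre>
import math

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    R = len(relevant)
    return len(set(ranked[:R]) & set(relevant)) / R if R else 0.0

def ndcg(ranked, gains, k=None):
    """NDCG with the common log2 rank discount."""
    k = k or len(ranked)
    dcg = sum(gains.get(item, 0.0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
</pre>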

==References==

[1] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic text message streams. In Proc. of SIGIR '06, pages 35–42, Seattle, Washington, 2006.<br>
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR '00, pages 41–48, Athens, Greece, 2000.
