Blog summarization: CIKM 2007

Revision as of 19:56, 30 March 2011

Citation

Meishan Hu, Aixin Sun and Ee-Peng Lim, "Comments-oriented blog summarization by sentence extraction", Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM), 2007

Online version

[1]

Summary

This paper addresses blog summarization: it identifies the important sentences of a blog by analyzing the comments made on it, and then extracts these sentences to form a summary of the blog. The authors use a Reader-Quotation-Topic model to assign a representativeness score to the words in user comments. “Significant” sentences from the blog are then selected by one of two methods: Density-based Selection and Summation-based Selection. To evaluate their method, the authors had human annotators create reference summaries of the blogs. Please see the dataset page for information about the dataset. Below is a closer look at the Reader-Quotation-Topic model and the two sentence-selection methods, along with the results of the experiments.

Reader-Quotation-Topic (ReQuT) model

Each word is given a Reader, a Quotation and a Topic measure. The motivation is that words written by “authoritative” readers, words found in comments that are quoted in other comments, and words that relate to the most discussed topics are more important than others. Each word therefore receives the three ReQuT scores, and its overall importance is judged by a weighted sum of them.

The Math behind this

Reader Measure

Given the full set of comments to a blog, the authors construct a directed reader graph <math>G_R := (V_R, E_R)</math>. Each node <math>r_a \in V_R</math> is a reader, and an edge <math>e_R(r_b, r_a) \in E_R</math> exists if <math>r_b</math> mentions <math>r_a</math> in one of <math>r_b</math>’s comments. The weight on an edge, <math>W_R(r_b, r_a)</math>, is the ratio of the number of times <math>r_b</math> mentions <math>r_a</math> to the total number of times <math>r_b</math> mentions any reader (including <math>r_a</math>). The authors compute reader authority using a ranking algorithm, shown in Equation 1, where <math>|R|</math> denotes the total number of readers of the blog, and <math>d</math> is the damping factor.
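As a rough illustration of the edge weights described above, the following Python sketch computes <math>W_R</math> from a list of mention events. The input format (one `(r_b, r_a)` pair per mention), the function name `edge_weights`, and the sample reader names are assumptions made for this example; how mentions are detected in comment text is not covered here.

```python
from collections import Counter, defaultdict


def edge_weights(mentions):
    """Return W[(r_b, r_a)] = (times r_b mentions r_a) / (all mentions made by r_b).

    `mentions` is a hypothetical input: an iterable of (r_b, r_a) pairs,
    one pair for each time reader r_b mentions reader r_a in a comment.
    """
    counts = Counter(mentions)      # times r_b mentions r_a
    totals = defaultdict(int)       # times r_b mentions any reader
    for (r_b, _r_a), n in counts.items():
        totals[r_b] += n
    return {(r_b, r_a): n / totals[r_b] for (r_b, r_a), n in counts.items()}


# Made-up example: bob mentions alice twice and carol once; carol mentions alice once.
mentions = [("bob", "alice"), ("bob", "alice"), ("bob", "carol"), ("carol", "alice")]
weights = edge_weights(mentions)
```

By construction, the weights on the outgoing edges of any reader who mentions someone sum to 1.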

<math>A(r_a) = d \cdot \frac{1}{|R|} + (1-d) \sum_{r_b} W_R(r_b, r_a) \cdot A(r_b) \qquad (1)</math>

<math>RM(w_k) = \sum_{c_i} tf(w_k, c_i) \cdot A(r_a) \qquad (2)</math>

The reader measure of a word <math>w_k</math>, denoted by <math>RM(w_k)</math>, is given in Equation 2, where <math>tf(w_k, c_i)</math> is the term frequency of word <math>w_k</math> in comment <math>c_i</math>.
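Equations 1 and 2 can be sketched in Python as below. This is only a minimal illustration under several assumptions that the text above does not state: the authority scores are obtained by a fixed number of iterations starting from a uniform vector, the value `d=0.15` is an arbitrary placeholder, <math>r_a</math> in Equation 2 is taken to be the author of comment <math>c_i</math>, and comments are tokenized by lowercasing and splitting on whitespace. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def reader_authority(readers, weights, d=0.15, iterations=50):
    """Iterate Equation 1: A(r_a) = d * 1/|R| + (1 - d) * sum_b W(r_b, r_a) * A(r_b).

    `weights` maps (r_b, r_a) to the edge weight; missing pairs count as 0.
    The default d and the fixed iteration count are assumptions of this sketch.
    """
    n = len(readers)
    authority = {r: 1.0 / n for r in readers}  # uniform start (assumption)
    for _ in range(iterations):
        updated = {}
        for r_a in readers:
            incoming = sum(
                weights.get((r_b, r_a), 0.0) * authority[r_b] for r_b in readers
            )
            updated[r_a] = d * (1.0 / n) + (1 - d) * incoming
        authority = updated
    return authority


def reader_measure(comments, authority):
    """Equation 2: RM(w_k) = sum over comments of tf(w_k, c_i) * A(author of c_i).

    `comments` is a hypothetical list of (author, text) pairs.
    """
    rm = defaultdict(float)
    for author, text in comments:
        for word, tf in Counter(text.lower().split()).items():
            rm[word] += tf * authority[author]
    return dict(rm)


# Made-up data: bob mentions alice (2/3) and carol (1/3); carol mentions alice.
readers = ["alice", "bob", "carol"]
weights = {("bob", "alice"): 2 / 3, ("bob", "carol"): 1 / 3, ("carol", "alice"): 1.0}
authority = reader_authority(readers, weights)

comments = [("alice", "great post"), ("bob", "great great point")]
rm = reader_measure(comments, authority)
```

In this toy graph alice is mentioned most and ends up with the highest authority, so each occurrence of a word in her comment contributes more to its reader measure than an occurrence in bob's comment.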