Citation
Meishan Hu, Aixin Sun and Ee-Peng Lim, "Comments-oriented blog summarization by sentence extraction", Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), 2007
Online version
[1]
Summary
This paper addresses blog summarization: important sentences of a blog post are identified by analyzing the comments made on it, and these sentences are then extracted to form a summary of the post. The authors use a Reader-Quotation-Topic (ReQuT) model to assign a representativeness score to each word appearing in user comments. "Significant" sentences from the post are then selected by one of two methods: density-based selection and summation-based selection. To evaluate their method, the authors employed human annotators to create reference summaries of the posts. Please see the dataset page for information about the dataset. Below is a closer look at the ReQuT model and the two sentence-selection methods, along with the results of the experiments.
Reader-Quotation-Topic (ReQuT) model
Each word is given a reader, a quotation and a topic measure. The motivation is that words written by "authoritative" readers, words found in comments that are quoted by other comments, and words that relate to widely discussed topics are more important than others. A ReQuT score of each kind is given to every word, and the overall importance of a word is judged by a weighted sum of the three scores.
The Math behind this
Reader Measure
Given the full set of comments to a blog post, the authors construct a directed reader graph G_R. Each node r_a is a reader, and an edge (r_b, r_a) exists if r_b mentions r_a in one of r_b's comments. The weight on an edge, W_R(r_b, r_a), is the ratio between the number of times r_b mentions r_a and the number of times r_b mentions any reader (including r_a). The authors compute reader authority using a ranking algorithm, shown in Equation 1, where |R| denotes the total number of readers of the blog and d is the damping factor.
![{\displaystyle A(r_{a})=d*1/|R|+(1-d)\Sigma W_{R}(r_{b},r_{a})*A(r_{b})............(1)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/8d494056090afd64bcfaaf92af9c5aa6bf5959e8)
![{\displaystyle RM(w_{k})=\Sigma tf(w_{k},c_{i})*A(r_{a})...............................(2)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/ccf9bb6594d089cf65d43cfd53411fb0f41c02d5)
The reader measure of a word w_k, denoted by RM(w_k), is given in Equation 2, where tf(w_k, c_i) is the term frequency of word w_k in comment c_i and r_a is the author of comment c_i.
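As a rough sketch of how Equations 1 and 2 might be computed (a hypothetical implementation; the function names, the input layout of pre-extracted mention counts, and the fixed iteration count standing in for a convergence test are all assumptions, not the authors' code):

```python
def reader_authority(mentions, readers, d=0.85, iters=50):
    """Iteratively compute A(r_a) = d * 1/|R| + (1-d) * sum_b W_R(r_b, r_a) * A(r_b).

    mentions: dict mapping reader r_b -> dict of {r_a: times r_b mentions r_a}.
    """
    R = len(readers)
    A = {r: 1.0 / R for r in readers}
    for _ in range(iters):
        nxt = {r: d / R for r in readers}
        for r_b, counts in mentions.items():
            total = sum(counts.values())
            for r_a, n in counts.items():
                w = n / total                  # W_R(r_b, r_a)
                nxt[r_a] += (1 - d) * w * A[r_b]
        A = nxt
    return A

def reader_measure(word, comments, authority):
    """RM(w_k) = sum over comments c_i of tf(w_k, c_i) * A(author of c_i).

    comments: list of (author, list_of_tokens) pairs.
    """
    return sum(toks.count(word) * authority[author] for author, toks in comments)
```

Readers mentioned more often by others accumulate higher authority, so the words those readers write contribute more to RM.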
Quotation Measure
For the set of comments associated with each blog post, the authors construct a directed acyclic quotation graph G_Q. Each node c_i is a comment, and an edge (c_j, c_i) indicates that c_j quoted sentences from c_i. The weight on an edge, W_Q(c_j, c_i), is 1 over the number of comments that c_j ever quoted. The authors derive the quotation degree D(c_i) of a comment c_i using Equation 3. A comment that is not quoted by any other comment receives a quotation degree of 1/|C|, where |C| is the number of comments associated with the given post.
![{\displaystyle D(c_{i})=1/|C|+\Sigma W_{Q}(c_{j},c_{i})*D(c_{j})...........(3)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/a72caaa2053ebbb29adfd72f181e3c7eb79cc464)
![{\displaystyle Q_{M}(w_{k})=\Sigma tf(w_{k},c_{i})*D(c_{i})..................(4)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/191eca0d3762b7c147e48bcbbe9ca593dc53aa81)
The quotation measure of a word w_k, denoted by QM(w_k), is given in Equation 4, where word w_k appears in comment c_i.
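Equation 3 propagates quotation degree from quoting comments to the comments they quote. A minimal sketch, assuming comments are listed oldest-first so that a comment can only quote earlier ones (the function name and input layout are illustrative, not from the paper):

```python
def quotation_degree(comments, quotes):
    """Compute D(c_i) = 1/|C| + sum over c_j quoting c_i of W_Q(c_j, c_i) * D(c_j).

    comments: list of comment ids, oldest first.
    quotes: dict mapping comment c_j -> list of comments that c_j quoted.
    Because the quotation graph is a DAG (comments only quote earlier comments),
    iterating newest-first finalizes each quoting comment's degree before it is
    propagated to the comments it quotes.
    """
    C = len(comments)
    D = {c: 1.0 / C for c in comments}     # base degree 1/|C| for every comment
    for c_j in reversed(comments):         # newest first = topological order
        quoted = quotes.get(c_j, [])
        if not quoted:
            continue
        w = 1.0 / len(quoted)              # W_Q(c_j, c_i)
        for c_i in quoted:
            D[c_i] += w * D[c_j]
    return D
```

A comment quoted (directly or transitively) by many later comments ends up with a high degree, so its words weigh more in QM.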
Topic Measure
Given the set of comments associated with each blog post, the authors group these comments into topic clusters using the single-pass incremental clustering algorithm presented in [1]. The authors conjecture that a hotly discussed topic has a large number of comments, all close to the topic cluster centroid. Thus they propose Equation 5 to compute the importance of a topic cluster t_u, where |c_i| is the length of comment c_i in number of words, C is the set of comments, and sim(c_i, t_u) is the cosine similarity between comment c_i and the centroid of topic cluster t_u.
![{\displaystyle T(t_{u})=1/\Sigma |c_{j}|*\Sigma |c_{i}|*sim(c_{i},t_{u})......................(5)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/24f34890371555f79ef7143ca735940db2908a0f)
![{\displaystyle TM(w_{k})=\Sigma tf(w_{k},c_{i})*T(t_{u})......................................(6)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/df12fa8faa1400c8e3d3094f3a73982e6bdefefd)
Equation 6 defines the topic measure of a word w_k, denoted by TM(w_k), where comment c_i is clustered into topic cluster t_u.
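Equation 5's cluster importance can be sketched as follows, under the assumption that comments are token lists and the cluster centroid is a term-frequency dict (the function names and data layout are hypothetical, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def term_freq(tokens):
    """Term-frequency dict for one comment."""
    d = {}
    for t in tokens:
        d[t] = d.get(t, 0) + 1
    return d

def topic_importance(cluster, all_comments, centroid):
    """T(t_u): length-weighted closeness of a cluster's comments to its
    centroid, normalized by the total length of all comments (Equation 5)."""
    total_len = sum(len(c) for c in all_comments)
    return sum(len(c) * cosine(term_freq(c), centroid) for c in cluster) / total_len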
Overall Word Representativeness or Importance Score
The representativeness score Rep(w_k) of a word w_k is a combination of the reader, quotation and topic measures in the ReQuT model. The three measures are first normalized independently by their corresponding maximum values and then combined linearly to derive Rep(w_k) using Equation 7, where RM'(w_k), QM'(w_k) and TM'(w_k) denote the normalized measures, and α, β and γ are coefficients (0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0).

<math>Rep(w_k) = \alpha \cdot RM'(w_k) + \beta \cdot QM'(w_k) + \gamma \cdot TM'(w_k)</math>............(7)
Sentence Selection
Significant sentences are selected from the blog post by scoring each sentence with the representativeness scores of its words. In density-based selection, the score of a sentence s_i is given by Equation 8, where K is the number of keywords in s_i, Rep(w_k) is the representativeness score of keyword w_k, and distance(w_j, w_k) is the number of non-keywords (including stopwords) between the two adjacent keywords w_j and w_k in s_i.

<math>Rep(s_i) = \frac{1}{K(K+1)} \sum \frac{Rep(w_j) \cdot Rep(w_k)}{distance(w_j, w_k)^2}</math>............(8)

In summation-based selection (Equation 9), |s_i| is the length of sentence s_i in number of words (including stopwords), and τ (τ > 0) is a parameter to flexibly control the contribution of a word's representativeness score.
<math>Rep(s_i) = \frac{1}{|s_i|} \left( \sum Rep(w_k)^\tau \right)^{1/\tau}</math>............(9)
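The two selection scores can be sketched as follows (hypothetical function names; since the count of non-keywords between two adjacent keywords can be zero, this sketch measures distance as the positional gap, i.e. non-keywords between plus one, to avoid division by zero — an implementation assumption, not necessarily the paper's handling):

```python
def density_score(sentence, rep):
    """Density-style sentence score: rewards sentences whose high-scoring
    keywords occur close together. `rep` maps keywords to Rep(w_k); tokens
    absent from `rep` are non-keywords and only add distance."""
    positions = [(i, w) for i, w in enumerate(sentence) if w in rep]
    K = len(positions)                     # number of keywords in the sentence
    if K < 2:
        return 0.0
    total = 0.0
    for (i, wj), (k, wk) in zip(positions, positions[1:]):
        gap = k - i                        # non-keywords between w_j and w_k, plus one
        total += rep[wj] * rep[wk] / gap ** 2
    return total / (K * (K + 1))

def summation_score(sentence, rep, tau=2.0):
    """Summation-style score: Rep(s_i) = (1/|s_i|) * (sum Rep(w_k)^tau)^(1/tau).
    Larger tau lets the top-scoring words dominate the sentence score."""
    if not sentence:
        return 0.0
    s = sum(rep.get(w, 0.0) ** tau for w in sentence)
    return (s ** (1.0 / tau)) / len(sentence)
```

Sentences are then ranked by either score and the top-ranked ones are extracted as the summary.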
Results