David M. Blei and Pedro J. Moreno, Topic Segmentation with an Aspect Hidden Markov Model, SIGIR 2001

== Citation ==

David M. Blei and Pedro J. Moreno, Topic Segmentation with an Aspect Hidden Markov Model, SIGIR 2001

== Online version ==

Link to paper
 
== Summary ==
  
This paper addresses the problem of [[AddressesProblem::Topic Segmentation]] on unstructured text. It builds on a previous approach that uses a hidden Markov model ([[UsesMethod::HMM]]) to model the sequence of words in a document as being generated by a latent topic variable in a time series. The paper extends this idea by adding Hofmann's [[UsesMethod::Aspect model]] to the HMM so that topical dependencies between words are also modeled, yielding a more cohesive segmentation model.
  
The paper applies this [[UsesMethod::aspect HMM]] (AHMM) model to segment unbroken streams of New York Times articles as well as noisy transcripts of radio programs on [[UsesDataset::SpeechBot]]. They show that AHMM outperforms HMM for this task.
  
 
== Description of the method ==
  
In the HMM approach to segmenting a document, the document is treated as a collection of mutually independent sets of words. Each set is probabilistically generated by a hidden topic variable in a time series, and transition probabilities determine the value of the next hidden topic variable in the series.
  
[[File:HMM.png]]
  
The generative process is as follows: choose a topic ''z'', then generate a set of ''L'' independent words ''w'' from the distribution over words associated with that topic; then choose another topic from a distribution of allowed transitions between topics. Given an unsegmented document, the most likely sequence of topics that generates the observed ''L''-word sets in the document is computed (using the [[UsesMethod::Viterbi]] algorithm). Topic breaks occur at points where the values of two consecutive topics differ.
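As an illustration, here is a minimal Python sketch of this generative process; the topic count, vocabulary size, and all probability tables are hypothetical placeholders, not the paper's parameters.

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: K topics, vocabulary of V word types, window size L.
K, V, L = 3, 1000, 50
pi = np.full(K, 1.0 / K)                       # initial topic distribution
A = rng.dirichlet(np.ones(K), size=K)          # topic transition matrix, rows sum to 1
B = rng.dirichlet(np.ones(V), size=K)          # per-topic word distributions P(w|z)

def generate_document(num_windows):
    """Sample a sequence of L-word windows from the HMM."""
    windows, topics = [], []
    z = rng.choice(K, p=pi)
    for _ in range(num_windows):
        words = rng.choice(V, size=L, p=B[z])  # L words drawn independently given z
        windows.append(words)
        topics.append(z)
        z = rng.choice(K, p=A[z])              # transition to the next topic
    return windows, topics

windows, topics = generate_document(10)
# Topic breaks fall wherever consecutive topics differ.
breaks = [t for t in range(1, len(topics)) if topics[t] != topics[t - 1]]
</pre>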
  
The drawback of the HMM method is the [[UsesMethod::Naive Bayes]] assumption of conditional independence of the words within each ''L''-word set given a topic:
  
<math>P(o_t|z)=\prod_{i=1}^L P(w_i|z)</math>
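In log space this factorization is just a sum of per-word log probabilities. A minimal sketch (the ''K'' x ''V'' matrix <code>log_B</code> of per-topic log word probabilities is a hypothetical input, not code from the paper):

<pre>
import numpy as np

def log_obs_prob(window, log_B):
    """log P(o_t | z) for every topic z under the Naive Bayes assumption.

    window: sequence of word ids in the L-word observation window
    log_B:  (K, V) array of log P(w | z)
    """
    return log_B[:, window].sum(axis=1)   # shape (K,), one score per topic
</pre>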
  
This assumption works well as ''L'' becomes large. However, the larger ''L'' becomes, the less precise (coarser) the segmentation is.
  
The aspect HMM segmentation model does away with this Naive Bayes assumption of conditional independence of words by adding a probability distribution (an aspect model) over pairs of discrete random variables: in this case, the pair consists of the ''L''-word observation window and a word. The observation window is not represented as the set of its words but simply as a label that identifies it; it is associated with its set of words through the window-word pairs. Under this aspect model, an observation window ''o'' and a word ''w'' are independent of each other given a hidden topic variable ''z'':
  
<math>P(o,w,z)=P(o|z)P(w|z)P(z)</math>
  
[[File:aspectHMM.png]]
  
The paper uses [[UsesMethod::Expectation Maximization]] and [[UsesMethod::Bayes' Law]] to estimate the parameters <math>P(z)</math>, <math>P(o|z)</math>, and <math>P(w|z)</math>.
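A simplified batch sketch of such EM updates for an aspect (PLSA-style) model is given below; <code>counts</code> is a hypothetical window-word co-occurrence count matrix, and the paper's actual implementation may differ in details such as initialization or tempering.

<pre>
import numpy as np

def aspect_model_em(counts, K, iters=50, seed=0):
    """Estimate P(z), P(o|z), P(w|z) for an aspect model by EM.

    counts: (O, V) matrix of window-word co-occurrence counts.
    """
    rng = np.random.default_rng(seed)
    O, V = counts.shape
    p_z = np.full(K, 1.0 / K)
    p_o_z = rng.dirichlet(np.ones(O), size=K)   # (K, O)
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # (K, V)

    for _ in range(iters):
        # E-step: responsibilities P(z | o, w) for every (o, w) pair (Bayes' law).
        joint = p_z[:, None, None] * p_o_z[:, :, None] * p_w_z[:, None, :]  # (K, O, V)
        post = joint / joint.sum(axis=0, keepdims=True)

        # M-step: re-estimate the three distributions from expected counts.
        weighted = counts[None, :, :] * post                                # (K, O, V)
        p_o_z = weighted.sum(axis=2)
        p_o_z /= p_o_z.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_o_z, p_w_z
</pre>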
  
Given an unsegmented document, the aspect HMM divides its words into observation windows of size ''L'' and runs the Viterbi algorithm to find the most likely sequence of hidden topics that generates the given document. Segmentation breaks occur when the topic of one window differs from that of the next window.
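A sketch of this decoding step as log-space Viterbi over windows follows; the transition matrix, initial distribution, and per-window observation scores are assumed inputs (e.g. from estimates like those above), and this is a generic Viterbi routine rather than the paper's code.

<pre>
import numpy as np

def viterbi_segment(log_obs, log_A, log_pi):
    """Return the most likely topic per window and the implied break points.

    log_obs: (T, K) log P(o_t | z), e.g. from the Naive Bayes or aspect model
    log_A:   (K, K) log transition probabilities
    log_pi:  (K,)   log initial topic probabilities
    """
    T, K = log_obs.shape
    delta = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j]: come from topic i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    # Backtrack the best topic sequence.
    topics = np.empty(T, dtype=int)
    topics[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        topics[t] = back[t + 1, topics[t + 1]]
    breaks = [t for t in range(1, T) if topics[t] != topics[t - 1]]
    return topics, breaks
</pre>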
  
== Datasets used ==
  
The aspect HMM segmentation model is applied on two corpora:
* A corpus of [[UsesDataset::SpeechBot]] transcripts from ''All Things Considered'' (ATC), a daily news program on National Public Radio. This dataset consists of 4,917 segments with 35,777 word types and about 4 million word tokens. The word error rate in this corpus is estimated to be in the 20% to 50% range.
* A corpus of 3,830 articles from the New York Times (NYT) consisting of about 4 million word tokens and 70,792 word types.
  
The aspect HMM is trained with 20 hidden topics in the experiments.
  
== Experimental Results ==
  
Three variants of the two corpora are used in the experiments:
* random sequences of segments from the ATC corpus
* random sequences of segments from the NYT corpus
* the actual aired sequences of ATC segments (in these audio transcripts, clear demarcations of segment breaks are not explicitly given; this is the primary problem the paper is trying to tackle)
  
The paper uses the [[UsesMethod::co-occurrence agreement probability]] (CoAP) to quantitatively evaluate the segments produced by the model. In short, CoAP measures how often a segmentation decision is correct with respect to two words that are ''k'' words apart in the document.
  
A useful interpretation of the CoAP is through its complement:
  
<math>P(disagreement) = P(missed)P(seg) + (1 - P(seg))P(false)</math>
  
where <math>P(seg)</math> is the a priori probability of a segment, <math>P(missed)</math> is the probability of missing a segment, and <math>P(false)</math> is the probability of hypothesizing a segment where there is no segment.
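As a rough illustration, the agreement probability for a fixed word distance ''k'' can be computed as sketched below; segment boundaries are taken to be word positions at which a new segment starts, and this is a common windowed formulation rather than necessarily the paper's exact implementation.

<pre>
def coap_at_k(ref_breaks, hyp_breaks, num_words, k):
    """Fraction of word pairs (i, i+k) on which the reference and hypothesized
    segmentations agree about whether the two words fall in the same segment."""
    def same_segment(breaks, i, j):
        # True if no segment boundary falls strictly after i and at or before j.
        return not any(i < b <= j for b in breaks)

    agree = 0
    total = num_words - k
    for i in range(total):
        if same_segment(ref_breaks, i, i + k) == same_segment(hyp_breaks, i, i + k):
            agree += 1
    return agree / total
</pre>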
  
On the random sequences of segments, the model performs better (i.e. produces better segmentation) on the NYT randomized segments than on the ATC randomized segments, probably because NYT is a cleaner, less error-prone corpus than ATC. The result on the actual aired ATC sequence is worse than on either of the randomized test sets.
  
[[File:resultAHMM.png]]
  
When comparing the performance of the aspect HMM (AHMM) to the HMM in segmenting the NYT corpus, AHMM outperforms HMM segmentation for small window widths. As the window size increases, HMM does increasingly well, since all words are counted equally in the larger windows, which better fits HMM's Naive Bayes assumption of mutual independence between words. However, as the window size increases, the precision of the segmenter also decreases because the segmentation becomes coarser.
  
[[File:resultAHMM-HMM.png]]
  
== Discussion ==
  
The novelty of the paper lies in its addition of the aspect model to the HMM for segmenting documents. It removes the HMM's naive assumption that words are generated independently given the hidden topic variable: instead, the words associated with the selected hidden topic are generated via the aspect model.
  
However, one possible drawback of the approach is its very coarse approximation of the probability distribution over observation windows ''o''. The Viterbi algorithm requires the observation probability <math>P(o_t|z)</math> at each time step. While the HMM uses its Naive Bayes assumption to compute this distribution, the AHMM can only compute conditional probabilities for observation windows it was exposed to in training; in testing, the observation window <math>o_t</math> may not be one the model has seen before. The paper therefore uses an online approximation to EM to find <math>P(o_t|z)</math>, refining its probability estimate recursively as it sees more words in the observation window. Words at the beginning of the window are weighted more heavily than words towards the end. Therefore, as the window size increases, additional words have less impact on the observation distribution and the segmenter does not perform as well.
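As a hedged illustration only (a generic recursive Bayes update, not the paper's exact online EM approximation), one way to picture such recursive refinement is to fold the window's words into a topic posterior one at a time:

<pre>
import numpy as np

def online_topic_posterior(window, p_z, p_w_z):
    """Recursively update P(z | w_1..w_i) one word at a time.

    window: sequence of word ids; p_z: (K,) topic prior; p_w_z: (K, V) P(w|z).
    """
    post = p_z.copy()
    for w in window:
        post = post * p_w_z[:, w]
        # As the posterior sharpens, each additional word tends to move it less,
        # loosely echoing the diminishing impact of later words described above.
        post /= post.sum()
    return post
</pre>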
  
Another possible drawback is that AHMM does not model topic breaks explicitly: topic breaks are only implied when two adjacent windows have different topic variables. This lack of explicit modeling of topic breaks possibly causes the model's tendency to undersegment, as indicated by the high <math>P(missed)</math> values in the experiments. In the future, direct modeling of topic breaks could be explored. Overlapping windows might also improve the precision of the segmentation, and automatically assigning a label to each segment (i.e. topic labeling) is another interesting future direction.
  
== Related Papers ==
 
 
The main contribution of this paper is its addition of the aspect model to the HMM for segmenting documents. Other earlier works on topic segmentation include:

* [[RelatedPaper::Doug Beeferman, Adam Berger, and John Lafferty. Statistical models for text segmentation. Machine Learning, 1999]]
* [[RelatedPaper::Marti A. Hearst. Context and structure in automated full-text information access. University of California at Berkeley dissertation. Computer Science Division Technical Report, 1994]]
* [[RelatedPaper::P. van Mulbregt, I. Carp, L. Gillick, S. Lowe, and J. Yamron. Text segmentation and topic tracking on broadcast news via a hidden markov model approach. 1998]]
* [[RelatedPaper::Thomas Hofmann, Probabilistic Latent Semantic Indexing, SIGIR 1999]]
