Unsupervised Modeling of Dialog Acts in Asynchronous Conversations
Shafiq Joty, Giuseppe Carenini, Chin-Yew Lin. Unsupervised Modeling of Dialog Acts in Asynchronous Conversations. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2011. Barcelona, Spain.
This paper models dialog acts in asynchronous conversations in an unsupervised setting. It targets 12 dialog acts: Statement, Polite Mechanism, Yes-no question, Action motivator, Wh-question, Accept response, Open-ended question, Acknowledge and appreciate, Or-clause question, Reject response, Uncertain response, and Rhetorical question. The experiments were run on conversations from two domains: emails and discussion fora.
The authors first treated the task as a clustering problem. Using a graph-theoretic framework, they represented each conversation as a Fragment Quotation Graph (FQG), in which each email fragment or forum post is a node and an edge joins two nodes if one fragment or post responds to the other. Edge weights were computed from a number of similarity features, described later. An N-mincut was then used to cluster the graphs. This approach did not perform well: although the authors took specific measures to avoid topic-based clustering, the model still confused dialog acts with topics.
The authors therefore turned to a hidden Markov model (HMM) to exploit the sequential structure of the conversations. Given the clustering results, they were concerned that an HMM alone would still fail to separate topic modeling from dialog act modeling, so they combined the HMM with multinomial mixtures. The final model beat the baseline by a significant margin.
The dialog act tagset was taken from the Meeting Recorder Dialog Act (MRDA) tagset created by Dhillon et al. The training data was unlabeled, whereas the test data was labeled by two human annotators. The training data for emails was a set of 23,957 emails from the W3C Email Corpus, while that for discussion fora was a set of 25,000 forum threads from the travel advice site TripAdvisor. The test data for emails was a set of 40 email threads from the BC3 Corpus (Ulrich et al.), while that for discussion fora was a set of 200 forum threads. The dialog act categories assigned by the human annotators had a similar distribution in the email set and the discussion thread set, as shown in the figure below. The agreement between the two annotators was 0.79 for the email dataset and 0.73 for the forum dataset.
Fragment quotation graphs (FQGs) were then built from the email and forum data, as described above.
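As a rough illustration of the graph structure described above, the following minimal Python sketch stores an FQG as an adjacency mapping. The fragment identifiers and the `replies` input format are hypothetical, and the sketch does not reproduce how the paper extracts fragments from quoted text.

```python
from collections import defaultdict

def build_fqg(replies):
    """Build a fragment quotation graph as an adjacency mapping.

    `replies` is a hypothetical list of (fragment, replied_to) pairs;
    each pair adds a directed edge fragment -> replied_to.
    """
    graph = defaultdict(set)
    for fragment, replied_to in replies:
        graph[fragment].add(replied_to)
        graph[replied_to]  # touch the key so the target node exists too
    return {node: sorted(targets) for node, targets in graph.items()}

def adjacent_pairs(graph):
    """Return the unordered pairs of fragments joined by an edge."""
    return sorted({tuple(sorted((u, v))) for u, vs in graph.items() for v in vs})

# Toy thread: b and c both respond to a, and d responds to b.
fqg = build_fqg([("b", "a"), ("c", "a"), ("d", "b")])
print(adjacent_pairs(fqg))
```

The adjacent pairs are what a later step would use to decide which sentences may be linked in the similarity graph.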
The FQG was then transformed into a similarity graph in which the sentences of the emails or forum posts form the nodes, and nodes representing sentences in adjacent posts (as inferred from the FQG) are joined by edges. Each edge is assigned a weight given by some measure of similarity between the two sentences. The nodes were then clustered under the assumption that sentences within the same cluster represent the same dialog act. The clustering was modeled as an N-mincut graph clustering problem with the following cut criterion:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(B, A)}{\mathrm{assoc}(B, V)}$$
where $\mathrm{cut}(A, B)$ is the total connection from nodes in partition $A$ to nodes in partition $B$, $\mathrm{assoc}(A, V)$ is the total connection from nodes in $A$ to all nodes in the graph, and $\mathrm{assoc}(B, V)$ is defined similarly.
The authors experimented with several measures of sentence similarity:
(1) BOW: a bag-of-words measure, in which the similarity between two sentences is the cosine similarity between the vectors of TF-IDF scores of their words.
(2) BOW-M: a variant of BOW in which nouns are masked, so as to prevent clustering by topic rather than by dialog act.
(3) WSK: a Word-Subsequence Kernel measure, which maps the sequence of words (POS tags in this paper's experiments) to a higher-dimensional space and measures similarity in that space.
(4) Extended WSK: a WSK in which syntactic/semantic features of the words are used along with the words (or rather, the POS tags).
(5) Dependency similarity: a measure that counts the co-occurring Basic Elements (BEs) in the dependency parse trees of the two sentences, where a BE is a (head, modifier, relation) triple.
(6) Syntactic tree similarity: a measure that uses the Tree Kernel function (Collins and Duffy).
(7) A linear combination of all these measures.
As a baseline, every sentence was assigned the dialog act "Statement", the most frequent dialog act in the annotated test set. The results of these experiments are presented in the table below. For evaluation, a 1-to-1 metric was used: the clusters in the annotated test set were paired with the clusters in the output so that the pairwise overlap between the two sets was maximal, and the mean percentage of this overlap over the clusters was reported as the final score. As can be seen, none of the methods surpassed the baseline.
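To make the cut criterion concrete, here is a minimal Python sketch that evaluates the normalized cut of a given two-way partition. The `weights` format and the toy edge weights are hypothetical, and the sketch only scores a partition; it does not search for the best one, which the paper's clustering step would have to do.

```python
def ncut(weights, part_a, part_b):
    """Normalized cut value of a two-way partition.

    `weights` maps frozenset({u, v}) to a symmetric edge weight
    (a hypothetical input format chosen for this sketch).
    """
    def w(u, v):
        return weights.get(frozenset((u, v)), 0.0)

    nodes = part_a | part_b
    # cut(A, B): total weight of edges crossing the partition.
    cut = sum(w(u, v) for u in part_a for v in part_b)
    # assoc(X, V): total weight from nodes in X to every node in the graph.
    assoc_a = sum(w(u, v) for u in part_a for v in nodes if u != v)
    assoc_b = sum(w(u, v) for u in part_b for v in nodes if u != v)
    return cut / assoc_a + cut / assoc_b

# Toy similarity graph: two tight pairs joined by one weaker edge.
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("c", "d")): 1.0,
    frozenset(("b", "c")): 0.5,
}
good = ncut(weights, {"a", "b"}, {"c", "d"})
bad = ncut(weights, {"a"}, {"b", "c", "d"})
print(good, bad)
```

On this toy graph the split that keeps the two tight pairs together gets the lower value, which is the behavior the criterion is meant to reward. The sketch assumes both partitions have at least one edge; otherwise the division fails.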
Contrary to expectation, the BOW-M measure yielded worse results than the BOW measure.
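The difference between the plain and noun-masked bag-of-words measures can be sketched as follows. This is only an illustration under several assumptions: sentences arrive already POS-tagged as (word, tag) pairs with Penn Treebank-style noun tags starting with "NN", masking replaces each noun with a single placeholder token, and raw term counts stand in for TF-IDF weights to keep the example short. None of these details are confirmed by the summary above.

```python
import math
from collections import Counter

def bag_of_words(tagged, mask_nouns=False):
    """Count tokens; optionally replace every noun by one placeholder."""
    tokens = []
    for word, tag in tagged:
        if mask_nouns and tag.startswith("NN"):
            tokens.append("<NOUN>")  # assumed masking scheme
        else:
            tokens.append(word.lower())
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

# Two wh-questions about different topics (hypothetical tagged input).
s1 = [("Where", "WRB"), ("is", "VBZ"), ("the", "DT"), ("hotel", "NN")]
s2 = [("Where", "WRB"), ("is", "VBZ"), ("the", "DT"), ("museum", "NN")]

bow = cosine(bag_of_words(s1), bag_of_words(s2))
bow_m = cosine(bag_of_words(s1, mask_nouns=True), bag_of_words(s2, mask_nouns=True))
print(bow, bow_m)
```

In this toy case masking raises the similarity of two sentences that share a dialog act but differ in topic, which is the intended effect; the experiments reported here show that this intuition did not carry over to the real data.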
Probabilistic Conversation Models
The authors reasoned that the graph-theoretic framework performed poorly because it modeled neither the sequential structure of the conversations nor other important features such as speaker, relative position, and length. They therefore modeled the dialog acts with an HMM in which the dialog acts are hidden states that emit observable sentences. This model is shown in the figure below. A conversation is a sequence of hidden dialog acts $D_i$; each $D_i$ produces an observable sentence $X_i$; each $X_i$ is represented by its bag-of-words or unigrams ($W_i$, shown in the plate), its speaker ($S_i$), its relative position, i.e. the position of the sentence in the post normalized by the total number of sentences in the post ($P_i$), and its length ($L_i$).
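A symmetric Dirichlet prior with hyperparameter $\alpha$ was placed on each of the six multinomials (i.e. the distributions over initial states, transitions, unigrams, speakers, positions, and lengths). The authors then computed the MAP estimate using the Baum-Welch (EM) algorithm with forward-backward. Specifically, given the n-th sequence $X_{n,1:T_n}$, forward-backward computes:
$$\gamma_{n,i}(j) = p(D_i = j \mid X_{n,1:T_n}, \theta)$$
$$\xi_{n,i}(j, k) = p(D_{i-1} = j, D_i = k \mid X_{n,1:T_n}, \theta)$$
where $D_i$ is the dialog act of the i-th sentence $X_i$, $\theta$ denotes the model parameters, and the local evidence is given by:
$$\psi_{n,i}(j) = p(X_i \mid D_i = j, \theta) = \Big[\prod_{w \in W_i} p(w \mid D_i = j)\Big] \, p(S_i \mid D_i = j) \, p(P_i \mid D_i = j) \, p(L_i \mid D_i = j)$$
with $W_i$, $S_i$, $P_i$, and $L_i$ the words, speaker, relative position, and length of $X_i$.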
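The E-step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the local evidence is the product of the per-feature probabilities, that position and length have already been bucketed into discrete values (the bucket names are made up), and it returns only the single-state posteriors. The M-step and the Dirichlet prior are omitted.

```python
def local_evidence(sentence, emissions):
    """Probability of one sentence under each hidden state.

    `sentence` is a dict with keys "words", "speaker", "position", "length";
    `emissions[j]` maps each key to a dict of probabilities for state j.
    """
    psi = []
    for em in emissions:
        p = em["speaker"][sentence["speaker"]]
        p *= em["position"][sentence["position"]]
        p *= em["length"][sentence["length"]]
        for word in sentence["words"]:
            p *= em["words"][word]
        psi.append(p)
    return psi

def forward_backward(psis, init, trans):
    """Return gamma[i][j] = p(state_i = j | all sentences), with scaling."""
    T, K = len(psis), len(init)
    alpha, scale = [], []
    for i in range(T):
        if i == 0:
            a = [init[j] * psis[0][j] for j in range(K)]
        else:
            a = [psis[i][j] * sum(alpha[i - 1][k] * trans[k][j] for k in range(K))
                 for j in range(K)]
        z = sum(a)
        alpha.append([x / z for x in a])
        scale.append(z)
    beta = [[1.0] * K for _ in range(T)]
    for i in range(T - 2, -1, -1):
        b = [sum(trans[j][k] * psis[i + 1][k] * beta[i + 1][k] for k in range(K))
             for j in range(K)]
        beta[i] = [x / scale[i + 1] for x in b]
    gamma = []
    for i in range(T):
        g = [alpha[i][j] * beta[i][j] for j in range(K)]
        z = sum(g)
        gamma.append([x / z for x in g])
    return gamma

# Hypothetical two-state model: state 0 is question-like, state 1 thanks-like.
emissions = [
    {"words": {"where": 0.6, "thanks": 0.1}, "speaker": {"A": 0.5, "B": 0.5},
     "position": {"start": 0.7, "end": 0.3}, "length": {"short": 0.5, "long": 0.5}},
    {"words": {"where": 0.1, "thanks": 0.6}, "speaker": {"A": 0.5, "B": 0.5},
     "position": {"start": 0.3, "end": 0.7}, "length": {"short": 0.5, "long": 0.5}},
]
conversation = [
    {"words": ["where"], "speaker": "A", "position": "start", "length": "short"},
    {"words": ["thanks"], "speaker": "B", "position": "end", "length": "short"},
]
psis = [local_evidence(s, emissions) for s in conversation]
gamma = forward_backward(psis, init=[0.5, 0.5], trans=[[0.3, 0.7], [0.6, 0.4]])
print(gamma)
```

Scaling the forward messages at each step keeps the products from underflowing on long conversations; the pairwise posteriors needed to update the transition matrix can be computed from the same messages.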
HMM Plus Mixture Model
Following earlier work by Ritter et al., the authors modeled the emissions of the HMM as a mixture of multinomials. This new model is presented in the figure below.
In the final experiments, the number of mixture components was set to 3, after experimenting with values from 1 to 5.
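One way to read "a mixture of multinomials" for the word emissions is that each hidden state owns several word distributions plus a weight for each, and the probability of a sentence's words is the weighted sum over components. The Python sketch below computes that quantity in log space. It is an assumption about the form of the emission rather than a transcription of the paper's model, and all probabilities in the toy example are invented.

```python
import math

def log_mixture_emission(words, mix_weights, word_probs):
    """log p(words | state) under a mixture of multinomials.

    mix_weights[m] is p(component m | state); word_probs[m][w] is
    p(word w | state, component m). All probabilities are assumed to be
    nonzero (e.g. smoothed). Uses log-sum-exp for numerical stability.
    """
    logs = []
    for weight, probs in zip(mix_weights, word_probs):
        logs.append(math.log(weight) + sum(math.log(probs[w]) for w in words))
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))

# One hidden state with three components, matching the setting of 3 above.
mix_weights = [0.5, 0.3, 0.2]
word_probs = [
    {"where": 0.4, "is": 0.4, "hotel": 0.2},
    {"where": 0.1, "is": 0.1, "hotel": 0.8},
    {"where": 0.2, "is": 0.3, "hotel": 0.5},
]
log_p = log_mixture_emission(["where", "hotel"], mix_weights, word_probs)
print(math.exp(log_p))
```

The intuition is that separate components can absorb topic-specific vocabulary, leaving the state itself to capture the dialog act.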
The results are given in the table below, which reports the 1-to-1 overlap scores for the baseline (all Statements), HMM, and HMM+Mix models on emails and discussion forum posts. The experiments were run both with the temporal order of the posts and with their order in the FQG. The HMM+Mix model performs best and beats the baseline by a significant margin.
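For readers unfamiliar with the 1-to-1 score, the following Python sketch shows one plausible way to compute it. It assumes the score is the number of sentences falling in matched cluster pairs, under the best one-to-one pairing, divided by the total number of sentences; the exact averaging used in the paper may differ. The pairing is found by brute force over permutations, so the sketch is only practical for a handful of clusters.

```python
from itertools import permutations

def one_to_one(gold, pred):
    """Best one-to-one overlap between gold labels and predicted clusters."""
    gold_ids = sorted(set(gold))
    pred_ids = sorted(set(pred))
    # Pad the shorter side with dummy ids so both sides have equal size.
    size = max(len(gold_ids), len(pred_ids))
    gold_ids += [None] * (size - len(gold_ids))
    pred_ids += [None] * (size - len(pred_ids))
    # Count how many items each (gold label, predicted cluster) pair shares.
    overlap = {}
    for g, p in zip(gold, pred):
        overlap[(g, p)] = overlap.get((g, p), 0) + 1
    best = 0
    for perm in permutations(gold_ids):
        total = sum(overlap.get((g, p), 0) for g, p in zip(perm, pred_ids))
        best = max(best, total)
    return best / len(gold)

# Toy example: S = Statement, Q = question, P = Polite Mechanism.
gold = ["S", "S", "Q", "Q", "S", "P"]
pred = [0, 0, 1, 1, 1, 2]
score = one_to_one(gold, pred)
baseline = one_to_one(gold, [0] * len(gold))  # everything in one cluster
print(score, baseline)
```

Under this definition the all-Statement baseline scores exactly the relative frequency of the majority label, which is why it is hard to beat on data dominated by statements.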
R. Dhillon, S. Bhagat, H. Carvey, and E. Shriberg. Meeting Recorder Project: Dialog Act Labeling Guide. Technical report, ICSI, 2004.
J. Ulrich, G. Murray, and G. Carenini. A Publicly Available Annotated Corpus for Supervised Email Summarization. In EMAIL'08 Workshop. AAAI, 2008.
M. Collins and N. Duffy. Convolution Kernels for Natural Language. In NIPS-2001, pages 625–632, Vancouver, Canada, 2001.
A. Ritter, C. Cherry, and B. Dolan. Unsupervised Modeling of Twitter Conversations. In HLT: NAACL'10, LA, California, 2010. ACL.