Modeling of Stylistic Variation in Social Media with Stretchy Patterns
 
  

Citation

Philip Gianfortoni, David Adamson, Carolyn P. Rosé, "Modeling of Stylistic Variation in Social Media with Stretchy Patterns", Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, 2011

Online version

Click here to download

Brief Summary

This paper presents a new technique for automatic pattern extraction, with the goal of finding better features for modeling linguistic style, especially in settings where training data is scarce. The authors propose "stretchy patterns": extraction patterns that allow flexible-length gaps between their constituents, so that a single pattern covers many more token sequences than a rigid n-gram. They show that these patterns are more effective than the usual unigram, bigram, and POS-based features for gender-based classification of linguistic style, and they further claim that stretchy-pattern features generalize well across domains.

Dataset

The dataset was drawn from the Blog Authorship Corpus, in which each post is labeled with metadata such as the author's gender and occupation. To show that stretchy patterns are more effective than the usual features, 150 posts by male authors and 150 posts by female authors were sampled for each of the 10 most common occupations in the corpus, for a total of 3,000 posts. To test the domain generality of stretchy patterns, a similar set of 100 male-authored and 100 female-authored posts was taken for each of the same 10 occupation categories, and leave-one-out (LOO) cross-validation was performed over the categories. A sketch of this sampling setup appears below.
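The following is a minimal sketch of the balanced sampling described above, not the authors' code; it assumes a hypothetical pandas DataFrame named blogs with "text", "gender", and "occupation" columns, since the paper does not specify how the corpus was loaded or sampled.

<pre>
# Minimal sketch of the balanced sampling described above (not the
# authors' code). Assumes a hypothetical DataFrame `blogs` with
# "text", "gender", and "occupation" columns.
import pandas as pd

def balanced_sample(blogs, per_group, n_occupations=10, seed=0):
    top = blogs["occupation"].value_counts().head(n_occupations).index
    parts = []
    for occ in top:
        for gender in ("male", "female"):
            pool = blogs[(blogs["occupation"] == occ) & (blogs["gender"] == gender)]
            parts.append(pool.sample(n=per_group, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# classification_set = balanced_sample(blogs, per_group=150)  # 3,000 posts
# domain_set = balanced_sample(blogs, per_group=100)          # for LOO CV
</pre>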

Stretchy Patterns

The authors first define the notion of a category. Each token in the corpus is taken to consist of a surface-form lexeme plus any additional syntactic or semantic information about the word; each of these available forms of a token is called a type. A category is a set of word types, and every type must belong to at least one category. Gap is a special category containing all types that are not part of any other category; types belonging to a defined category may also be explicitly added to the Gap category. Since a token may belong to multiple categories, the same token sequence can generate, and therefore match, multiple patterns. A sketch of this representation follows.
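As an illustration (not the authors' implementation), categories can be represented as sets of types, with a lookup that returns every category label a token can stand for; the category names below follow Table 1, and surface forms are assumed lowercased for lookup.

<pre>
# Sketch of the category representation (illustrative only). A token is
# represented by the set of its types, e.g. {"writes", "VBZ"} for the
# surface form plus its POS tag.
CATEGORIES = {
    "adj": {"JJ", "JJR", "JJS"},                  # POS-tag types
    "cc": {"CC", "IN"},
    "first-pron": {"i", "me", "my", "mine", "im", "i'm"},
    "third-pron": {"he", "him"},                  # surface-form types
}

def categories_of(token_types):
    """Every category label matched by any type of this token; tokens
    that match no defined category fall into the special Gap category."""
    labels = {label for label, members in CATEGORIES.items()
              if token_types & members}
    return labels or {"GAP"}

# categories_of({"he", "PRP"})      -> {"third-pron"}
# categories_of({"walked", "VBD"})  -> {"GAP"}
</pre>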

A stretchy pattern is then defined as a sequence of categories that must not begin or end with a Gap category. When a pattern is written out, any run of adjacent Gap instances is collapsed into the string "GAP+", and every other category instance is written as its label. By convention, the label of a singleton category is the name of the type it contains (thus "writes" would label a category containing only the surface form "writes", and "VBZ" would label a category containing only the POS tag "VBZ"). Following Tsur (2010), the overall length of a pattern is restricted: no more than six Gap and six non-Gap category instances are allowed, where for Gaps the limit applies to the underlying tokens rather than the collapsed GAP+ form. (Both the gap collapsing and the zero-to-six gap limit are extractor parameters; these settings were used for all experiments.) The table below lists the word categories the authors defined.
Table 1. Word Categories

adj: JJ, JJR, JJS
cc: CC, IN
md: MD
end: <period>, <comma>, <question>, <exclamation>
first-pron: I, me, my, mine, im, I'm
second-pron: you, your, youre, you're, yours, y'all
third-pron: he, him
emotional: feel, hurt, lonely, love
time: hour, hours, late, min, minute, minutes, months, schedule, seconds, time, years
male_curse: fucking, fuck, jesus, cunt, fucker
female_curse: god, bloody, pig, hell, bitch, pissed, assed, shit

These categories were defined after the fashion of the LIWC categories discussed in Gill (2009), with the aim of capturing general usage patterns, and motivated by results from corpus linguistics and discourse analysis. Words from a list of 800 common prepositions, conjunctions, adjectives, and adverbs were also included as singleton surface-form categories. Determiners were deliberately excluded (both from this list and from the POS categories), since their presence or absence in a noun phrase is one of the primary variations the stretchy gaps are intended to smooth over.

A sequence of tokens in a document matches a pattern if there is some expansion of the pattern in which each token corresponds, in order, to one of the pattern's categories. A given instance of GAP+ matches between zero and six tokens, provided the total number of Gap tokens in the pattern does not exceed six. By way of example, here are two patterns, each followed by two strings that match it; tokens matched as Gaps are shown in parentheses.

[cc] (GAP+) [adj] [adj]
“and (some clients were) kinda popular...”
“from (our) own general election...”

for (GAP+) [third-pron] (GAP+) [end] [first-pron]
“ready for () them (to end) . I am...”
“for (murdering) his (prose) . i want…”

Although the matched sequences vary in length and content, stretchy patterns preserve information about the proximity and ordering of the key (non-Gap) words, allowing a single pattern to match a wide array of sequences in a way that traditional word-class n-grams would not. The stretchy-pattern formalism strictly subsumes Tsur's (2010) approach in representational power: the same patterns can be generated by creating a singleton surface-form category for each word in Tsur's high-frequency-word set and a [CW] category containing the words of Tsur's content-word set, in addition to the domain-specific product/manufacturer categories Tsur employed. A sketch of a matcher implementing these rules follows.
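Below is a minimal recursive matcher for these rules, an illustrative sketch rather than the authors' code. It checks whether a candidate token span matches a whole pattern; in practice every span inside a sliding window would be tested. Tokens are the category-label sets produced by categories_of above.

<pre>
# Illustrative stretchy-pattern matcher (not the authors' code).
# `pattern` is a list such as ["cc", "GAP+", "adj", "adj"]; `span` is a
# list with one category-label set per token (see categories_of above).
def matches(pattern, span, max_total_gaps=6):
    def rec(pi, ti, gaps_left):
        if pi == len(pattern):                      # pattern consumed:
            return ti == len(span)                  # span must be, too
        if pattern[pi] == "GAP+":
            for k in range(gaps_left + 1):          # absorb 0..gaps_left tokens
                if ti + k > len(span):
                    break
                if k > 0 and "GAP" not in span[ti + k - 1]:
                    break                           # next token is not Gap-eligible
                if rec(pi + 1, ti + k, gaps_left - k):
                    return True
            return False
        if ti < len(span) and pattern[pi] in span[ti]:
            return rec(pi + 1, ti + 1, gaps_left)   # category instance matched
        return False
    return rec(0, 0, max_total_gaps)

# "and (some clients were) kinda popular" vs [cc] (GAP+) [adj] [adj]
span = [{"cc"}, {"GAP"}, {"GAP"}, {"GAP"}, {"adj"}, {"adj"}]
assert matches(["cc", "GAP+", "adj", "adj"], span)
</pre>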


Extracting the Patterns

Patterns are extracted from the training set by running a sliding window over the token stream and generating every allowable category/GAP+ sequence within the window. Because this produces an exponentially large set of candidates, the patterns are then filtered by accuracy and coverage: in the experiments, a pattern was kept only if it reached a minimum of 60% per-feature precision and at least 15 document-level hits. A sketch of this extract-then-filter loop is given below.
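The following sketch shows the filtering step under stated assumptions: docs is a list of (token category sets, class label) pairs, the window size of 8 is a placeholder, matches is the matcher sketched above, and per-feature precision is interpreted as the majority-class share of the documents a pattern hits (an assumption; the candidate enumeration itself is omitted). None of these names come from the paper.

<pre>
# Sketch of the extract-then-filter loop (illustrative; names and the
# precision interpretation are assumptions, not the authors' API).
from collections import Counter

def occurs_in(pattern, token_cats, window=8):
    """True if any span inside a sliding window matches the pattern."""
    return any(matches(pattern, token_cats[i:j])          # matcher from above
               for i in range(len(token_cats))
               for j in range(i + 1, min(i + window, len(token_cats)) + 1))

def filter_patterns(candidates, docs, min_precision=0.60, min_hits=15):
    kept = []
    for pat in candidates:
        hits = Counter(label for token_cats, label in docs
                       if occurs_in(pat, token_cats))
        total = sum(hits.values())
        if total < min_hits:                              # coverage threshold
            continue
        if hits.most_common(1)[0][1] / total >= min_precision:
            kept.append(pat)                              # precision threshold
    return kept
</pre>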

Experiments

The authors performed two experiments on the datasets described above: gender-based classification of linguistic style variation, and a test of how well stretchy patterns generalize as features across domains. In both experiments, unigram, unigram+bigram, and POS-tag features served as the three baselines against which stretchy patterns were compared. For the second experiment, as mentioned above, leave-one-out cross-validation was run over the occupation categories, training on data from 9 categories and testing on the held-out one in each iteration. The results are shown in the tables below (first the classification experiment, then the domain-generalization experiment); stretchy patterns do significantly better than the best baseline.
Stretc res 1.jpg (results of the gender classification experiment)
Stretc res 2.jpg (results of the domain-generalization experiment)
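For completeness, here is one way the leave-one-occupation-out evaluation could be set up, assuming precomputed feature vectors X, gender labels y, and one occupation id per post in occ (all hypothetical names); the paper does not specify the classifier, so a logistic-regression stand-in is used.

<pre>
# Sketch of leave-one-occupation-out evaluation (illustrative; the
# paper's learner and feature pipeline may differ).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def loo_by_occupation(X, y, occ):
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=occ):
        clf = LogisticRegression(max_iter=1000)           # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))                         # mean held-out accuracy
</pre>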