Modeling of Stylistic Variation in Social Media with Stretchy Patterns


Citation

Philip Gianfortoni, David Adamson, Carolyn P. Rosé, "Modeling of Stylistic Variation in Social Media with Stretchy Patterns", Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, 2011

Online version

Click here to download

Brief Summary

This paper presents a new technique for automatic pattern extraction, with the goal of finding better features for linguistic style modeling, especially when training data is scarce. The authors propose "stretchy patterns": information extraction patterns that allow gaps of variable length between their constituents, giving each pattern broader coverage in a style modeling task. They show that these patterns are more effective than the usual unigram, bigram, and POS-based features for gender-based linguistic style classification, and they further claim that models using such patterns as features generalize well across domains.

Dataset

The dataset was drawn from the Blog Authorship Corpus, in which each post is labeled with metadata such as the author's gender and occupation. To show that stretchy patterns are more effective than the usual features, 150 posts by male authors and 150 posts by female authors were sampled for each of the 10 most common occupations in the corpus, for a total of 3,000 posts. For the domain-generality experiments, a similar set of 100 posts by male authors and 100 by female authors was taken for each of the 10 occupation categories, and leave-one-occupation-out cross-validation was performed (training on nine occupation domains and testing on the held-out one).
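As a concrete illustration of this evaluation protocol, here is a minimal sketch of leave-one-occupation-out cross-validation using scikit-learn. The feature extraction and the corpus fields (posts, genders, occupations) are placeholder assumptions, not the authors' actual pipeline.

<pre>
# Sketch: leave-one-occupation-out cross-validation for gender classification.
# Assumes `posts`, `genders`, `occupations` are parallel lists loaded from the
# Blog Authorship Corpus (hypothetical loading step, not shown).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline

def loo_domain_cv(posts, genders, occupations):
    """Train on 9 occupation domains, test on the held-out one."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    scores = []
    for train, test in LeaveOneGroupOut().split(posts, genders, groups=occupations):
        model.fit([posts[i] for i in train], [genders[i] for i in train])
        scores.append(model.score([posts[i] for i in test],
                                  [genders[i] for i in test]))
    return sum(scores) / len(scores)  # mean cross-domain accuracy
</pre>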

Stretchy Patterns

The authors first present the notions of categories and patterns. A document is an ordered list of tokens. Each token is composed of a surface-form lexeme and any additional syntactic or semantic information about the word at that position (here simply the POS tag, though other layers such as named entities could be included). Any of the available forms of a token is called a type. A category is a set of word-types, and each type must belong to at least one category. Every category has a label, by which it is referred to within patterns. Gap is a special category containing all types that are not part of any other category; types belonging to a defined category may also be explicitly added to the Gap category.

A stretchy pattern is defined as a sequence of categories that must not begin or end with a Gap category. Any number of adjacent Gap instances in a pattern is written as "GAP+", and every other category instance is written as its label. By convention, the label of a singleton category is the name of the type it contains (so "writes" is the label of a category containing only the surface form "writes", and "VBZ" is the label of a category containing only the POS tag "VBZ"). Following Tsur (2010), a pattern is restricted to no more than six Gap tokens and six non-Gap category instances; for Gaps the restriction applies to the number of underlying tokens, not to the collapsed GAP+ form. (Both the collapsing of adjacent gaps into GAP+ and the zero-to-six gap limit are extractor parameters; these settings were used for all of the paper's experiments.)

A sequence of tokens in a document matches a pattern if there is some expansion in which each token corresponds, in order, to the pattern's categories. A given instance of GAP+ matches between zero and six tokens, provided the total number of Gap tokens in the pattern does not exceed six. Two example patterns follow, each with two strings that match it; tokens matched as Gaps are shown in parentheses:

[cc] (GAP+) [adj] [adj]
  "and (some clients were) kinda popular..."
  "from (our) own general election..."

for (GAP+) [third-pron] (GAP+) [end] [first-pron]
  "ready for () them (to end) . I am..."
  "for (murdering) his (prose) . i want..."

Although the matched sequences vary in length and content, stretchy patterns preserve information about the proximity and ordering of particular words and categories. They focus on the relationship between the key (non-Gap) words, and allow a wide array of sequences to be matched by a single pattern in a way that traditional word-class n-grams would not. The stretchy-pattern formalism strictly subsumes Tsur's approach in representational power: the patterns described in Tsur (2010) can be generated by creating a singleton surface-form category for each word in Tsur's HFW set and a category [CW] containing all words in Tsur's CW set, in addition to the domain-specific product/manufacturer categories Tsur employed.

Word Categories

With the aim of capturing general usage patterns, and motivated by results from corpus linguistics and discourse analysis, a handful of token categories were defined, after the fashion of the LIWC categories discussed in Gill (2009). Tokens belonging to a category may be replaced with the category label as patterns are extracted from each document; since a token may belong to multiple categories, the same token sequence can generate, and therefore match, multiple patterns. Words from a list of 800 common prepositions, conjunctions, adjectives, and adverbs were included as singleton surface-form categories. Determiners in particular are absent from this list (and from the POS categories below), since the presence or absence of a determiner in a noun phrase is one of the primary variations the stretchy gaps are intended to smooth over. A handful of POS categories were also selected.

Table 1. Word Categories
  adj: JJ, JJR, JJS
  cc: CC, IN
  md: MD
  end: <period>, <comma>, <question>, <exclamation>
  first-pron: I, me, my, mine, im, I'm
  second-pron: you, your, youre, you're, yours, y'all
  third-pron: he, him
  emotional: feel, hurt, lonely, love
  time: hour, hours, late, min, minute, minutes, months, schedule, seconds, time, years
  male_curse: fucking, fuck, jesus, cunt, fucker
  female_curse: god, bloody, pig, hell, bitch, pissed, assed, shit
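To make the matching semantics concrete, below is a minimal sketch of a stretchy-pattern matcher. The gap budget follows the paper's zero-to-six setting, but the category inventory is a toy subset of Table 1 and the function names and data layout are illustrative, not the authors' implementation.

<pre>
# Sketch of stretchy-pattern matching with a shared budget of six Gap tokens.
# CATEGORIES is a toy stand-in for the paper's word categories (Table 1).
CATEGORIES = {
    "adj": {"JJ", "JJR", "JJS"},
    "cc": {"CC", "IN"},
    "third-pron": {"he", "him"},
}

def in_category(token, label):
    """A token is a (surface, pos) pair; either form may match."""
    surface, pos = token
    if label in CATEGORIES:
        return surface in CATEGORIES[label] or pos in CATEGORIES[label]
    return label in (surface, pos)  # singleton category named by its type

def matches(pattern, tokens, gap_budget=6):
    """True if `tokens` matches `pattern` exactly, start to end.

    `pattern` is a list of category labels, with "GAP+" standing for a
    run of zero or more Gap tokens (at most `gap_budget` in total).
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "GAP+":
        # Try consuming 0..gap_budget tokens as gaps.
        for k in range(min(gap_budget, len(tokens)) + 1):
            if matches(rest, tokens[k:], gap_budget - k):
                return True
        return False
    return (bool(tokens) and in_category(tokens[0], head)
            and matches(rest, tokens[1:], gap_budget))

# Example: [cc] GAP+ [adj] against "and (some clients were) popular"
toks = [("and", "CC"), ("some", "DT"), ("clients", "NNS"),
        ("were", "VBD"), ("popular", "JJ")]
print(matches(["cc", "GAP+", "adj"], toks))  # True
</pre>

During extraction, a full-sequence matcher like this would be applied over the token subsequences of each document; patterns that occur often enough then become binary features for the classifier.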

Dataset

Data
The dialog act tagset was taken from the Meeting Recorder Dialog Act (MRDA) tagset created by Dhillon et al. [1]. The training data was unlabeled, whereas the test data was labeled by two human annotators. The training data for emails was a set of 23,957 emails from the W3C email corpus, while that for discussion fora was a set of 25,000 forum threads from the travel advice site TravelAdvisor. The test data for emails was a set of 40 email threads from the BC3 corpus (Ulrich et al.) [2], while that for discussion fora was a set of 200 forum threads. The dialog act categories labeled by the human annotators had a similar distribution in the email set and the discussion-thread set, as shown in the figure below. Agreement between the two human annotators was 0.79 for the email dataset and 0.73 for the forum dataset.
[Figure: distribution of dialog act categories in the email and forum test sets]
Data Pre-processing
From the email and forum data, fragment quotation graphs (FQGs) were created. An FQG represents a conversation as a graph whose nodes are quoted and unquoted text fragments of the posts, with edges linking each fragment to the fragments it quotes or replies to; this recovers a finer-grained conversation structure than the raw thread order.
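As a rough illustration (not the authors' algorithm), the sketch below builds a fragment quotation graph from posts whose quoted lines are marked with ">" prefixes. The fragment segmentation and edge inference are deliberately simplified assumptions.

<pre>
# Toy fragment-quotation-graph builder: nodes are text fragments, and an
# edge fragment -> quoted fragment is added when a post's fresh text
# accompanies a quote. Real FQG construction matches quoted spans across
# posts; this sketch only pairs each post's new text with what it quotes.
from collections import defaultdict

def build_fqg(posts):
    """posts: list of message bodies; '>'-prefixed lines are quoted."""
    edges = defaultdict(set)
    for body in posts:
        quoted, fresh = [], []
        for line in body.splitlines():
            (quoted if line.startswith(">") else fresh).append(
                line.lstrip("> ").strip())
        src = " ".join(fresh).strip()
        for q in quoted:
            if src and q:
                edges[src].add(q)  # fresh fragment replies to quoted fragment
    return edges
</pre>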

Graph-Theoretic Framework

The FQG was then transformed into a similarity graph, in which the sentences of the emails or forum posts form the nodes, and nodes representing sentences in adjacent posts (as inferred from the FQG) are joined by weighted edges, where each weight is a measure of similarity between the two sentences. The nodes were then clustered under the assumption that sentences within the same cluster represent the same dialog act. The clustering problem was modeled as a normalized-cut (N-mincut) graph clustering problem with the cut criterion

<math>\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}</math>

where cut(A, B) is the total connection from nodes in partition A to nodes in partition B, assoc(A, V) is the total connection from nodes in A to all other nodes in the graph, and assoc(B, V) is defined similarly. The authors experimented with a number of sentence-similarity measures (a sketch of the cut criterion and the evaluation metric follows this list):

- A bag-of-words (BOW) measure: the cosine similarity between the two sentences' vectors of TF-IDF word scores.
- A variant of the BOW measure (BOW-M) in which nouns are masked, to prevent clustering by topic rather than by dialog act.
- A word-subsequence kernel (WSK) measure, which maps the sequence of words (POS tags, in this paper's experiments) to a higher-dimensional space and computes similarity there.
- An extended WSK, in which syntactic/semantic features of the words are used alongside the words (or rather their POS tags).
- A dependency-similarity measure, which scores a pair of sentences by the number of co-occurring Basic Elements (BEs) in their dependency parse trees, where a BE is a (head, modifier, relation) triple.
- A syntactic-tree similarity measure using the tree kernel of Collins and Duffy [3].
- A linear combination of all of these measures.

As a baseline, every sentence was assigned the dialog act "Statement", the most frequent act in the annotated test set. For evaluation, a 1-to-1 metric was used: the clusters in the annotated test set are aligned with the output clusters so as to maximize the total pairwise overlap, and the mean percentage overlap across clusters is reported as the final score. As the table below shows, none of the methods surpassed the baseline, and contrary to expectation the BOW-M measure yielded worse results than the plain BOW measure.
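The following sketch shows, under simplified assumptions, how the normalized-cut value of a two-way partition and the 1-to-1 evaluation score can be computed. The weight matrix, cluster labelings, and function names are illustrative; the 1-to-1 alignment uses the Hungarian algorithm via SciPy, which is one standard way to realize the maximal-overlap matching described above.

<pre>
import numpy as np
from scipy.optimize import linear_sum_assignment

def ncut_value(W, in_A):
    """Normalized cut of a partition (A, B): W is a symmetric weight
    matrix, in_A a boolean mask selecting partition A."""
    A, B = in_A, ~in_A
    cut = W[A][:, B].sum()  # total connection between A and B
    assoc_A, assoc_B = W[A].sum(), W[B].sum()  # A (resp. B) to all nodes
    return cut / assoc_A + cut / assoc_B

def one_to_one(gold, pred):
    """Percentage overlap under the best 1-to-1 cluster alignment."""
    gold_ids, pred_ids = np.unique(gold), np.unique(pred)
    # overlap[i, j]: sentences in gold cluster i and predicted cluster j
    overlap = np.array([[np.sum((gold == g) & (pred == p)) for p in pred_ids]
                        for g in gold_ids])
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    return 100.0 * overlap[rows, cols].sum() / len(gold)

# Toy usage: 6 sentences, 2 clusters.
gold = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 0])
print(one_to_one(gold, pred))  # ~83.3: best alignment covers 5 of 6 sentences
</pre>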
[Figure: 1-to-1 scores of the graph-theoretic models for each similarity measure, compared against the baseline]

Probabilistic Conversation Models

The authors suspected that the graph-theoretic framework performed poorly because it does not model the sequential structure of conversations, nor other informative features such as the speaker, or the relative position and length of a sentence. They therefore modeled dialog acts with an HMM in which dialog acts are hidden states emitting observable sentences, as shown in the figure below. A conversation is a sequence of hidden dialog acts z_1, ..., z_T; each z_t emits an observable sentence x_t, represented by its bag of unigrams w_t (shown in a plate in the figure), its speaker s_t, its relative position p_t (the position of the sentence within its post, normalized by the total number of sentences in the post), and its length l_t.
[Figure: HMM conversation model, with hidden dialog acts emitting each sentence's unigrams, speaker, relative position, and length]
A symmetric Dirichlet prior was placed on each of the six multinomials (the distributions over initial states, transitions, unigrams, speakers, positions, and lengths). The MAP estimate was then computed with the Baum-Welch (EM) algorithm, using forward-backward message passing in the E-step. Specifically, given the n-th sequence <math>x_{1:T_n}</math>, forward-backward computes the posterior state marginals

<math>\gamma_t(j) = p(z_t = j \mid x_{1:T_n}, \theta) \propto \alpha_t(j)\,\beta_t(j)</math>

where <math>\alpha</math> and <math>\beta</math> are the forward and backward messages, and the local evidence factorizes over the sentence features:

<math>p(x_t \mid z_t = j, \theta) = p(s_t \mid z_t = j)\, p(p_t \mid z_t = j)\, p(l_t \mid z_t = j) \prod_k p(w_{t,k} \mid z_t = j)</math>
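Below is a compact sketch of the forward-backward computation described above, assuming the log local evidence has already been assembled as the sum of the feature log-probabilities in the factorized emission. Array shapes and names are illustrative, not the authors' code.

<pre>
import numpy as np

def forward_backward(pi, A, log_emit):
    """Posterior state marginals gamma[t, j] for one conversation.

    pi: (K,) initial state distribution; A: (K, K) transition matrix;
    log_emit: (T, K) log local evidence log p(x_t | z_t = j), i.e. the
    summed log-probabilities of the sentence's unigrams, speaker,
    relative position, and length.
    """
    T, K = log_emit.shape
    # Rescale per time step for numerical stability.
    emit = np.exp(log_emit - log_emit.max(axis=1, keepdims=True))
    alpha = np.zeros((T, K)); beta = np.ones((T, K))
    alpha[0] = pi * emit[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):                  # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):         # backward pass
        beta[t] = A @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # p(z_t = j | x_{1:T})
</pre>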

HMM Plus Mixture Model

Based on earlier work by Ritter et al. [4], the authors then modeled the HMM emissions as a mixture of multinomials. The new model is shown in the figure below.
[Figure: HMM+Mix conversation model, with mixture-of-multinomials emissions]
In the final experiments, the number of mixture components was set to 3, after experimenting with values from 1 to 5.
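For concreteness, here is one way to compute the log local evidence when the unigram emission is a mixture of multinomials per state. The number of components M = 3 follows the paper, while the array names and layout are assumptions.

<pre>
import numpy as np
from scipy.special import logsumexp

def log_mix_emit(word_counts, log_mix_weights, log_word_probs):
    """log p(w_t | z_t = j) under a mixture of multinomials per state.

    word_counts: (V,) unigram counts of the sentence;
    log_mix_weights: (K, M) log mixture weights per state;
    log_word_probs: (K, M, V) log word probabilities per state, component.
    Returns a (K,) vector of log emission scores (up to the
    count-dependent multinomial coefficient, constant across states).
    """
    # Log-likelihood of the counts under each (state, component) multinomial.
    comp_ll = log_word_probs @ word_counts                 # (K, M)
    return logsumexp(log_mix_weights + comp_ll, axis=-1)   # sum out components
</pre>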

Results

The results are shown in the table below: 1-to-1 overlap scores for the baseline (all "Statement"), HMM, and HMM+Mix models on the email and discussion-forum posts. The experiments were run both with the temporal sequence of the posts and with their sequence in the FQG. The HMM+Mix model performs best and beats the baseline by a significant margin.

[Table: 1-to-1 scores for the baseline, HMM, and HMM+Mix models on email and forum data, under temporal and FQG orderings]

References

[1] R. Dhillon, S. Bhagat, H. Carvey, and E. Shriberg. Meeting Recorder Project: Dialog Act Labeling Guide. ICSI Technical Report, 2004.
[2] J. Ulrich, G. Murray, and G. Carenini. A Publicly Available Annotated Corpus for Supervised Email Summarization. In EMAIL'08 Workshop, AAAI, 2008.
[3] M. Collins and N. Duffy. Convolution Kernels for Natural Language. In NIPS 2001, pages 625-632, Vancouver, Canada, 2001.
[4] A. Ritter, C. Cherry, and B. Dolan. Unsupervised Modeling of Twitter Conversations. In HLT-NAACL 2010, Los Angeles, California, 2010. ACL.