Modeling of Stylistic Variation in Social Media with Stretchy Patterns
Citation
Philip Gianfortoni, David Adamson, Carolyn P. Rosé, "Modeling of Stylistic Variation in Social Media with Stretchy Patterns", Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, 2011
Online version
Brief Summary
This paper presents a new technique for automatic pattern extraction, aimed at finding new features for linguistic style modeling, especially when training data is scarce. The authors propose "stretchy patterns": information extraction patterns that allow gaps of arbitrary length between their constituents, which gives them broader coverage in a style modeling task. The authors show that these patterns are more effective than the usual unigram, bigram, and POS-based features for gender-based linguistic style classification. They also claim that features built from such patterns generalize well across domains.
Dataset
The dataset is a set of blogs from the Blog Authorship Corpus. Each post is labeled with metadata such as the gender and occupation of the author. To show that stretchy patterns are more effective than the usual features, 150 posts by male authors and 150 posts by female authors were taken for each of the 10 most common occupations in the corpus, for a total of 3000 posts. To show the domain generality of stretchy patterns, a similar set was built with 100 posts by male authors and 100 by female authors for each of these 10 occupations (2000 posts in total), and leave-one-out cross-validation (LOO CV) was performed over the occupations.
Stretchy Patterns
The authors first introduce the notion of categories for stretchy patterns. Each token or word in the corpus is assumed to be composed of a surface-form lexeme plus any additional syntactic or semantic information about the word. Any of the available forms of a token is called a type. A category is a set of word types, and each type must belong to at least one category. Gap is a special category containing all types that are not part of any other category. Types belonging to a defined category may also be explicitly added to the Gap category.
A stretchy pattern is then defined as a sequence of categories that must not begin or end with a Gap category. Any number of adjacent Gap instances in a pattern is represented by the string “GAP+”, and every other category instance is represented by its label. The table below lists the word categories that the authors identified.
As an example, consider the following stretchy pattern, which comprises some of the identified categories along with the GAP category, followed by two text spans that it matches (the gap is shown in parentheses).
[cc] (GAP+) [adj] [adj]
“and (some clients were) kinda popular...”
“from (our) own general election...”
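The matching rule above can be illustrated with a minimal Python sketch. It assumes a tiny hypothetical lexicon (the paper's actual category table is not reproduced here) and the simplification that each word belongs to exactly one category; the names LEXICON, to_category_sequence, and matches are illustrative and do not come from the paper.

```python
from typing import Dict, List

GAP = "GAP+"

# Hypothetical toy lexicon; the real category inventory is defined in the paper.
LEXICON: Dict[str, str] = {
    "and": "cc",
    "but": "cc",
    "kinda": "adj",
    "popular": "adj",
}


def to_category_sequence(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Map tokens to category labels, collapsing runs of uncategorized words into one GAP+."""
    seq: List[str] = []
    for tok in tokens:
        label = lexicon.get(tok.lower(), GAP)
        if label == GAP and seq and seq[-1] == GAP:
            continue  # adjacent gap words share a single GAP+ slot
        seq.append(label)
    return seq


def matches(pattern: List[str], tokens: List[str], lexicon: Dict[str, str]) -> bool:
    """Return True if the pattern occurs contiguously in the collapsed category sequence."""
    seq = to_category_sequence(tokens, lexicon)
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


pattern = ["cc", GAP, "adj", "adj"]
print(matches(pattern, "and some clients were kinda popular".split(), LEXICON))
print(matches(pattern, "and kinda popular".split(), LEXICON))
```

Because the three gap words "some clients were" collapse into a single GAP+ slot, the first sentence matches the pattern regardless of how long the gap is, while the second sentence has no gap and therefore does not match.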
Extracting the Patterns
Patterns are extracted from the training set by sliding a window over the token stream and generating all allowable category-gap sequences within the window. Because this produces an exponential number of candidate patterns, the authors first filter the list by precision and coverage. In the experiments, the thresholds were set to a minimum of 60% per-feature precision and at least 15 document-level hits.
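The extract-then-filter step can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes documents have already been converted to category sequences with adjacent gaps collapsed into "GAP+", it enumerates only contiguous sub-sequences up to a hypothetical max_len, and it interprets per-feature precision as the share of a pattern's document hits that fall in its majority class. The function names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

GAP = "GAP+"
Pattern = Tuple[str, ...]


def candidate_patterns(seq: List[str], max_len: int = 4) -> Set[Pattern]:
    """Enumerate sub-sequences of 2..max_len slots that neither begin nor end with a gap."""
    found: Set[Pattern] = set()
    for start in range(len(seq)):
        if seq[start] == GAP:
            continue
        for end in range(start + 2, min(start + max_len, len(seq)) + 1):
            if seq[end - 1] != GAP:
                found.add(tuple(seq[start:end]))
    return found


def select_patterns(
    docs: Iterable[Tuple[List[str], str]],
    min_precision: float = 0.6,  # 60% per-feature precision, as in the summary
    min_hits: int = 15,          # at least 15 document-level hits
    max_len: int = 4,            # assumed window size, not taken from the paper
) -> Dict[Pattern, Tuple[str, float, int]]:
    """Keep patterns that pass both thresholds; value is (majority label, precision, hits)."""
    hits: Dict[Pattern, Counter] = defaultdict(Counter)
    for seq, label in docs:
        # A set per document, so each pattern counts at most once per document.
        for pat in candidate_patterns(seq, max_len):
            hits[pat][label] += 1
    selected: Dict[Pattern, Tuple[str, float, int]] = {}
    for pat, by_label in hits.items():
        total = sum(by_label.values())
        label, count = by_label.most_common(1)[0]
        if total >= min_hits and count / total >= min_precision:
            selected[pat] = (label, count / total, total)
    return selected
```

Counting hits per document rather than per occurrence keeps a single long post from dominating a pattern's statistics, which is one plausible reading of "document-level hits".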
Experiments
The authors ran two experiments on the dataset described above: gender-based linguistic style classification, and a test of how well stretchy-pattern features generalize across domains. In both experiments, stretchy patterns were compared against three baseline feature sets: unigrams, unigrams plus bigrams, and POS tags. For the second experiment, leave-one-out cross-validation was performed over the occupation categories, training on data from 9 categories and testing on the held-out one in each iteration. The results are shown in the tables below (first for the classification experiment, then for the domain-generalization experiment). According to the reported results, stretchy patterns do significantly better than the best baseline.
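The leave-one-category-out protocol of the second experiment can be sketched as a small split generator in Python. The function name and the placeholder occupation labels are hypothetical; only the 10 categories and the 200 posts per category come from the dataset description.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_domain_out(
    domains: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out domain, train indices, test indices), one fold per distinct domain."""
    for held_out in sorted(set(domains)):
        train = [i for i, d in enumerate(domains) if d != held_out]
        test = [i for i, d in enumerate(domains) if d == held_out]
        yield held_out, train, test


# Placeholder labels: 10 occupations with 200 posts each (100 male, 100 female).
domains = [f"occupation_{k}" for k in range(10) for _ in range(200)]
for held_out, train, test in leave_one_domain_out(domains):
    print(held_out, len(train), len(test))
```

Each fold trains on the 1800 posts from nine occupations and tests on the 200 posts from the remaining one, so no post from the test occupation is seen during training.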