Eisenstein et al ACL 2011. Discovering Sociolinguistic Associations with Structured Sparsity

Revision as of 21:04, 3 October 2012

Citation

Jacob Eisenstein, Noah A. Smith and Eric P. Xing. Discovering Sociolinguistic Associations with Structured Sparsity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland.

Online Version

Online Pdf

Summary

This paper studies the influence of demography on language, that is, the influence of non-linguistic factors on language usage. In other words, it tries to identify lexical variation with respect to certain demographic attributes (race or ethnicity, socioeconomic status, language spoken, etc.). Modelling sociolinguistic associations is a complex problem because of the large number of possible interactions involved. Using multi-output regression with structured sparsity, the method identifies a small subset of words that are most influenced by demographics, and also discovers sets of demographic attributes that influence variation in lexical items.

This problem can be (and has been) studied in quantitative sociolinguistics through carefully designed experiments (for example, the relation between income and the "dropped-r" New York accent). But all of these approaches require both the linguistic attributes and the demographic attributes to be identified beforehand. This paper presents a method to acquire such patterns from raw data.

Data

This paper uses the same geotagged Twitter dataset that the authors used in one of their previous works. The vocabulary was limited to the 5418 terms that were used by at least 40 authors. No stoplists were applied, since the use of standard or non-standard orthography (for example, the phrase 'went 2' instead of 'went to') conveys important information about the author. As in another of their previous works, aggregate demographic statistics for the data were obtained by mapping geolocations (obtained from the tweets) to publicly available data from the U.S. Census ZIP Code Tabulation Areas (ZCTA). The demographic attributes considered are listed below:


Demography.jpg
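
The vocabulary restriction described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_vocabulary`, the `author_tokens` mapping and the toy data are hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict


def build_vocabulary(author_tokens, min_authors=40):
    """Keep the terms used by at least `min_authors` distinct authors.

    `author_tokens` is a hypothetical mapping from an author id to that
    author's list of tokens. No stoplist is applied, so non-standard
    tokens such as '2' survive if enough authors use them.
    """
    authors_per_term = defaultdict(set)
    for author, tokens in author_tokens.items():
        for token in tokens:
            authors_per_term[token].add(author)
    return sorted(
        term
        for term, authors in authors_per_term.items()
        if len(authors) >= min_authors
    )


# Toy usage with a lowered threshold (the page states a threshold of 40).
toy_authors = {
    "a": ["went", "2", "went"],
    "b": ["went", "to"],
    "c": ["2", "lol"],
}
toy_vocabulary = build_vocabulary(toy_authors, min_authors=2)
```

Counting distinct authors rather than raw occurrences means a term repeated many times by a single author does not enter the vocabulary on that basis alone.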

Method

As noted earlier, the paper views the sociolinguistic problem as a multi-output regression problem between demographic attributes and lexical frequencies. Two variations of the regression are performed. First, the lexical frequencies are used as input to predict demographic attributes, which identifies a compact set of words that are strongly associated with author demographics. Second, demographic attributes are conjoined into features, which are used to predict lexical frequencies.

The following section describes the formal model for multi-output regression with structured sparsity.

Formal Model

The following linear equation is considered

$Y = XB + \epsilon$ where,

  • $Y$ is the dependent variable matrix, with dimensions $N \times T$, where $N$ is the number of samples (training data) and $T$ is the number of output dimensions (or tasks);
  • $X$ is the independent variable matrix, with dimensions $N \times P$, where $P$ is the number of input dimensions (or predictors, which change depending on which of the two regression problems is being solved);
  • $B$ is the matrix of regression coefficients, with dimensions $P \times T$;
  • $\epsilon$ is an $N \times T$ matrix in which each element is noise from a zero-mean Gaussian distribution.
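
To make the matrix shapes concrete, here is a minimal NumPy sketch that simulates data from the linear model above. The function name, the noise level and the toy sizes are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def simulate_linear_model(n_samples, n_predictors, n_tasks, noise_std=0.1, seed=0):
    """Draw X, B and Gaussian noise, then form Y = X B + noise.

    Shapes: X is (samples x predictors), B is (predictors x tasks),
    and both the noise and Y are (samples x tasks).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_predictors))
    B = rng.normal(size=(n_predictors, n_tasks))
    noise = rng.normal(scale=noise_std, size=(n_samples, n_tasks))
    Y = X @ B + noise
    return X, B, Y


# Toy sizes chosen only for illustration.
X_toy, B_toy, Y_toy = simulate_linear_model(n_samples=50, n_predictors=8, n_tasks=3)
```

Each row of the coefficient matrix collects the coefficients of one predictor across all outputs, which is why zeroing a whole row removes that predictor from every task at once.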

The next problem is to solve the unconstrained optimization problem,

$$\min_B \; \|Y - XB\|_F^2 + \lambda R(B)$$

where $\|\cdot\|_F^2$ indicates the squared Frobenius norm, $\lambda$ is the regularization weight, and the function $R(B)$ defines a norm on the regression coefficients $B$. With either the ridge regression ($\ell_2$) norm or the lasso regression ($\ell_1$) norm, it is possible to decompose the multi-output regression problem and treat each output dimension separately. In practice, however, there is a lot of correlation between the lexical terms and the demographic features (for example, many words are associated with the Spanish-speaking demographic group), and the main goal is to select a small set of predictors (words or demographic features) that can be considered representative.

One way to achieve this is to use a regularizer that drives entire rows of the coefficient matrix to zero. This is called structured sparsity. It is not achieved by the lasso's $\ell_1$ norm, which gives only element-wise sparsity (i.e. many entries of the coefficient matrix are driven to zero, but a predictor may still have a non-zero coefficient for some output dimension). To drive entire rows of $B$ to zero, the paper considers the $\ell_{1,\infty}$ norm, which is the sum of $\ell_\infty$ norms across output dimensions. This norm, which corresponds to a multi-output lasso regression, has the desired property of driving entire rows of $B$ to zero.
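
The difference between the regularizers can be illustrated with a small NumPy sketch. It assumes that rows of the coefficient matrix index predictors and columns index outputs, and that the composite norm takes the largest absolute coefficient in each row and sums over rows; this convention is chosen here because it matches the stated goal of zeroing entire rows. The function names and the toy matrices are hypothetical, and the sketch only evaluates the penalties, it does not solve the optimization problem.

```python
import numpy as np


def l1_norm(B):
    """Lasso penalty: sum of absolute values of all entries."""
    return float(np.abs(B).sum())


def l2_norm(B):
    """Ridge-style penalty: square root of the sum of squared entries."""
    return float(np.sqrt((B ** 2).sum()))


def l1_inf_norm(B):
    """Composite penalty: largest absolute entry of each row, summed over rows."""
    return float(np.abs(B).max(axis=1).sum())


def objective(Y, X, B, lam, regularizer):
    """Squared Frobenius loss plus a weighted penalty on B."""
    residual = Y - X @ B
    return float((residual ** 2).sum() + lam * regularizer(B))


# Two coefficient matrices with 3 predictors (rows) and 2 outputs (columns).
# Both have the same element-wise penalty, but B_one_row uses one predictor.
B_scattered = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B_one_row = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
```

In this toy case the element-wise penalty cannot tell the two matrices apart, while the row-wise penalty is smaller for the matrix that concentrates its non-zero coefficients in a single predictor, which is the behaviour that encourages whole rows to vanish.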

Application of the model

  • For predicting demographic features from words: the multi-output regression method can be used to select a small subset of vocabulary items that are especially indicative of demographic and geographic differences. The predictors $X$ are set to the term frequencies, with one column for each word type ($P$ columns) and one row for each author in the dataset ($N$ rows).
  • For predicting lexical frequencies from demographic features: for this regression problem, the direction of inference is reversed, so the conjoined demographic features serve as the predictors and the lexical frequencies as the outputs.
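
As a rough illustration of the first setup, the sketch below builds an author-by-term frequency matrix with NumPy. The helper name `term_frequency_matrix`, the toy data, and the choice to normalize counts to relative frequencies per author are assumptions made for this example; the page does not specify how the frequencies are scaled.

```python
import numpy as np


def term_frequency_matrix(author_tokens, vocabulary):
    """Return (authors, matrix) with one row per author and one column per term.

    Counts are normalized to relative frequencies per author, which is an
    assumption of this sketch. Tokens outside the vocabulary are ignored.
    """
    column = {term: j for j, term in enumerate(vocabulary)}
    authors = sorted(author_tokens)
    matrix = np.zeros((len(authors), len(vocabulary)))
    for i, author in enumerate(authors):
        for token in author_tokens[author]:
            j = column.get(token)
            if j is not None:
                matrix[i, j] += 1.0
        total = matrix[i].sum()
        if total > 0:
            matrix[i] /= total
    return authors, matrix


# Toy usage: two hypothetical authors and a three-term vocabulary.
toy_tokens = {"a": ["went", "2", "went"], "b": ["to", "unknown"]}
toy_authors, term_freq = term_frequency_matrix(toy_tokens, ["2", "to", "went"])
```

For the first regression this matrix plays the role of the predictors and the demographic attributes are the outputs; for the second regression the same matrix becomes the output and the demographic features become the predictors.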

Evaluation and results

The ability of lexical features to predict the demographic attributes of their authors is evaluated. The purpose of this evaluation is to assess the predictive ability of the compact subset of lexical items identified by the multi-output lasso, as compared with the full vocabulary. Five-fold cross-validation was performed, using the multi-output lasso to identify a sparse feature set in the training data. The results are compared against several dimensionality reduction techniques: truncated singular value decomposition (with the truncation level set to the number of items selected by the multi-output lasso), selecting the N most frequent terms, and selecting the N terms with the greatest variance in frequency across authors. Finally, it is compared to the full vocabulary set. The scoring metric is Pearson's correlation coefficient between the predicted and true demographics. The following figure shows the correlations obtained by regressions performed on a range of different vocabularies, averaged across all five folds.

Avg correlation.jpg

As we can see, linguistic features are best at predicting race/ethnicity, language and the proportion of renters. Among feature sets, the highest average correlation is obtained by the full vocabulary, but the multi-output lasso obtains nearly identical performance using a feature set that is an order of magnitude smaller. Thus the multi-output lasso achieves a 93% compression of the feature set without a significant decrease in predictive performance.
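
The evaluation protocol can be sketched as follows. This is a simplified illustration, not the paper's pipeline: ordinary least squares (via `np.linalg.lstsq`) stands in for the regression model, the feature subset is passed in as column indices instead of being re-selected by the multi-output lasso in each training fold, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def most_frequent_terms(X, k):
    """Indices of the k columns (terms) with the largest total frequency."""
    return np.argsort(-X.sum(axis=0), kind="stable")[:k]


def highest_variance_terms(X, k):
    """Indices of the k columns with the greatest variance across authors."""
    return np.argsort(-X.var(axis=0), kind="stable")[:k]


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def cross_validated_correlation(X, y, columns, n_folds=5, seed=0):
    """Average held-out Pearson correlation using only the given columns."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        # Least-squares fit on the training folds (stand-in for the real model).
        weights, *_ = np.linalg.lstsq(
            X[np.ix_(train_idx, columns)], y[train_idx], rcond=None
        )
        predictions = X[np.ix_(test_idx, columns)] @ weights
        scores.append(pearson(y[test_idx], predictions))
    return float(np.mean(scores))


# Synthetic example: the target depends on column 2 only.
rng = np.random.default_rng(1)
X_demo = rng.random((100, 6))
y_demo = 3.0 * X_demo[:, 2] + 0.01 * rng.normal(size=100)
score_informative = cross_validated_correlation(X_demo, y_demo, [2])
```

Passing the indices returned by `most_frequent_terms` or `highest_variance_terms` as `columns` gives the corresponding baselines, and passing all column indices gives the full-vocabulary comparison.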

The paper lists a table of the most demographically-indicative terms, and a few observations are made from the data. Standard English words tend to appear in areas with more English speakers; predictably, Spanish words tend to appear in areas with Spanish speakers and Hispanics. Emoticons tend to be used in areas with many Hispanics and few African Americans. Abbreviations (e.g., lmaoo) have a nearly uniform demographic profile, displaying negative correlations with whites and English speakers, and positive correlations with African Americans, Hispanics, renters, Spanish speakers, and areas classified as urban. Many non-standard English words (e.g., dats) also appear in areas with high proportions of renters, African Americans, and non-English speakers, though a subset (haha, hahaha, and yep) displays the opposite demographic pattern.


Result16.jpg

The above figure shows the variation of lexical terms with various combinations of demographic features (including the geography of the place). Geography acts as a strong predictor, appearing in 25 of the 37 selected combinations. The paper offers the following analysis of these results.

Features 1 and 2 (F1 and F2) are purely geographical, capturing the northeastern United States and the New York City area. The geographical area of F2 is completely contained in F1; the associated terms are thus very similar, but by having both features, the model can distinguish terms that are used in northeastern areas outside New York City from terms that are especially likely in New York.

Several features conjoin geography with demographic attributes. For example, F9 further refines the New York City area by focusing on communities that have relatively low numbers of Spanish speakers; F17 emphasizes New York neighborhoods that have very high numbers of African Americans and few speakers of languages other than English and Spanish. The regression model can use these features in combination to make fine-grained distinctions between neighborhoods. Outside New York, F4 combines a broad geographic area with attributes that select at least moderate levels of minorities and fewer renters. F15 identifies West Coast communities with large numbers of speakers of languages other than English and Spanish.

Another important observation made by the paper is that the proportion of African Americans appears in 22 of these features, strongly suggesting that African American Vernacular English plays an important role in social media text. While race, geography, and language predominate, the socioeconomic attributes appear in far fewer features. The most prevalent of these is the proportion of renters, which appears in F4 and F7. This attribute may be a better indicator of the urban/rural divide than the "%urban" attribute. Overall, the selected features tend to include attributes that are easy to predict from text.

A small comment

The paper indeed does an interesting analysis of lexical variation with respect to demographics. It would also be interesting to observe lexical variation with respect to age, since people of different age groups express themselves differently.

Study Plan

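To understand this paper you should know about

  • Regressions (http://en.wikipedia.org/wiki/Regression_analysis)
  • Matrix norms (http://en.wikipedia.org/wiki/Matrix_norm)
  • Regularization, more specifically the LASSO method (http://en.wikipedia.org/wiki/Lasso_(statistics)#LASSO_method)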