Eisenstein et al ACL 2011. Discovering Sociolinguistic Associations with Structured Sparsity


Citation

Jacob Eisenstein, Noah A. Smith and Eric P. Xing. Discovering Sociolinguistic Associations with Structured Sparsity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland.

Online Version

Online Pdf

Summary

This paper studies the influence of demography on language, that is, the influence of non-linguistic factors on language usage. It tries to identify lexical variation associated with demographic attributes (race or ethnicity, socioeconomic status, language spoken, etc.). Modelling sociolinguistic associations is hard because of the large number of possible interactions involved. Using multi-output regression with structured sparsity, the method identifies a small subset of words that are most influenced by demographics, and also discovers sets of demographic attributes that influence variation in lexical items. A related paper, "A Latent Variable Model for Geographic Lexical Variation", Eisenstein (2010), studies the influence of geography on language usage using the same dataset.

This problem can be (and has been) studied in quantitative sociolinguistics through carefully designed experiments (for example, the relation between income and the "dropped-r" New York accent). But all of these approaches require both the linguistic attributes and the demographic attributes to be identified beforehand. This paper presents a method to acquire such patterns from raw data.

Data

This paper uses the same geotagged Twitter dataset that the authors used in one of their previous works. The vocabulary was limited to the 5418 terms that were each used by at least 40 authors. No stoplist was applied, since the use of standard or non-standard orthography (for example 'went 2' instead of 'went to') conveys important information about the author. As in another of their previous works, O'Connor (2011), aggregate demographic statistics for the data were obtained by mapping geolocations (obtained from the tweets) to publicly available data from the U.S. Census ZIP Code Tabulation Areas (ZCTA). The demographic attributes considered are listed below:


Demography.jpg
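
The vocabulary filter described above (keep a term only if enough distinct authors use it, and apply no stoplist) can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the input format (a dict from author id to token list), the function names, and the choice to normalize counts by the number of in-vocabulary tokens are all assumptions made for the example.

```python
from collections import defaultdict


def build_vocabulary(author_tokens, min_authors=40):
    """Return the sorted terms used by at least `min_authors` distinct authors.

    author_tokens: dict mapping an author id to a list of tokens
    (a hypothetical input format chosen for this sketch).
    No stoplist is applied, so tokens such as '2' are kept.
    """
    authors_per_term = defaultdict(set)
    for author, tokens in author_tokens.items():
        for token in tokens:
            authors_per_term[token].add(author)
    return sorted(t for t, a in authors_per_term.items() if len(a) >= min_authors)


def term_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary term in one author's tokens.

    Normalizing by the number of in-vocabulary tokens is an assumption
    of this sketch; the summary does not state the normalization used.
    """
    counts = {term: 0 for term in vocabulary}
    in_vocab = 0
    for token in tokens:
        if token in counts:
            counts[token] += 1
            in_vocab += 1
    if in_vocab == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / in_vocab for term in vocabulary]


# Toy data: three authors, threshold lowered to 2 so the example is small.
toy = {
    "a": ["went", "2", "the"],
    "b": ["went", "to", "the"],
    "c": ["2", "the", "the"],
}
vocab = build_vocabulary(toy, min_authors=2)
rows = [term_frequencies(toy[author], vocab) for author in sorted(toy)]
```

Each entry of `rows` is one author's row of the term-frequency matrix; with the real data the threshold would be 40 authors.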

Method

As noted earlier, the paper casts the sociolinguistic problem as a multi-output regression between demographic attributes and lexical frequencies. Two variations of the regression are performed. First, term frequencies are used as input to predict demographic attributes, resulting in a compact set of words that are strongly associated with author demographics. Second, conjunctions of demographic attributes (for example socioeconomic status and ethnicity, such as African American and Renter) are used to predict lexical frequencies.

The following section describes the formal model for multi-output regression with structured sparsity.

Formal Model (from the paper)

The following linear equation is considered

$Y = XB + \epsilon$, where

  • $Y$ is the dependent variable matrix, with dimensions $N \times T$, where $N$ is the number of samples (training data) and $T$ is the number of output dimensions (or tasks);
  • $X$ is the independent variable matrix, with dimensions $N \times P$, where $P$ is the number of input dimensions (or predictors, which change depending on which of the two regression problems is being solved);
  • $B$ is the matrix of regression coefficients, with dimensions $P \times T$;
  • $\epsilon$ is an $N \times T$ matrix in which each element is noise from a zero-mean Gaussian distribution.
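
To make the shapes in the linear model Y = XB + noise concrete (Y is N x T, X is N x P, B is P x T, and the noise matrix is N x T), here is a small simulation sketch. The helper names, the Gaussian choice for X and B, and the noise scale are assumptions for illustration only; nothing here comes from the paper's data.

```python
import random


def matmul(A, B):
    """Multiply an n x p matrix by a p x t matrix (both lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def simulate(N, P, T, noise_sd=0.1, seed=0):
    """Draw X (N x P) and B (P x T), then form Y = XB + noise (N x T)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(N)]
    B = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(P)]
    # Each noise entry is an independent zero-mean Gaussian draw.
    E = [[rng.gauss(0, noise_sd) for _ in range(T)] for _ in range(N)]
    XB = matmul(X, B)
    Y = [[XB[i][t] + E[i][t] for t in range(T)] for i in range(N)]
    return X, B, Y


X, B, Y = simulate(N=6, P=4, T=3)
print(len(X), len(X[0]))  # rows and columns of X: N and P
print(len(B), len(B[0]))  # rows and columns of B: P and T
print(len(Y), len(Y[0]))  # rows and columns of Y: N and T
```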

The next step is to solve the unconstrained optimization problem

$\min_B \; \|Y - XB\|_F^2 + \lambda R(B)$

where $\|\cdot\|_F^2$ indicates the squared Frobenius norm, $\lambda$ is the regularization weight, and the function $R(B)$ defines a norm on the regression coefficients $B$.

With either the ridge regression ($\ell_2$) norm or the lasso ($\ell_1$) norm, the multi-output regression problem decomposes, so each output dimension can be treated separately. In practice, however, there is a lot of correlation among the lexical terms and the demographic features (for example, many words are associated with the Spanish-speaking demographic group). The main goal is to select a small set of predictors (words or demographic features) that is representative across all outputs.

One way to do this is to use a regularizer that drives entire rows of the coefficient matrix to zero. This is called structured sparsity. The lasso's $\ell_1$ norm does not achieve it: it gives element-wise sparsity (many entries of the coefficient matrix are driven to zero, but a row may still keep non-zero entries for some output dimensions). To drive entire rows of $B$ to zero, the paper considers the $\ell_{1,\infty}$ norm, $R(B) = \sum_{p} \max_{t} |b_{pt}|$, which sums over predictors the $\ell_\infty$ norm taken across output dimensions. This norm, which corresponds to a multi-output lasso regression, has the desired property of driving entire rows of $B$ to zero.
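
The difference between element-wise and row-wise sparsity can be seen by evaluating the two penalties on small coefficient matrices. The sketch below assumes the usual definitions (the l1 norm sums all absolute entries; the l1,inf norm sums, over rows, the largest absolute entry in each row) and a regularization weight `lam`; the function names and toy matrices are made up for illustration, and no solver is implemented.

```python
def l1_norm(B):
    """Lasso penalty: sum of absolute values of all entries."""
    return sum(abs(b) for row in B for b in row)


def l1_inf_norm(B):
    """l1,inf penalty: for each row (predictor) take the max |entry|, then sum."""
    return sum(max(abs(b) for b in row) for row in B)


def active_rows(B):
    """Number of predictors with at least one non-zero coefficient."""
    return sum(1 for row in B if any(b != 0 for b in row))


def objective(Y, X, B, lam, penalty):
    """Squared Frobenius norm of (Y - XB) plus lam * penalty(B)."""
    total = 0.0
    for x_row, y_row in zip(X, Y):
        for t, y in enumerate(y_row):
            pred = sum(x * B[p][t] for p, x in enumerate(x_row))
            total += (y - pred) ** 2
    return total + lam * penalty(B)


# Two 3 x 2 coefficient matrices with the same non-zero values.
B_shared = [[2, -1], [0, 0], [0, 0]]     # non-zeros share one row
B_scattered = [[2, 0], [0, -1], [0, 0]]  # non-zeros spread over two rows

for name, B in [("shared", B_shared), ("scattered", B_scattered)]:
    print(name, l1_norm(B), l1_inf_norm(B), active_rows(B))
```

Both matrices have the same l1 norm, so the lasso penalty is indifferent between them, while the l1,inf penalty is smaller when the non-zero coefficients sit in the same row. That preference is what pushes whole rows, and hence whole predictors, to zero.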

Application of the model

  • For predicting demographic features from words: the multi-output regression method can be used to select a small subset of vocabulary items that are especially indicative of demographic and geographic differences. The predictors X are set to the term frequencies, with one column for each word type (P columns) and one row for each author in the dataset (N rows).
  • For predicting lexical frequencies from demographic features: the direction is reversed, so the demographic features (and their conjunctions) become the predictors X and the term frequencies become the outputs.
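
For the second setting, one way to picture a conjunction of demographic attributes is as the logical AND of binary indicators. The sketch below is only an assumption about how such predictors could be built: the thresholds, attribute names, and the restriction to pairwise conjunctions are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def binarize(values, threshold):
    """Turn a per-author attribute into a 0/1 indicator (threshold is hypothetical)."""
    return [1 if v >= threshold else 0 for v in values]


def conjunction_features(indicators):
    """Return the single indicators plus every pairwise AND.

    indicators: dict mapping an attribute name to a list of 0/1 values,
    one per author. The AND of two indicators is their product.
    """
    features = dict(indicators)
    for a, b in combinations(sorted(indicators), 2):
        features[a + " & " + b] = [x * y for x, y in zip(indicators[a], indicators[b])]
    return features


# Toy proportions for three authors' areas (made-up numbers).
indicators = {
    "african_american": binarize([0.7, 0.1, 0.6], threshold=0.5),
    "renter": binarize([0.8, 0.9, 0.2], threshold=0.5),
}
features = conjunction_features(indicators)
# Each value in `features` is one candidate column of the predictor matrix X.
```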

Evaluation and results

The evaluation measures how well lexical features predict the demographic attributes of their authors. Its purpose is to assess the predictive ability of the compact subset of words identified by the multi-output lasso, as compared with the full vocabulary. Five-fold cross-validation was performed, using the multi-output lasso to identify a sparse feature set in the training data. The results are compared against several dimensionality reduction techniques: truncated singular value decomposition (with the truncation level set to the number of terms selected by the multi-output lasso), selecting the N most frequent terms, and selecting the N terms with the greatest variance in frequency across authors. Finally, the results are compared to the full vocabulary. The scoring metric is Pearson's correlation coefficient between the predicted and true demographics. The following figure shows the correlations obtained by regressions performed on a range of different vocabularies, averaged across all five folds.
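
The scoring step described above can be sketched with two small helpers: one for Pearson's correlation between predicted and true values of a single demographic attribute, and one that yields train/test index splits for k-fold cross-validation. The contiguous fold layout and the convention of returning 0.0 for constant inputs are assumptions of this sketch; the regression model itself is left out.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence; 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Usage outline: fit on `train`, predict on `test`, score each attribute
# with pearson(), then average the scores over the folds.
folds = list(k_fold_indices(10, k=5))
```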

Avg correlation.jpg

An interesting observation is that linguistic features are best at predicting race/ethnicity, language and the proportion of renters. Among the feature sets, the highest average correlation is obtained by the full vocabulary, but the multi-output lasso obtains nearly identical performance using a feature set that is far smaller than the full vocabulary. The multi-output lasso thus achieves a 93% compression of the feature set without a significant decrease in predictive performance, which is a promising result.

The paper lists a table of the most demographically indicative terms. A few observations are made from the data.

  • English and Spanish words are used more in places with the corresponding population (understandably!).
  • Emoticons tend to be used in areas with many Hispanics and few African Americans.
  • Some abbreviations have a nearly uniform demographic profile, displaying negative correlations with whites and English speakers, and positive correlations with African Americans, Hispanics, renters, Spanish speakers, and areas classified as urban.


Result16.jpg

The above figure shows the variation of lexical terms with various combinations of demographic features (including the geography of the place).

The paper draws the following interesting conclusions from the above data.

  • Geography acts as a strong feature (it appears in 25 of the 37 selected combinations). F1 and F2 are defined purely by geography, capturing the northeastern United States and the New York City area.
  • Several features occur along with geography. F9 divides the New York City area by focusing on communities that have relatively low numbers of Spanish speakers; F17 emphasizes New York neighborhoods that have very high numbers of African Americans and few speakers of languages other than English and Spanish. These conjunctions give insight into the various communities within a city.
  • Another important observation in the paper concerns the proportion of African Americans, which appears in several of the selected features; this strongly suggests that African American Vernacular English plays an important role in language use. (This finding is particularly fascinating to me!)
  • The socioeconomic attributes appear in far fewer features. The most prevalent attribute is the proportion of renters, which appears in F4 and F7. This suggests that it could be a better indicator than the "%urban" feature.

A small comment

The paper does an interesting analysis of lexical variation with respect to demographics. It would also be interesting to observe lexical variation with respect to age, since people of different age groups do express themselves differently. (I guess the paper could not do so because age is largely independent of geography and hence would not show up in this dataset. An interesting alternative would be to analyze data coming from, or close to, universities, which have a younger population, and from office districts, which have an older one.)

Study Plan

To understand this paper you should know about