Difference between revisions of "Eisenstein et al ACL 2011. Discovering Sociolinguistic Associations with Structured Sparsity"

From Cohen Courses

Revision as of 22:44, 30 September 2012

Citation

Jacob Eisenstein, Noah A. Smith and Eric P. Xing. Discovering Sociolinguistic Associations with Structured Sparsity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland.

Online Version

Online Pdf

Summary

This paper studies the influence of demography on language. In other words, it tries to identify lexical variation with respect to certain demographic attributes (race or ethnicity, socioeconomic status, language spoken, etc.). Modelling sociolinguistic associations is a complex problem because of the large number of possible interactions involved. Using multi-output regression with structured sparsity, the method identifies a small subset of words that are most influenced by demographics, and also discovers conjunctions of demographic attributes that influence variation in lexical items.

This problem can be studied in quantitative sociolinguistics through carefully designed experiments, but all of these approaches require both the linguistic attributes and the demographic attributes to be identified beforehand. This paper presents a method to acquire such patterns from raw data.

Data

This paper uses the same geotagged Twitter dataset that the authors used in one of their previous works. The vocabulary was limited to the 5418 terms that were used by at least 40 authors. No stoplists were applied, since the use of standard or non-standard orthography (for example, the phrase 'went 2' instead of 'went to') conveys important information about the author. In another of their previous works, aggregate demographic statistics for the data were obtained by mapping geolocations (obtained from the tweets) to publicly available data from the U.S. Census ZIP Code Tabulation Areas (ZCTA). The demographic attributes considered are listed below:


Demography.jpg
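The vocabulary cut-off described above (keep only terms used by at least 40 distinct authors, with no stoplist) could be sketched roughly as follows. This is only an illustration: the function name build_vocabulary, the (author_id, text) input format and the whitespace tokenization are assumptions, not details taken from the paper.

```python
from collections import defaultdict

def build_vocabulary(tweets, min_authors=40):
    """Return the set of terms used by at least `min_authors` distinct authors.

    `tweets` is an iterable of (author_id, text) pairs (hypothetical format).
    Tokens are split on whitespace and no stoplist is applied, so that
    non-standard spellings such as '2' for 'to' are kept.
    """
    authors_per_term = defaultdict(set)
    for author_id, text in tweets:
        for term in text.split():
            authors_per_term[term].add(author_id)
    return {term for term, authors in authors_per_term.items()
            if len(authors) >= min_authors}

# Toy usage with a threshold of 2 authors instead of 40.
toy_tweets = [
    ("a1", "went 2 the store"),
    ("a2", "went to the store"),
    ("a3", "went 2 school"),
]
vocab = build_vocabulary(toy_tweets, min_authors=2)
```

In the toy data, 'to' and 'school' are each used by a single author and are dropped, while '2' survives because two authors use it.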

Method

As noted earlier, the paper treats the sociolinguistic problem as a multi-output regression problem between demographic attributes and lexical frequencies. Two variations of the regression are performed. First, lexical frequencies are used as input to predict demographic attributes, which identifies a compact set of words that are strongly associated with author demographics. Second, demographic attributes are conjoined into features, which are used to predict lexical frequencies.

The following section describes the formal model for multi-output regression with structured sparsity.

Formal Model

The following linear equation is considered:

<math>Y = XB + \epsilon</math>, where

  • <math>Y</math> is the dependent variable matrix, with dimensions <math>N \times T</math>, where <math>N</math> is the number of samples (training data) and <math>T</math> is the number of output dimensions (or tasks);
  • <math>X</math> is the independent variable matrix, with dimensions <math>N \times P</math>, where <math>P</math> is the number of input dimensions (or predictors, which change depending on which of the two regression problems is being solved);
  • <math>B</math> is the matrix of regression coefficients, with dimensions <math>P \times T</math>;
  • <math>\epsilon</math> is an <math>N \times T</math> matrix in which each element is noise from a zero-mean Gaussian distribution.
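To make the shapes in the linear model concrete, here is a small simulation sketch. It writes the dependent matrix as Y (N x T), the independent matrix as X (N x P), the coefficients as B (P x T) and the noise as an N x T matrix. The toy sizes, the noise scale and the random seed are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P, T = 100, 5, 3          # samples, predictors, tasks (toy sizes, assumed)
X = rng.normal(size=(N, P))  # independent variable matrix, N x P
B = rng.normal(size=(P, T))  # regression coefficients, P x T
# Zero-mean Gaussian noise, one value per entry of Y (scale is an assumption).
noise = rng.normal(loc=0.0, scale=0.1, size=(N, T))

Y = X @ B + noise            # dependent variable matrix, N x T
```

Each column of Y is one task, so swapping which quantities play the role of X and Y (word frequencies or demographic attributes) changes P and T but not the form of the model.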

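The next step is to solve the unconstrained optimization problem

  • <math>\arg\min_{B} ||Y - XB||^2 + \lambda R(B)</math>, where <math>||A||^2</math> indicates the squared Frobenius norm of a matrix <math>A</math> and the function <math>R(B)</math> defines a norm on the regression coefficients <math>B</math>.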
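As a rough sketch, the regularized objective (squared Frobenius norm of the residual Y - XB plus a weight lambda times a norm R(B)) can be evaluated as below. The text above only says that R defines a norm on the coefficients; the row-wise l1/l-infinity mixed norm used here is one common structured-sparsity choice and is an assumption of this example, as are the function names and the toy matrices.

```python
import numpy as np

def l1_linf_norm(B):
    """Sum over rows (predictors) of the largest absolute coefficient in each row.

    Assumed choice of R(B); a row contributes zero only when the predictor
    is unused by every task.
    """
    return np.abs(B).max(axis=1).sum()

def objective(Y, X, B, lam, R=l1_linf_norm):
    """Squared Frobenius norm of the residual plus lam * R(B)."""
    residual = Y - X @ B
    return np.sum(residual ** 2) + lam * R(B)

# Toy check: with Y generated exactly by B, only the penalty term remains.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = np.array([[2.0, -1.0], [0.0, 0.0]])
Y = X @ B
value = objective(Y, X, B, lam=0.5)
```

Here the second predictor has an all-zero row in B, so it adds nothing to the penalty; this is the sense in which a row-wise norm encourages whole predictors to be dropped across all tasks at once.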