Eisenstein et al ACL 2011. Discovering Sociolinguistic Associations with Structured Sparsity

Revision as of 20:55, 30 September 2012 by Rajarshd (talk | contribs) (→‎Data)

Citation

Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. Discovering Sociolinguistic Associations with Structured Sparsity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland.

Online Version

Online Pdf

Summary

This paper studies the influence of demographics on language. Specifically, it tries to identify lexical variation associated with demographic attributes (race or ethnicity, socioeconomic status, language spoken, etc.). Modelling sociolinguistic associations is difficult because of the large number of possible interactions involved. Using multi-output regression with structured sparsity, the method identifies a small subset of words that are most influenced by demographics, and also discovers conjunctions of demographic attributes that influence variation in lexical items.
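To make the idea of structured sparsity concrete, here is a minimal sketch of group soft-thresholding, the shrinkage step behind a group-lasso style penalty. It is an illustration under assumptions, not the paper's actual optimizer: the grouping (one coefficient vector per word, one entry per demographic attribute), the l2 group norm, the function names, and the toy numbers are all hypothetical.

```python
import math

def group_soft_threshold(coefs, lam):
    """Shrink each word's coefficient vector as a single group.

    coefs maps a word to its regression coefficients, one per
    demographic attribute (an assumed layout, for illustration).
    A group whose l2 norm is at most lam is set entirely to zero,
    so the word is dropped from the model as a whole.
    """
    shrunk = {}
    for word, vec in coefs.items():
        norm = math.sqrt(sum(v * v for v in vec))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        shrunk[word] = [scale * v for v in vec]
    return shrunk

def selected_words(coefs):
    """Return the words that keep at least one non-zero coefficient."""
    return sorted(w for w, vec in coefs.items() if any(v != 0.0 for v in vec))

# Toy coefficients (made up): only "2" varies strongly with demographics.
coefs = {
    "went": [0.01, -0.02, 0.0],
    "2": [0.9, -0.4, 0.3],
    "the": [0.05, 0.0, 0.03],
}
shrunk = group_soft_threshold(coefs, lam=0.5)
kept = selected_words(shrunk)
print(kept)
```

Because the penalty acts on whole groups rather than on individual coefficients, a word is either kept with all of its demographic coefficients or removed entirely, which is what yields a small subset of selected words.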

This problem can be studied in quantitative sociolinguistics through carefully designed experiments, but such approaches require both the linguistic attributes and the demographic attributes to be identified beforehand. This paper presents a method to acquire such patterns from raw data.

Data

This paper uses the same geotagged Twitter dataset that the authors used in one of their previous works. The vocabulary was limited to 5418 terms, each used by at least 40 authors. No stoplists were applied, since the use of standard or non-standard orthography (for example, 'went 2' instead of 'went to') conveys important information about the author. In another of their previous works, aggregate demographic statistics for the data were obtained by mapping geolocations (obtained from the tweets) to publicly available data from the U.S. Census ZIP Code Tabulation Areas (ZCTA). The demographic attributes considered are listed below:


[Figure: Demography.jpg, listing the demographic attributes considered]
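The vocabulary restriction described in this section (keep only terms used by at least 40 authors, with no stoplist) could be sketched roughly as follows. The function name, the (author, text) input format, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the paper's actual preprocessing may differ.

```python
from collections import defaultdict

def build_vocabulary(tweets, min_authors=40):
    """Keep terms used by at least min_authors distinct authors.

    tweets is an iterable of (author_id, text) pairs (assumed format).
    No stoplist is applied, so function words and non-standard
    spellings such as '2' for 'to' remain eligible.
    """
    authors_per_term = defaultdict(set)
    for author, text in tweets:
        # Naive tokenization: lowercase and split on whitespace (assumption).
        for term in text.lower().split():
            authors_per_term[term].add(author)
    return {t for t, authors in authors_per_term.items()
            if len(authors) >= min_authors}

# Toy data with a lowered threshold so the effect is visible.
tweets = [
    ("a1", "went 2 the store"),
    ("a2", "went to the store"),
    ("a1", "went 2 sleep"),
]
vocab = build_vocabulary(tweets, min_authors=2)
print(sorted(vocab))
```

Counting distinct authors rather than raw occurrences keeps one prolific user from pushing a term into the vocabulary alone: in the toy data, "2" appears twice but only from author a1, so it is excluded at a threshold of two authors.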