Difference between revisions of "A Latent Variable Model for Geographic Lexical Variation"

From Cohen Courses
 
(10 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
== Summary ==
 
== Summary ==
  
This [[Category::paper]] aims to [[AddressesProblems::analyze the variation in the usage of words in vernacular wrt geography]]. In particular, it analyzes lexical variation by both topic and geography. It also separates regions into coherent linguistic communities. Also it can predict with some accuracy the location of the author from raw text.
+
This [[Category::paper]] aims to analyze the variation in the usage of words in vernacular wrt geography ([[AddressesProblem::Influence of non-linguistic factors over language usage]]). In particular, it analyzes lexical variation by both topic and geography. It also separates regions into coherent linguistic communities. Also it can predict with some accuracy the location of the author from raw text.
  
They develop a model that incorporates two sources of lexical variation : topic and geographical region, both as latent variables. At the base level of the model are (as referred in the paper) "pure" topics (such as sports, weather, slangs) and these topics are used differently in different geographic regions.
+
This paper develops a model that incorporates two sources of lexical variation : topic and geographical region, both as latent variables. At the base level of the model are "pure" topics (such as sports, weather, slangs) and these topics are used differently in different geographic regions.
  
 
== Data ==
 
== Data ==
  
This work is based on the [[UsesDataset::Twitter]] dataset which can be found [http://www.ark.cs.cmu.edu/GeoText/ here]. Only GeoTagged data is used. Also they choose users based on certain criterias such as, they should be active on twitter (wrote atleast 20 messages over the period) and should follow less than 1000 people and have less than 1000 followers (so they are not celebrities or influential people)
+
This work is based on the [[UsesDataset::GeoTagged Twitter Dataset]] dataset which can be found [http://www.ark.cs.cmu.edu/GeoText/ here]. Only GeoTagged data is used. Also they choose users based on certain criterias such as, they should be active on twitter (wrote atleast 20 messages over the period) and should follow less than 1000 people and have less than 1000 followers (so they are not celebrities or influential people)
  
 
== Discussion ==
 
== Discussion ==
Line 28: Line 28:
 
== Analysis and Results ==
 
== Analysis and Results ==
  
They present an interesting analysis of a subset of result (one randomly initialized run of the system with 5 (of 50) hand chosen topics and 5 (of 13) regions). For topics relating to sports (such as basketball) names of teams, sports person etc are most common from where they belong to indicating an encouraging result. Similarly for topics which are more conversational (such as daily life, emoticons, chit-chat, slangs) there was a distinct geographic variation which was observed for words which are used to express the same thing. Also, there were a lot of spanish words in regions of high spanish population. There are other terms which refer to object that were especially relevant to certain regions (for eg cab in NYC)
+
They present an interesting analysis of a subset of result (one randomly initialized run of the system with 5 (of 50) hand chosen topics and 5 (of 13) regions). For topics relating to sports (such as basketball) names of teams, sports person etc are most common from where they belong to indicating an encouraging result. Similarly for topics which are more conversational (such as daily life, emoticons, chit-chat, slangs), there was a distinct geographic variation which was observed for words which are used to express the same thing (such as 'koo' or 'coo' for the same word to express 'cool'). Also there were other terms which refer to object that were especially relevant to certain regions (for eg cab in NYC). Population of an ethnicity also played a role as Spanish language terms appeared in regions with large Spanish speaking population.
 +
 
 +
The geographic topic model that they proposed achieved the strongest performance both on regression and classification accuracy as compared to Supervised LDA or Text Regression methods.
  
 
== Related Paper ==
 
== Related Paper ==
Line 35: Line 37:
  
 
== Study Plan ==
 
== Study Plan ==
 +
To understand this paper you might want to read
  
[http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf Latent Dirichlet Allocation]
+
*this seminal paper on [http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf Latent Dirichlet Allocation]
  
[http://en.wikipedia.org/wiki/Normal_distribution Normal Distribution]
+
*[http://en.wikipedia.org/wiki/Normal_distribution Normal Distribution]
  
[http://en.wikipedia.org/wiki/Dirichlet_distribution Dirichlet Distribution]
+
*[http://en.wikipedia.org/wiki/Dirichlet_distribution Dirichlet Distribution]

Latest revision as of 20:54, 3 October 2012

Citation

A Latent Variable Model for Geographic Lexical Variation. Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), Cambridge, MA, October 2010.

Online version

Pdf of the paper

Summary

This paper analyzes how word usage in vernacular language varies with geography (Influence of non-linguistic factors over language usage). In particular, it analyzes lexical variation by both topic and geography, separates regions into coherent linguistic communities, and predicts, with some accuracy, the location of an author from raw text.

This paper develops a model that incorporates two sources of lexical variation, topic and geographical region, both as latent variables. At the base level of the model are "pure" topics (such as sports, weather, and slang), and these topics are used differently in different geographic regions.

Data

This work is based on the GeoTagged Twitter Dataset, which can be found at http://www.ark.cs.cmu.edu/GeoText/. Only geotagged data is used. Users are also selected by certain criteria: they should be active on Twitter (having written at least 20 messages over the period), and they should follow fewer than 1000 people and have fewer than 1000 followers (so that they are not celebrities or influential people).
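
The selection criteria can be written as a simple predicate. The following is a minimal sketch, assuming per-user counts are already available; the function name keep_user and its parameter names are hypothetical and do not come from the paper or the dataset.

```python
def keep_user(n_messages, n_following, n_followers,
              min_messages=20, max_connections=1000):
    """Return True when a user passes the activity and popularity filters.

    Thresholds follow the summary: at least 20 messages, and fewer than
    1000 people followed and fewer than 1000 followers.
    """
    return (n_messages >= min_messages
            and n_following < max_connections
            and n_followers < max_connections)
```

For example, keep_user(35, 120, 80) passes, while a user with 5000 followers is filtered out.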

Discussion

The Twitter feed of each user (author) is collected over a period of time to form a document. For each author, the geographical region is a latent variable: it is not observed. The paper uses a Cascading Topic Model, which generates text from a chain of random variables. Each element in the chain defines a distribution over words and acts as the mean of the distribution over the subsequent element in the chain, so each level introduces some additional corruption.

At a high level, the model does the following:

  • Generate base topics
  • Generate regional variants
  • Generate regions
  • Generate text and location
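
The four steps above can be illustrated with a toy generative sketch. This is a simplified illustration and not the paper's implementation: the Gaussian variances, the symmetric Dirichlet prior, the uniform choice of region, the bounding box for region centers, and all names (generate_corpus, softmax, sample_dirichlet) are assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(rng, alpha, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics=2, n_regions=2, n_authors=3,
                    n_words=8, seed=0):
    rng = random.Random(seed)
    v = len(vocab)

    # 1. Base topics: one vector of word scores per "pure" topic.
    base = [[rng.gauss(0.0, 1.0) for _ in range(v)] for _ in range(n_topics)]

    # 2. Regional variants: each region perturbs every base topic, so the
    #    base topic acts as the mean of its regional versions.
    variants = [[softmax([rng.gauss(m, 0.5) for m in base[k]])
                 for k in range(n_topics)]
                for _ in range(n_regions)]

    # 3. Regions: each region gets a center (latitude, longitude);
    #    the bounding box here is an arbitrary assumption.
    centers = [(rng.uniform(25.0, 49.0), rng.uniform(-124.0, -67.0))
               for _ in range(n_regions)]

    # 4. Text and location: each author has a hidden region, a location
    #    scattered around that region's center, and words drawn from the
    #    regional variant of a per-word topic.
    authors = []
    for _ in range(n_authors):
        region = rng.randrange(n_regions)
        lat = rng.gauss(centers[region][0], 1.0)
        lon = rng.gauss(centers[region][1], 1.0)
        theta = sample_dirichlet(rng, 1.0, n_topics)
        words = []
        for _ in range(n_words):
            topic = rng.choices(range(n_topics), weights=theta)[0]
            word = rng.choices(vocab, weights=variants[region][topic])[0]
            words.append(word)
        authors.append({"region": region, "location": (lat, lon),
                        "words": words})
    return authors
```

Calling generate_corpus(["cool", "coo", "koo", "cab", "taxi"]) returns one record per author with a region index, a location, and a list of words. In the actual model the region is hidden and must be inferred from the observed text and location; the sketch only shows the forward (generative) direction.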

Analysis and Results

They present an interesting analysis of a subset of the results (one randomly initialized run of the system, with 5 (of 50) hand-chosen topics and 5 (of 13) regions). For topics relating to sports (such as basketball), the names of teams, athletes, etc. are most common in the regions they belong to, which is an encouraging result. Similarly, for more conversational topics (such as daily life, emoticons, chit-chat, and slang), there was distinct geographic variation in the words used to express the same thing (such as 'koo' or 'coo' for 'cool'). Other terms refer to objects that are especially relevant to certain regions (e.g., cab in NYC). The ethnic makeup of the population also played a role: Spanish-language terms appeared in regions with large Spanish-speaking populations.

The proposed geographic topic model achieved the strongest performance on both regression and classification, compared with Supervised LDA or Text Regression methods.

Related Paper

  • Mei et al. (2006) study the relation between the topics of blog posts and geographic location.

Study Plan

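To understand this paper, you might want to read:

  • the seminal paper on Latent Dirichlet Allocation (http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf)
  • Normal Distribution (http://en.wikipedia.org/wiki/Normal_distribution)
  • Dirichlet Distribution (http://en.wikipedia.org/wiki/Dirichlet_distribution)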