Connections between the Lines: Augmenting Social Networks with Text

This paper is available online: http://www.umiacs.umd.edu/~jbg/docs/kdd2009.pdf


Summary

This paper proposes a topic model based on Latent Dirichlet Allocation for social networks. The main focus of the paper is to adapt probabilistic topic models to account for relationships between entities. Entities here are collections of discrete data; since the paper deals only with words, an entity is represented as a document. In particular, the paper describes a model whose generative process chooses words from mixtures of topics both for the entities themselves and for the relationships between them. The focus is on network data that can be modeled as pairwise relationships between entities. Unlike standard LDA, a word can be generated from a distribution over topics for the relationship, in addition to the usual route through the entity's own topics.
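
To make the departure from standard LDA concrete, here is a minimal sketch of that word-generation step; the switch probability p_rel, the flat coin-flip choice, and all names are illustrative assumptions rather than the paper's actual model:

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 def generate_word(entity_theta, rel_theta, entity_topics, rel_topics, p_rel=0.5):
     # entity_theta / rel_theta: mixture weights over topics (1-D arrays summing to 1)
     # entity_topics / rel_topics: 2-D arrays; each row is a multinomial over the vocabulary
     if rng.random() < p_rel:
         # word generated via the relationship's topic distribution
         z = rng.choice(len(rel_theta), p=rel_theta)
         return rng.choice(rel_topics.shape[1], p=rel_topics[z])
     # otherwise the usual LDA route via the entity's own topics
     z = rng.choice(len(entity_theta), p=entity_theta)
     return rng.choice(entity_topics.shape[1], p=entity_topics[z])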

Datasets

The authors evaluate their results on three different text datasets: the Bible, biological scientific abstracts, and Wikipedia. In the Bible dataset, characters were manually labelled as entities, and documents are individual verses; the co-occurrence of two entities (characters) in a verse serves as a link in the network. Similarly, in the biological dataset, genes, proteins, and diseases were manually labelled as entities; documents are individual abstracts, and co-occurrences of entities within an abstract create the linking structure. The Wikipedia dataset is the most straightforward: pages are the entities and the links between them define the network.
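
As a rough sketch of this co-occurrence link extraction (assuming documents are already tokenized and the entities come as a pre-labelled set; all names here are hypothetical):

 from collections import Counter
 from itertools import combinations
 
 def build_network(documents, entities):
     # documents: list of token lists (e.g. one list per Bible verse or abstract)
     # entities: set of manually labelled entity strings
     links = Counter()
     for tokens in documents:
         # entities mentioned in this document, in a canonical order
         present = sorted(set(tokens) & entities)
         # every co-occurring pair of entities becomes (or strengthens) a link
         for pair in combinations(present, 2):
             links[pair] += 1
     return links
 
 # e.g. links = build_network(verses, {"moses", "aaron", "miriam"})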

All corpora were preprocessed to remove stop words, and the Porter stemmer was used to stem the remaining words. Additionally, infrequent tokens, entities, and entity pairs were pruned from each dataset.
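
A minimal sketch of this preprocessing, assuming NLTK's stop-word list and Porter stemmer; the min_count cutoff of 5 is an arbitrary placeholder, not a value from the paper:

 from collections import Counter
 from nltk.corpus import stopwords      # requires nltk.download("stopwords")
 from nltk.stem import PorterStemmer
 
 def preprocess(documents, min_count=5):
     stop = set(stopwords.words("english"))
     stemmer = PorterStemmer()
     # remove stop words, then stem everything that remains
     stemmed = [[stemmer.stem(w) for w in doc if w.lower() not in stop]
                for doc in documents]
     # prune tokens that are infrequent across the whole corpus
     counts = Counter(w for doc in stemmed for w in doc)
     return [[w for w in doc if counts[w] >= min_count] for doc in stemmed]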

Methodology

The authors describe the model as a generative story, i.e. the step-by-step probabilistic process by which each word is assumed to be generated; a sketch of the first step follows the list below.

1.) For each entity topic j and relationship topic k,

  (a) Draw a topic multinomial from a Dirichlet distribution

2.)
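
A minimal sketch of step 1(a), assuming numpy; the topic counts, vocabulary size, and the Dirichlet hyperparameter are illustrative placeholders, not values reported in the paper:

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 V = 1000                # vocabulary size (placeholder)
 n_entity_topics = 10    # placeholder count of entity topics j
 n_rel_topics = 10       # placeholder count of relationship topics k
 beta = 0.01             # symmetric Dirichlet hyperparameter (placeholder)
 
 # Step 1(a): each topic is a multinomial over the vocabulary,
 # drawn from a symmetric Dirichlet prior.
 entity_topics = rng.dirichlet(np.full(V, beta), size=n_entity_topics)
 rel_topics = rng.dirichlet(np.full(V, beta), size=n_rel_topics)
 
 # each row is a distribution over words, so it sums to 1
 assert np.allclose(entity_topics.sum(axis=1), 1.0)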