Connections between the Lines: Augmenting Social Networks with Text
This paper is available online [1].
Summary
This paper proposes a topic model for social networks based on Latent Dirichlet Allocation (LDA). The main focus is to adapt probabilistic topic models to account for relationships between entities. In this paper, an entity is a collection of discrete data; since the authors deal only with words, an entity's data is effectively a document. In particular, the paper describes a model whose generative process chooses words from mixtures of topics, both for individual entities and for the relationships between them. The focus is on network data, which can be modeled as relationships between pairs of entities. Unlike standard LDA, a word can be generated from a distribution over topics for a relationship, in addition to the usual per-entity mechanism.
Datasets
The authors evaluate their results on three different text datasets: the Bible, biological scientific abstracts, and Wikipedia. In the Bible dataset, characters have been manually labelled as entities; documents are individual verses, and the co-occurrence of two entities (characters) in a verse serves as a link in the network. Similarly, the biological dataset has genes, proteins, and diseases manually labelled as entities; documents are individual abstracts, and co-occurrences of entities within an abstract create the linking structure. The Wikipedia dataset is the most straightforward: entities are pages, and the links between them form the network.
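To make the linking scheme concrete, here is a minimal sketch of how such a co-occurrence network could be assembled from entity-annotated documents. The function and variable names are hypothetical, not from the paper, and exactly which words count as an entity's context versus a pair's context is a design choice; this sketch simply assigns all of a document's words to both.

    from collections import defaultdict
    from itertools import combinations

    def build_network(documents):
        """Assemble entity and relationship contexts from annotated documents.

        `documents` is a list of (words, entities) pairs, where `entities`
        is the set of manually labelled entities mentioned in that document
        (e.g. the characters appearing in a Bible verse).
        """
        entity_context = defaultdict(list)  # entity -> words from its contexts
        pair_context = defaultdict(list)    # (e, e') -> words from shared contexts
        for words, entities in documents:
            for e in entities:
                entity_context[e].extend(words)
            # any two entities co-occurring in a document form a link
            for e, e2 in combinations(sorted(entities), 2):
                pair_context[(e, e2)].extend(words)
        return entity_context, pair_context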
All corpora were preprocessed to remove stop words, and the Porter stemmer was used to stem words. Additionally, infrequent tokens, entities, and entity pairs were pruned from the datasets.
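A minimal preprocessing sketch along these lines, using NLTK's stop-word list and Porter stemmer (the frequency cutoff is an illustrative assumption; the paper's exact thresholds are not restated here):

    from collections import Counter
    from nltk.corpus import stopwords   # requires nltk.download('stopwords')
    from nltk.stem import PorterStemmer

    def preprocess(corpus, min_count=5):
        """Lowercase, drop stop words, stem, and prune infrequent tokens."""
        stop = set(stopwords.words('english'))
        stemmer = PorterStemmer()
        stemmed = [[stemmer.stem(w) for w in doc.lower().split() if w not in stop]
                   for doc in corpus]
        counts = Counter(w for doc in stemmed for w in doc)
        return [[w for w in doc if counts[w] >= min_count] for doc in stemmed]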
Methodology
The authors describe the model in terms of a generative story for how each word is produced.
1.) For each entity topic j and each relationship topic k,
    (a) Draw a topic multinomial from a Dirichlet distribution.
2.) For each entity e,
    (a) Draw entity topic proportions Theta(e) from a Dirichlet.
    (b) For each word associated with this entity's context,
        i. Draw a topic assignment, z, from the multinomial Theta(e).
        ii. Draw the word from the topic multinomial for topic z.
3.) For each pair of entities e and e',
    (a) Draw relationship topic proportions Theta(e,e') from a Dirichlet.
    (b) Draw selector proportions Pi(e,e') from a Dirichlet.
    (c) For each word associated with this entity pair's context,
        i. Draw a selector, c, from the multinomial Pi(e,e').
        ii. If c = 1,
            A. Draw topic assignment z from multinomial(Theta(e)).
            B. Draw the word from the topic distribution for entity e.
        iii. If c = 2,
            A. Draw topic assignment z from multinomial(Theta(e')).
            B. Draw the word from the topic distribution for entity e'.
        iv. If c = 3,
            A. Draw topic assignment z from multinomial(Theta(e,e')).
            B. Draw the word from the topic distribution for the pair (e, e').
A handy way to think about this, if you are familiar with LDA, is that steps 1, 2, and 3 (up to bullet i) are just the LDA model. An extra switching variable c has been added to handle the network data; c can take on only one of three values. The distribution over topics for a word in a relationship's context is conditioned on whether that word was generated by the first entity, the second entity, or the pair jointly. Thus, a relationship draws on three different distributions over topics: either entity's individual distribution or a joint mixture specific to the pair.
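To make the generative story concrete, the sketch below simulates it with NumPy. The dimensions and symmetric Dirichlet hyperparameters are arbitrary illustrative choices, not the paper's settings, and the entity/relationship topic counts are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    V = 1000                             # vocabulary size (illustrative)
    K_ent, K_rel = 10, 10                # entity topics, relationship topics
    alpha, beta, gamma = 0.1, 0.01, 1.0  # assumed symmetric hyperparameters

    # Step 1: per-topic word multinomials for entity and relationship topics
    ent_topics = rng.dirichlet(beta * np.ones(V), size=K_ent)
    rel_topics = rng.dirichlet(beta * np.ones(V), size=K_rel)

    # Step 2: each entity's topic proportions
    theta_e = rng.dirichlet(alpha * np.ones(K_ent))
    theta_e2 = rng.dirichlet(alpha * np.ones(K_ent))

    def draw_entity_word(theta):
        z = rng.choice(K_ent, p=theta)         # 2(b)i: topic assignment
        return rng.choice(V, p=ent_topics[z])  # 2(b)ii: word from topic z

    # Step 3: relationship topic proportions and the selector
    theta_pair = rng.dirichlet(alpha * np.ones(K_rel))
    pi = rng.dirichlet(gamma * np.ones(3))     # selector over {e, e', (e, e')}

    def draw_pair_word():
        c = rng.choice(3, p=pi)                # 3(c)i: which source generates it
        if c == 0:                             # c = 1 in the list: entity e
            return draw_entity_word(theta_e)
        if c == 1:                             # c = 2: entity e'
            return draw_entity_word(theta_e2)
        z = rng.choice(K_rel, p=theta_pair)    # c = 3: the relationship itself
        return rng.choice(V, p=rel_topics[z])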
Experimental Results
The authors evaluate their method both quantitatively and qualitatively. Qualitatively, they show the top words learned in the topics for an entity. Quantitatively, they look at predictive log likelihood for entity prediction and relation prediction, comparing against three alternative models: LDA, the Author-Topic Model, and a unigram model.
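As a rough illustration of the quantitative criterion, the predictive log likelihood of held-out words under a learned topic mixture can be computed as follows (a generic sketch of the metric, not the paper's exact evaluation protocol):

    import numpy as np

    def predictive_log_likelihood(word_ids, theta, topics):
        """Log probability of held-out words under a topic mixture.

        `theta` holds topic proportions (length K) and `topics` is a
        K x V matrix of per-topic word probabilities, so that
        p(w) = sum_k theta[k] * topics[k, w].
        """
        word_probs = theta @ topics  # marginal probability of each word type
        return float(np.sum(np.log(word_probs[word_ids])))

Higher (less negative) values indicate that a model assigns more probability to the held-out data.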