The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email

Citation

McCallum, A., Corrada-Emmanuel, A., and Wang, X. The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, 2004. Technical Report UM-CS-2004-096.

Online version

The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email

Summary

Consider the problem of modeling a company's email network. Suppose Michael is a boss, Pam is his assistant, and both email a similar set of people. If we consider only the structure of this email network, Michael and Pam would be assigned similar roles. Their distinct roles as boss and assistant become clear only when we consider the language content of the emails they send.

This paper builds on this idea, incorporating language content (topics) into traditional social network analysis, which considers only network structure. The authors extend the Author-Topic model, in which each author has a topic distribution (a distribution over topics, where each topic is itself a distribution over words). In the Author-Recipient-Topic model presented in this paper, there is instead a topic distribution for each author-recipient pair.

We can marginalize over the recipient to see the topics a person is likely to send, or over the author to see the topics a person is likely to receive. This person-conditioned topic distribution can be used to calculate similarity between people.
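As a rough illustration of the marginalization and similarity step, here is a minimal sketch in Python. It assumes the per-pair topic distributions have already been estimated. The names `theta`, `pair_weight`, `sender_topic_distribution`, and `js_divergence` are hypothetical, as are the recipient labels and all numbers. Weighting each pair by how much the author wrote to that recipient is an assumption, and Jensen-Shannon divergence is used only as one reasonable distance between distributions, since this summary does not state which measure is used.

```python
import math

def sender_topic_distribution(theta, pair_weight, author):
    """Mix an author's per-recipient topic distributions into one distribution.

    theta[(a, r)]       -- topic probabilities for author a writing to recipient r
    pair_weight[(a, r)] -- assumed weight of the pair, e.g. words a sent to r
    """
    pairs = [pair for pair in theta if pair[0] == author]
    total = sum(pair_weight[pair] for pair in pairs)
    if total <= 0:
        raise ValueError(f"no weighted pairs for author {author!r}")
    num_topics = len(theta[pairs[0]])
    mixed = [0.0] * num_topics
    for pair in pairs:
        share = pair_weight[pair] / total  # fraction of the author's text sent to this recipient
        for t, prob in enumerate(theta[pair]):
            mixed[t] += share * prob
    return mixed

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log); 0 means identical distributions."""
    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy data with two topics; Michael and Pam write to the same two recipients.
theta = {
    ("michael", "r1"): [0.9, 0.1],
    ("michael", "r2"): [0.7, 0.3],
    ("pam", "r1"): [0.2, 0.8],
    ("pam", "r2"): [0.1, 0.9],
}
pair_weight = {
    ("michael", "r1"): 100, ("michael", "r2"): 100,
    ("pam", "r1"): 50, ("pam", "r2"): 150,
}

michael = sender_topic_distribution(theta, pair_weight, "michael")
pam = sender_topic_distribution(theta, pair_weight, "pam")
role_distance = js_divergence(michael, pam)
```

In this toy setup the two people share the same recipients, so a purely structural view would treat them alike, while `role_distance` is clearly above zero because their sent topics differ.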

A comparison of the Author-Topic (AT) model and the Author-Recipient-Topic (ART) model is shown below. Note that in the AT model there is a separate topic distribution, $\theta$, for each author $a$, whereas in the ART model there is a $\theta$ for each pair of author $a$ and recipient $r$.

[Image: Art model comparison.png]
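To make the difference concrete, the following is a small sketch of how a single email could be generated under the ART model, following the usual description of its generative process: for each word, a recipient is picked uniformly from the email's recipients, a topic is drawn from the distribution for that author-recipient pair, and a word is drawn from that topic. The function name `generate_email_art`, the variables `theta` and `phi`, and all names, words, and probabilities are hypothetical and chosen only for illustration.

```python
import random

def generate_email_art(author, recipients, theta, phi, num_words, rng):
    """Sample (recipient, topic, word) triples for one email under an ART-style process.

    theta[(a, r)] -- topic probabilities for author a writing to recipient r
    phi[z]        -- dict mapping each word to its probability under topic z
    """
    triples = []
    for _ in range(num_words):
        x = rng.choice(recipients)                    # recipient chosen uniformly per word
        topic_probs = theta[(author, x)]              # distribution specific to the (author, recipient) pair
        z = rng.choices(range(len(topic_probs)), weights=topic_probs)[0]
        vocab, word_probs = zip(*phi[z].items())
        w = rng.choices(vocab, weights=word_probs)[0]  # word drawn from the chosen topic
        triples.append((x, z, w))
    return triples

# Toy parameters: topic 0 is about scheduling, topic 1 about finance.
phi = [
    {"lunch": 0.5, "calendar": 0.5},
    {"budget": 0.5, "merger": 0.5},
]
theta = {
    ("michael", "pam"): [1.0, 0.0],    # Michael only discusses scheduling with Pam
    ("michael", "board"): [0.0, 1.0],  # and only finance with the board
}

sample = generate_email_art("michael", ["pam", "board"], theta, phi, 8, random.Random(0))
```

Under the AT model the lookup would instead be keyed by author alone, so Michael's words to Pam and to the board would share one topic distribution.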


Results

The authors conduct a qualitative analysis of the Author-Recipient-Topic (ART) model on the Enron email corpus and the McCallum email corpus, comparing it against the Author-Topic (AT) model and a stochastic block model (SNA). Overall, the authors argue that the ART model is more appropriate than the AT and SNA models.

For example, the table below shows the most similar pairs of people computed for the McCallum email corpus. In general, the pairs predicted by the ART model look reasonable, while those predicted by the SNA model look less plausible.

[Image: Art model.png]

Discussion

This model is limited in that it must be re-estimated over the entire network whenever new nodes or edges are observed.

Related papers

Study plan

Much of this paper is self-explanatory, assuming that the reader is familiar with topic models in general.