Proposal 2nd Draft Nitin Yandong Ming Yanbo
Modeling Academic Collaboration and Influence in Scholarly Literature
Team members
Nitin Agarwal
The Problem
The number of new research papers is growing rapidly, especially in computer science, which makes the literature hard to follow. Instead of reading every paper, we want computers to answer the following questions:
- Who to collaborate with?
- Which work to cite?
- Who should review this paper (for conference organizers)?
Essentially, we'd like to capture the interactions and relationships between people. In academia, these are mainly collaboration and citation. Existing approaches rely on content analysis, connectivity analysis, or both.
Related work
Author Topic Model
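The Author-Topic model [1] describes a generative process for the words of each document:
For each word in a document:
- Choose an author uniformly from the document's authors
- Choose a topic from that author's topic distribution
- Choose a word from that topic's word distribution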
The result includes a topic distribution for each author and a word distribution for each topic. One application suggested by the paper is to find related authors by computing the KL-divergence between the authors' topic distributions.
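The related-author idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the author names and topic distributions are made-up toy values, and using the symmetrized form of the KL-divergence (so that the score does not depend on argument order) is our assumption.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.
    eps is a small smoothing constant (an assumption) that avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def most_similar_authors(target, author_topics, k=3):
    """Return the k authors whose topic distributions are closest to target's."""
    scores = [(symmetric_kl(author_topics[target], dist), name)
              for name, dist in author_topics.items() if name != target]
    return [name for _, name in sorted(scores)[:k]]

# Hypothetical per-author topic distributions (three topics each).
author_topics = {
    "author_a": [0.7, 0.2, 0.1],
    "author_b": [0.6, 0.3, 0.1],
    "author_c": [0.1, 0.1, 0.8],
}
```

Calling `most_similar_authors("author_a", author_topics, k=1)` picks the author with the smallest divergence from `author_a`.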
Author-Recipient-Topic Model
The authors of this model [2] argue that nodes play different roles. In email data, for example, there are senders and receivers, and the model should treat them differently. Therefore, instead of modeling individuals, the model captures the pair relationship directly. An author and a set of recipients are observed, and topics are conditioned on the (author, recipient) pair.
As we can see, much previous work was based on either content analysis or graph connectivity analysis. Because the text carries rich information, we will use a topic model, and we will derive a hybrid model that uses both kinds of knowledge. Like the Author-Recipient-Topic model, we model the pair relationship directly, such as (author, author) or (author, citation).
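A minimal sketch of how topics can be conditioned on an (author, recipient) pair when generating a message. The function name, the dictionary layout of `theta` and `phi`, and the toy values are assumptions made for illustration; choosing the recipient uniformly for each word follows the usual description of this model.

```python
import random

def generate_message(author, recipients, theta, phi, n_words, rng):
    """Generate (recipient, topic, word) triples for one message.
    theta[(author, recipient)] is a topic distribution for that pair;
    phi[topic] maps each word to its probability under the topic."""
    tokens = []
    for _ in range(n_words):
        recipient = rng.choice(recipients)                 # uniform over recipients
        topic_dist = theta[(author, recipient)]            # pair-specific topics
        topic = rng.choices(range(len(topic_dist)), weights=topic_dist)[0]
        vocab, probs = zip(*phi[topic].items())
        word = rng.choices(vocab, weights=probs)[0]
        tokens.append((recipient, topic, word))
    return tokens

# Hypothetical toy parameters: two recipients, two topics.
theta = {("ann", "bob"): [0.9, 0.1], ("ann", "cat"): [0.2, 0.8]}
phi = [{"meeting": 0.7, "budget": 0.3}, {"paper": 0.6, "review": 0.4}]
message = generate_message("ann", ["bob", "cat"], theta, phi, 10, random.Random(0))
```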
Collaboration Influence Model
The proposed model is influenced by earlier topic models that uncover the social structure in text and discover latent topics conditioned on it. Its important characteristics are:
- A biased Bernoulli flip which favors collaboration over influence
- Dirichlet priors over topics
- Each word sampled has an author-pair label and a relation label
- The relation label specifies whether the word resulted from collaboration or influence
The aspects of a corpus of networked scientific articles that the Collaboration Influence Model tries to capture are the author network and the collaboration versus influence relations.
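One possible reading of the characteristics listed above, written as a generative sketch. Everything beyond those four characteristics is an assumption made for illustration: the function and parameter names, the value `p_collab=0.7` for the biased flip, the uniform choice of author and partner, and the Dirichlet prior over words.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_paper(authors, cited_authors, vocab, n_topics, n_words,
                   p_collab=0.7, alpha=0.5, beta=0.5, seed=0):
    """Return tokens of the form (word, (author, partner), relation, topic)."""
    if len(authors) < 2 or not cited_authors:
        raise ValueError("sketch assumes >= 2 authors and >= 1 cited author")
    rng = random.Random(seed)
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    theta = {}  # topic distribution per (author, partner, relation), drawn lazily
    tokens = []
    for _ in range(n_words):
        author = rng.choice(authors)
        # Biased Bernoulli flip that favors collaboration over influence.
        relation = "co-author" if rng.random() < p_collab else "influence"
        pool = ([a for a in authors if a != author]
                if relation == "co-author" else cited_authors)
        partner = rng.choice(pool)
        key = (author, partner, relation)
        if key not in theta:
            theta[key] = sample_dirichlet(alpha, n_topics, rng)  # Dirichlet prior
        topic = rng.choices(range(n_topics), weights=theta[key])[0]
        word = rng.choices(vocab, weights=phi[topic])[0]
        tokens.append((word, (author, partner), relation, topic))
    return tokens

tokens = generate_paper(["a1", "a2"], ["c1", "c2"], ["lda", "parse", "align"],
                        n_topics=2, n_words=2000)
```

Each generated word carries both an author-pair label and a relation label, which is the information inference would later have to recover from the observed text.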
Dataset
We will work with the ACL Anthology 2008 dataset (Radev et al.). Some important statistics of the dataset are:
- 13,739 papers from computational linguistics conferences
- 10,409 nodes in author citation network
- 195,504 edges in author citation network
- 10,409 nodes in author collaboration network
- 57,614 edges in author collaboration network
Application
By analyzing this collaboration and influence network, we can solve the following problems:
Who to collaborate with?
- Given a professor's name and his/her research topic, we want the computer to list the most likely candidates for him/her to collaborate with.
- This can be stated as
- Here, means the author of a paper. could be either a co-author or citee. This role depends on the parameter . We use 'co-author' and 'influence' to represent the co-authorship and citation. The research topic is denoted as .
Who to cite?
- Given a research topic, we want the computer to recommend a list of most influential author(s) in this area.
- This can be stated as .
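The two recommendation questions above can be sketched as ranking problems over the labels the model assigns to words. This is an illustrative assumption, not the proposal's formula: we suppose inference yields a hypothetical table `counts` mapping (author, partner, relation, topic) to the number of words with that label, and we score candidates by their normalized counts.

```python
from collections import defaultdict

def rank_partners(counts, author, relation, topic, k=5):
    """Rank partners of `author` for a given relation and topic."""
    scores = defaultdict(float)
    for (a, partner, r, t), n in counts.items():
        if a == author and r == relation and t == topic:
            scores[partner] += n
    return _normalize_and_sort(scores, k)

def rank_influential(counts, topic, k=5):
    """Rank authors by how often they are the 'influence' partner on a topic."""
    scores = defaultdict(float)
    for (_, partner, r, t), n in counts.items():
        if r == "influence" and t == topic:
            scores[partner] += n
    return _normalize_and_sort(scores, k)

def _normalize_and_sort(scores, k):
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n / total) for name, n in ranked[:k]]

# Hypothetical label counts for two topics.
counts = {
    ("alice", "bob", "co-author", 0): 8,
    ("alice", "carol", "co-author", 0): 2,
    ("alice", "dave", "influence", 0): 5,
    ("bob", "dave", "influence", 0): 3,
    ("bob", "erin", "influence", 0): 2,
    ("alice", "bob", "co-author", 1): 4,
}
```

With these toy counts, `rank_partners(counts, "alice", "co-author", 0)` lists candidate collaborators for one author, while `rank_influential(counts, 0)` aggregates over all authors to suggest whom to cite on a topic.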
What does the citation network look like?
- Given a research topic, how do we define the link between two authors?
- In other words, we would like to know: .
- Through the construction of links among authors, we can build up a network of a given topic.
- We can also observe or visualize the evolution of this network by adding temporal information.
- We can generalize this problem to the research collaboration and influence network without any specific topic by calculating
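A sketch of how such a network could be assembled, under stated assumptions: link strength is taken to be the normalized number of words labeled with an (author, partner, relation) triple, the `counts` table is hypothetical, and the topic-free network is obtained by summing over topics. None of these choices is fixed by the proposal.

```python
def build_network(counts, topic=None, min_weight=0.0):
    """Return {(author, partner, relation): weight} for one topic,
    or for all topics combined when topic is None."""
    weights = {}
    for (author, partner, relation, t), n in counts.items():
        if topic is None or t == topic:
            edge = (author, partner, relation)
            weights[edge] = weights.get(edge, 0) + n
    total = sum(weights.values())
    if total == 0:
        return {}
    # Keep only edges whose share of the total reaches the threshold.
    return {edge: n / total for edge, n in weights.items()
            if n / total >= min_weight}

# Hypothetical label counts: (author, partner, relation, topic) -> words.
toy_counts = {
    ("alice", "bob", "co-author", 0): 5,
    ("alice", "dave", "influence", 0): 3,
    ("bob", "dave", "influence", 0): 2,
    ("alice", "bob", "co-author", 1): 10,
}
topic0_network = build_network(toy_counts, topic=0)
overall_network = build_network(toy_counts)
```

Running the same function on counts restricted to one year at a time would give the snapshots needed to visualize how the network evolves.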
Temporal Analysis
- Select the papers written by LTI faculty
- Apply the model to analyze the historical research trends at LTI
- Train the model on the entire collection
- Find out the total number of topic clusters
- Apply the model to each year's data
- Use KL-divergence to map each topic to the topic clusters
- Examine the strength of each topic over time at LTI
- Examine the research trend of an LTI faculty member
- Analyze the academic collaboration and influence network at LTI
- Given a research interest, visualize how the collaboration and influence network changes over time.
- Without specifying a research area, visualize how collaboration among LTI faculty changes and what the influence trend looks like over the past decade.
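The step that maps each year's topics to the global topic clusters with KL-divergence, and then tracks topic strength over time, can be sketched as follows. The data layout (a word distribution plus a weight per yearly topic) and the toy numbers are assumptions for illustration only.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def map_to_clusters(year_topics, global_topics):
    """For each yearly topic, return the index of the closest global topic."""
    return [min(range(len(global_topics)), key=lambda g: kl(t, global_topics[g]))
            for t in year_topics]

def topic_strength_by_year(yearly, global_topics):
    """yearly: {year: [(word_distribution, weight), ...]}.
    Returns {year: [strength of each global topic]}."""
    strength = {}
    for year, topics in yearly.items():
        row = [0.0] * len(global_topics)
        mapping = map_to_clusters([dist for dist, _ in topics], global_topics)
        for g, (_, weight) in zip(mapping, topics):
            row[g] += weight
        strength[year] = row
    return strength

# Hypothetical word distributions over a three-word vocabulary.
global_topics = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
yearly = {
    2006: [([0.7, 0.2, 0.1], 0.6), ([0.2, 0.1, 0.7], 0.4)],
    2007: [([0.1, 0.2, 0.7], 0.9), ([0.75, 0.15, 0.1], 0.1)],
}
strength = topic_strength_by_year(yearly, global_topics)
```

In this toy example the first global topic weakens from 2006 to 2007 while the second one grows, which is the kind of trend the analysis is meant to expose.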
Software
We will build on an existing C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference. We will modify this package to implement our proposed Collaboration Influence Model.
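For orientation, here is a compact collapsed Gibbs sampler for plain LDA. It is a Python sketch written for brevity, not code from the C/C++ package; the function name, hyperparameter defaults, and toy corpus are our assumptions. The Collaboration Influence Model would extend the sampled state from a topic per word to a topic, author pair, and relation per word.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """docs: list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic and per-topic word distributions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Full conditional of the topic given all other assignments.
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (topic_total[k] + vocab_size * beta)
            for c in topic_word[k]] for k in range(n_topics)]
    return theta, phi

# Hypothetical toy corpus over a vocabulary of four word ids.
toy_docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=4, n_iters=50)
```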
References
[1] ROSEN-ZVI, M., GRIFFITHS, T., STEYVERS, M. and SMYTH, P. (2004). The author-topic model for authors and documents. In AUAI’04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence 487–494. AUAI Press, Arlington, VA.
[2] MCCALLUM, A., CORRADA-EMMANUEL, A. and WANG, X. (2004). The author–recipient–topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Univ. Massachusetts, Amherst.