Proposal Draft Yanbo Ming


Team Members

Yanbo Xu (yanbox)

Ming Sun (mings)

Overview

A tentative idea for this project is to investigate how research topics have shifted among the faculty at LTI over the last ten to fifteen years. Specifically, it aims to reveal: 1) which research areas a faculty member has been interested in and, more importantly, how those interests change over time; 2) how collaboration between faculty members has grown, so that an academic social network can be constructed from people's shared research interests. A more ambitious goal would be to predict which research areas a professor will move into, or which new co-author relationships will be established in the future.

Dataset

  • A "low-hanging fruit" dataset is the ACL papers (2000-2008).
  • Web-crawled data of LTI faculty members' publications.

An Outline of Possible Tasks

Task 1: Web crawling and data preprocessing

  • Crawl lti.cs.cmu.edu and collect all faculty members' homepages;
  • Extract the .pdf files of their publications from the past 15 years;
  • Segment the .pdf papers into titles, abstracts, bodies and references;
  • Disambiguate author names;
  • Remove stop words, format the data, ...
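
The last two preprocessing steps could be sketched roughly as follows. This is a minimal illustration, not a settled design: the stop-word list is a tiny placeholder, and `tokenize` and `author_key` are hypothetical helper names. The name-matching rule (last name plus first initial) is a crude assumption that will merge distinct people who share both.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "we", "on", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def author_key(name):
    """Map a name variant to a (last name, first initial) key.

    Handles "First Last", "Last, First" and "F. Last". This is a crude
    heuristic: people who share a last name and first initial will collide.
    """
    name = name.strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    return (last.lower(), first[0].lower())
```

With this rule, "Yanbo Xu", "Xu, Yanbo" and "Y. Xu" all map to the same key, so their papers would be grouped under one author. Anything more reliable (for example, using co-authors or affiliations as evidence) is left open here.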

Task 2: Online Author Topic Modeling

  • The Hierarchical Dirichlet Process (HDP) has been applied in topic modeling to decide the number of topics in a collection of texts automatically. Its underlying idea, the Chinese Restaurant Process (CRP), can also be used to discover new topics and revisit old ones as texts arrive in sequence. The Author Topic Model (ATM), on the other hand, explicitly relates topics to document authors. An online inference procedure that integrates the CRP into the ATM should therefore be able to detect newly emerging topics and topic-shifting trends, not only for the entire collection but also for a specific author.
  • More thought is needed on how to set up a supervised approach to predicting future topics.
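
To make the CRP idea concrete, here is a minimal sketch of its sequential assignment rule: an incoming document joins existing topic k with probability proportional to the number of documents n_k already in it, or opens a new topic with probability proportional to a concentration parameter alpha. The function names and the `seed` argument are assumptions for illustration. The sketch samples from the CRP prior only; an actual online ATM would also weight each option by how well the document's words and authors fit the topic.

```python
import random


def crp_probabilities(counts, alpha):
    """Return the CRP prior over existing topics plus one new topic.

    counts[k] is the number of documents already assigned to topic k;
    the last entry of the result is the probability of opening a new topic.
    """
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]


def crp_assign(n_items, alpha, seed=0):
    """Assign n_items documents to topics one at a time using the CRP prior."""
    rng = random.Random(seed)
    counts = []        # documents per topic so far
    assignments = []   # topic index chosen for each document, in arrival order
    for _ in range(n_items):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # the document opens a new topic
        else:
            counts[k] += 1     # the document joins an existing topic
        assignments.append(k)
    return assignments, counts
```

A larger alpha makes new topics more likely, so the number of topics grows with the data instead of being fixed in advance, which is the property the online setting relies on.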

Task 3: Visualization

  • Visualize the outputs of Task 2, e.g. how research topics at LTI have trended over the past decade, or who dominated a particular research area in the past and who has since entered it and started to compete with them.
  • Visualize how the social network changes based on faculty members' shifting research interests (or co-author relationships?).
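
If the co-author variant of the network is chosen, the data behind such a visualization could be prepared along these lines. This is a sketch under assumptions: the input format (a list of `(year, authors)` records) and both function names are hypothetical, and the author labels in the usage note are placeholders, not real data.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_edges_by_year(papers):
    """Count co-authored papers per author pair and year.

    papers is an iterable of (year, [author, ...]) records.
    Returns {year: {(author_a, author_b): paper_count}} with each pair sorted.
    """
    edges = defaultdict(lambda: defaultdict(int))
    for year, authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[year][(a, b)] += 1
    return {year: dict(pairs) for year, pairs in edges.items()}


def new_collaborations(edges_by_year):
    """For each year, list the author pairs that never co-authored before it."""
    seen = set()
    result = {}
    for year in sorted(edges_by_year):
        pairs = set(edges_by_year[year])
        result[year] = sorted(pairs - seen)
        seen |= pairs
    return result
```

For example, with the placeholder records `[(2005, ["A", "B"]), (2006, ["A", "B"]), (2006, ["C", "D"])]`, the pair A and B appears as an edge in both years but is reported as new only in 2005, while C and D is new in 2006. The per-year edge counts can then be handed to whatever graph drawing tool is chosen.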