Rbosaghz project abstract

From Cohen Courses

Mid-term Report

See Progress report

Motivation

Given a listing of NSF grants, one can predict the success of a grant, as measured in various ways (say, by the number of publications produced by the grant). As a feature for this prediction, we would like to use an indicator variable that fires when a Co-Principal Investigator (Co-PI) involved in the grant moves institutions during the life of the grant. For this reason, we need the current institution of each PI.

Introduction

The problem is simple to state, but hard to solve: given a researcher's full name, find their current affiliated institution from web results. The recency of the result is particularly important, as explained in the Motivation section.

I will be working on this project without collaborators.

Problem Breakdown

This problem can be broken down into two major tasks:

  • Given Google query results for a researcher's name, identify the homepage of the researcher
  • Once the homepage has been found, extract the institution name from the homepage contents

The two problems can be solved individually:

  • To determine which Google result is the personal homepage, I currently plan to use heuristics that look at the URL, but this will likely change into a more complicated model once I see how well the heuristics work. Example heuristic: "the URL's domain is .edu and the URL contains a tilde (~)"
  • To extract the current institution given the homepage, I plan on training a Conditional Random Field (CRF) using the supervised data collected as described in the Training Data section.
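The example heuristic for the first task can be sketched in a few lines of Python. This is only a minimal illustration of the ".edu domain plus tilde" rule quoted above; the function names `looks_like_homepage` and `pick_homepage` are hypothetical, and choosing the first matching result is an assumption, not a decided design.

```python
from urllib.parse import unquote, urlparse


def looks_like_homepage(url: str) -> bool:
    """Example heuristic: the host ends in .edu and the path contains a tilde."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Decode the path so an escaped tilde (%7E) is also recognised.
    path = unquote(parsed.path)
    return host.endswith(".edu") and "~" in path


def pick_homepage(result_urls):
    """Return the first result passing the heuristic, or None (assumed policy)."""
    for url in result_urls:
        if looks_like_homepage(url):
            return url
    return None
```

For instance, `pick_homepage` would skip a result on a .com domain and return the first .edu URL whose path contains a tilde, if the result list has one.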

Training Data

Supervised training data can be obtained in this way:

  • Crawl NSF award search and obtain a list of current NSF grants, awarded in the past 2 years.
  • Treat the institution affiliated with each grant as the Principal Investigator's current institution.
  • Repeat this over many grants to get a dataset of (Researcher, Institution) pairs, to be used in supervised training.
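One possible way to turn a (Researcher, Institution) pair into a training sequence for the CRF is to mark every occurrence of the known institution name in the homepage text with B/I/O labels. The proposal does not specify this step, so the sketch below is an assumption: the tokenizer, the label names `B-INST` and `I-INST`, and the exact-match policy are all hypothetical choices.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bio_labels(page_text, institution):
    """Label tokens of page_text that match the institution name (case-insensitive)."""
    tokens = tokenize(page_text)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in tokenize(institution)]
    labels = ["O"] * len(tokens)
    n = len(target)
    if n == 0:
        return tokens, labels
    # Slide a window over the page and mark every exact token-level match.
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            labels[i] = "B-INST"
            for j in range(i + 1, i + n):
                labels[j] = "I-INST"
    return tokens, labels
```

Exact matching will miss abbreviations and alternative spellings of an institution name, so a real pipeline would likely need a looser matching rule.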

Given that I have super ninja skills crawling websites, I don't expect this data gathering to be difficult.

Evaluation

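Given supervised data in the form of (Researcher, Institution) pairs, I can report accuracy: the fraction of researchers for whom the extracted institution matches the institution in the pair.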
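As a minimal sketch of this evaluation, accuracy can be computed by comparing predicted and gold institutions per researcher. The dictionary layout, the function names, and the case- and whitespace-insensitive comparison are assumptions made for illustration; the proposal does not fix a matching criterion.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(name.lower().split())


def accuracy(predicted, gold):
    """Fraction of researchers in gold whose predicted institution matches.

    predicted and gold both map researcher name -> institution name.
    A missing or None prediction counts as wrong.
    """
    if not gold:
        return 0.0
    correct = 0
    for researcher, institution in gold.items():
        guess = predicted.get(researcher)
        if guess is not None and normalize(guess) == normalize(institution):
            correct += 1
    return correct / len(gold)
```

Researchers for whom no homepage is found still count in the denominator here, so failures of the homepage-identification step lower the reported accuracy.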

Potential Extensions

Given the extra information available about a researcher on NSF award search, it could be desirable to extract other pieces of information from the researcher's homepage, for example their title: "{Assistant, Associate, Full} Professor".