Rbosaghz project progress

From Cohen Courses
  • What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it. What did you do to get acquainted with the data?

I scraped the full set of grants available at http://www.nsf.gov/awardsearch/ and, for each distinct Principal Investigator who received a grant after 2006, extracted the institution associated with the grant. This institution can be trusted to be the current institution of the Principal Investigator.

Some interesting information about the Principal Investigators:

There are 1361 in total, and the histogram for the top 30 institutions is:

Organization                                               Count
Carnegie-Mellon University                                 85
University of Illinois at Urbana-Champaign                 55
Massachusetts Institute of Technology                      45
University of California-Berkeley                          39
University of Washington                                   35
Stanford University                                        32
University of Texas at Austin                              30
GA Tech Research Corporation - GA Institute of Technology  30
University of California-San Diego                         27
University of Maryland College Park                        27

This means that CMU had more researchers with NSF grants in the past 3 years than any of the other big CS schools (85, against 55 for the next institution on the list), which is nice to know!

The data is a list of pairs of the form (researcher name, researcher organization), which is exactly what I need. I also have their titles (Associate/Assistant Professor, etc.), but those will be for an extension of the project.
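A minimal sketch of how a per-institution count like the one above could be computed from the (researcher name, researcher organization) pairs. The names Record and TopOrganizations are illustrative assumptions, not the code actually used for the scrape, and the tie-breaking rule (alphabetical) is also an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One record per distinct Principal Investigator: (name, organization).
// "Record" is a hypothetical name chosen for this sketch.
using Record = std::pair<std::string, std::string>;

// Counts researchers per organization and returns at most `top_n` entries,
// sorted by descending count. Ties are broken alphabetically (an assumption).
std::vector<std::pair<std::string, int>> TopOrganizations(
    const std::vector<Record>& records, std::size_t top_n) {
  std::map<std::string, int> counts;
  for (const Record& r : records) {
    ++counts[r.second];  // r.second is the organization
  }

  std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
  std::sort(ranked.begin(), ranked.end(),
            [](const std::pair<std::string, int>& a,
               const std::pair<std::string, int>& b) {
              if (a.second != b.second) return a.second > b.second;
              return a.first < b.first;
            });

  if (ranked.size() > top_n) ranked.resize(top_n);
  return ranked;
}
```

Calling TopOrganizations(records, 30) on the full list of pairs would give the kind of ranking shown in the table, assuming each researcher appears only once in the list.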

  • Do you plan on looking at the same problem, or have you changed your plans?

I still plan on looking at the same problem. Namely, given only a researcher's name, I want to find their institution using web data, especially Google results. First I will try to find their webpage, then from the webpage find their current institution.

  • If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?

The project will be done almost entirely in C++. I have written the code for fetching Google results, and am at the stage where I need to look at the list of results and decide which one is the homepage.
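One possible way to approach the "which result is the homepage" step is a simple hand-written scoring heuristic, sketched below. This is only an illustration under my own assumptions, not the method the project has settled on: the SearchResult struct, the function names, and the particular cues and weights (a ".edu" host, a "/~" path, the last name appearing in the URL or title) are all hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one search result (URL plus page title).
struct SearchResult {
  std::string url;
  std::string title;
};

// Returns a lowercase copy of `s` (ASCII only).
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

bool Contains(const std::string& haystack, const std::string& needle) {
  return haystack.find(needle) != std::string::npos;
}

// Scores a result; higher means "looks more like a personal homepage".
// The cues and weights are assumptions chosen for illustration.
int ScoreResult(const SearchResult& r, const std::string& last_name) {
  const std::string url = ToLower(r.url);
  const std::string title = ToLower(r.title);
  const std::string name = ToLower(last_name);
  int score = 0;
  if (Contains(url, ".edu")) score += 2;   // academic host
  if (Contains(url, "/~")) score += 2;     // classic user-directory path
  if (Contains(url, name)) score += 1;     // last name in the URL
  if (Contains(title, name)) score += 1;   // last name in the page title
  return score;
}

// Returns the index of the best-scoring result, or -1 for an empty list.
// On ties the earlier result wins, which preserves the search ranking.
int PickHomepage(const std::vector<SearchResult>& results,
                 const std::string& last_name) {
  int best = -1;
  int best_score = -1;
  for (std::size_t i = 0; i < results.size(); ++i) {
    const int s = ScoreResult(results[i], last_name);
    if (s > best_score) {
      best_score = s;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

A learned classifier over the same cues would be a natural replacement once labeled examples are available; the heuristic is just a cheap starting point.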

  • If you plan on using off-the-shelf code, what have you installed, what experiences have you had with it?

I plan on using CRF++ (http://crfpp.sourceforge.net/) for the labeling of the webpage text. I've used it in the past without much hassle.
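CRF++ reads training data as one token per line with whitespace-separated columns, the last column being the label, and an empty line between sequences. The sketch below shows how webpage tokens could be written in that layout. The LabeledToken struct, the function name, the CAP/NOCAP capitalization feature, and the ORG/O tag set are assumptions made for illustration, not the feature set chosen for the project.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token type: `text` must not contain whitespace, and
// `label` comes from an assumed tag set such as "ORG" / "O".
struct LabeledToken {
  std::string text;
  std::string label;
};

// Formats one sequence in the CRF++ column layout:
//   <token> TAB <capitalization feature> TAB <label>
// followed by an empty line that terminates the sequence.
std::string ToCrfppSequence(const std::vector<LabeledToken>& tokens) {
  std::string out;
  for (const LabeledToken& t : tokens) {
    const bool capitalized =
        !t.text.empty() &&
        std::isupper(static_cast<unsigned char>(t.text[0])) != 0;
    out += t.text;
    out += '\t';
    out += capitalized ? "CAP" : "NOCAP";
    out += '\t';
    out += t.label;
    out += '\n';
  }
  out += '\n';  // blank line separates sequences
  return out;
}
```

Every line must have the same number of columns, so any extra features (for example, whether the token appears in a list of known institutions) would be added as further columns before the label.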

  • If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?

No baseline yet.