Ideas for course projects for 10-605


Article Selection for English

Also mentioned in the lecture on 1-23-2013.

Despite the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language skill with probabilistic models has shown promising results. However, the models proposed so far rely on a set of strong features. Given the large amount of data now available, it becomes possible to construct a probabilistic model that does not rely on strong features.

Article selection in English is one of the most challenging learning tasks for second language learners. The full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model of article selection. For example, you can use the words in a sliding window around the article as features, and train a classifier on the article selection task. The interesting question is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture them?
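The sliding-window idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in any prior work: the window size `k`, the feature-string format, the restriction to the three labels a/an/the (ignoring the "no article" case), and the choice of a small add-one-smoothed Naive Bayes classifier are all assumptions made for the example. The names `window_examples` and `NaiveBayes` are hypothetical.

```python
import math
from collections import Counter, defaultdict

ARTICLES = {"a", "an", "the"}  # assumed label set; "no article" is ignored here


def window_examples(tokens, k=2):
    """Yield (features, label) for every article in a tokenized sentence.

    Features are the k words on each side of the article, tagged with
    their offset, e.g. "w[-1]=saw" or "w[+1]=cat".
    """
    padded = ["<s>"] * k + [t.lower() for t in tokens] + ["</s>"] * k
    for i in range(k, len(padded) - k):
        if padded[i] in ARTICLES:
            feats = [
                f"w[{off:+d}]={padded[i + off]}"
                for off in range(-k, k + 1)
                if off != 0
            ]
            yield feats, padded[i]


class NaiveBayes:
    """Multinomial Naive Bayes over string features, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, feats, label):
        # Counts can be accumulated in one streaming pass over the data.
        self.label_counts[label] += 1
        for f in feats:
            self.feat_counts[label][f] += 1
            self.vocab.add(f)

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy usage: train on a handful of sentences, then predict from one context word.
model = NaiveBayes()
for sentence in ["she ate an apple", "he ate an orange", "she saw the dog"]:
    for feats, label in window_examples(sentence.split()):
        model.update(feats, label)
prediction = model.predict(["w[-1]=ate"])
```

Because training only accumulates counts, the same structure should carry over to a much larger corpus, although a real project would likely need a more scalable learner and feature representation.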

Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. Nan Li is interested in working with 10-605 students on this project.

Network Analysis of Congressional Tweets

Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described in File:README twt cgrs data.txt. There are a number of interesting things that could be done with this, such as:

  • Link prediction - predict when new links are formed in the network
  • Classification - predict some properties of a congressperson (party, region, above/below median funding from various industries, etc.) from network properties, text, or some combination of these
  • ... what can you think of?
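
As one possible starting point for the link prediction idea, the sketch below ranks currently unlinked pairs of accounts by the Jaccard overlap of their neighbor sets, a common baseline. It assumes the corpus can be reduced to an undirected edge list of (account, account) pairs, which is an assumption about the data rather than something stated in the README; the function names and the toy edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs, skipping self-loops."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs


def jaccard_scores(nbrs):
    """Rank every currently unlinked pair by Jaccard overlap of neighbor sets.

    Returns a list of ((u, v), score), highest score first; ties are
    broken alphabetically so the ranking is deterministic.
    """
    scores = {}
    for u, v in combinations(sorted(nbrs), 2):
        if v in nbrs[u]:
            continue  # already linked, nothing to predict
        union = nbrs[u] | nbrs[v]
        if union:
            scores[(u, v)] = len(nbrs[u] & nbrs[v]) / len(union)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy usage with made-up account names.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
ranked = jaccard_scores(build_graph(toy_edges))
```

Scoring all pairs is quadratic in the number of accounts, which is fine for a network the size of Congress; evaluating the ranking would require holding out links by time, which this sketch does not do.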