Ideas for course projects for 10-605

From Cohen Courses
Revision as of 11:57, 23 January 2013 by Wcohen

Article Selection for English

Also mentioned in the lecture on 1-23-2013.

Despite the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language skills with probabilistic models has shown promising results. However, the models proposed so far rely on a set of strong features. Given the large amount of data now available, it becomes possible to build a probabilistic model that does not rely on strong features.

Article selection in English is one of the most challenging learning tasks for second language learners. The full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model of article selection. For example, you can use a sliding window of words around the article as features, and train a classifier on the article selection task. The interesting question is this: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture them?
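As a rough illustration of the sliding-window idea, the sketch below turns each article occurrence into positional word features and trains a small multinomial Naive Bayes classifier on them. Everything here is an assumption made for illustration: the names `window_features`, `extract_examples`, and `NaiveBayes`, the window width of 2, whitespace tokenization, and the choice of Naive Bayes itself (the project description does not prescribe a classifier). The sketch also ignores the "no article" case, which a real system would need to handle.

```python
import math
from collections import Counter, defaultdict

ARTICLES = ("a", "an", "the")  # assumed label set; the zero article is ignored


def window_features(tokens, i, width=2):
    """Positional features for the words around position i (the article slot)."""
    feats = []
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # the article itself is the label, not a feature
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.append(f"w[{offset:+d}]={word}")
    return feats


def extract_examples(sentence, width=2):
    """Return one (features, article) pair per article in the sentence."""
    tokens = sentence.lower().split()  # naive whitespace tokenization
    return [(window_features(tokens, i, width), tok)
            for i, tok in enumerate(tokens) if tok in ARTICLES]


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)
            # +1 in the vocabulary size reserves mass for unseen features
            denom = (sum(self.feat_counts[label].values())
                     + self.alpha * (len(self.vocab) + 1))
            for f in feats:
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


if __name__ == "__main__":
    corpus = [
        "she ate an apple today",
        "he ate an orange today",
        "she saw the dog today",
        "he saw the cat today",
    ]
    examples = [ex for sent in corpus for ex in extract_examples(sent)]
    model = NaiveBayes().fit(examples)
    feats, gold = extract_examples("they ate an egg today")[0]
    print(gold, model.predict(feats))
```

At the scale this project has in mind, the in-memory counters would likely be replaced by a streaming or map-reduce style count aggregation, but the feature design would stay the same.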

Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article) or, if you use parsed text, the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus). Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. Nan Li is interested in working with 10-605 students on this project.
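If you take the n-gram route, one possible preprocessing step is to keep only the n-grams whose middle token is an article and to aggregate their counts by the surrounding context. The sketch below assumes a simplified input format of one `w1 w2 w3 w4 w5<TAB>count` line per n-gram; the actual layout of the Google n-gram files may differ, and the function names `ngram_to_example` and `aggregate` are hypothetical.

```python
from collections import Counter

ARTICLES = ("a", "an", "the")  # assumed label set


def ngram_to_example(line):
    """Parse an assumed 'tokens<TAB>count' line.

    Returns (context, article, count) when the n-gram has odd length and an
    article in the middle slot, otherwise None.
    """
    text, _, count = line.rstrip("\n").rpartition("\t")
    tokens = text.lower().split()
    mid = len(tokens) // 2
    if not text or len(tokens) % 2 == 0 or tokens[mid] not in ARTICLES:
        return None
    # Replace the article with a placeholder so contexts can be compared
    context = tuple(tokens[:mid] + ["_"] + tokens[mid + 1:])
    return context, tokens[mid], int(count)


def aggregate(lines):
    """Map each context to a Counter of article -> total count."""
    table = {}
    for line in lines:
        parsed = ngram_to_example(line)
        if parsed is None:
            continue
        context, article, count = parsed
        table.setdefault(context, Counter())[article] += count
    return table


if __name__ == "__main__":
    sample = [
        "she ate an apple today\t120",
        "she ate the apple today\t30",
        "dogs bark at night often\t7",
    ]
    for context, counts in aggregate(sample).items():
        print(" ".join(context), dict(counts))
```

The resulting per-context counts can serve directly as weighted training examples, so that a frequent n-gram contributes its count rather than a single observation.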