== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Article Selection in English ==

''Also mentioned in the lecture on 1-23-2013.''
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language with probabilistic models has shown promising results. However, the proposed models rely on a set of strong features. Given the large amount of data available these days, it becomes possible to construct a probabilistic model that does not rely on strong features.
Article selection in English is one of the most challenging learning tasks for second language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model for article selection. For example, you can use a sliding window of words around the article as the features, and train a classifier on the article selection task. The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
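As a concrete illustration, the sliding-window idea above can be sketched as a small naive Bayes classifier. The toy sentences, window size, and feature names below are illustrative assumptions only; a real project would train on a large corpus such as Google n-grams.

```python
# Sketch: predict which article (a/an/the) fits a slot, using only the
# k words on either side of it as features. Toy data for illustration.
import math
from collections import defaultdict

ARTICLES = ("a", "an", "the")

def window_features(tokens, i, k=2):
    """The k words on each side of position i, tagged by their offset."""
    feats = []
    for off in range(1, k + 1):
        feats.append("L%d=%s" % (off, tokens[i - off] if i - off >= 0 else "<s>"))
        feats.append("R%d=%s" % (off, tokens[i + off] if i + off < len(tokens) else "</s>"))
    return feats

class NaiveBayesArticleModel:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def train(self, sentences, k=2):
        # Every occurrence of an article becomes one training example.
        for sent in sentences:
            toks = sent.lower().split()
            for i, t in enumerate(toks):
                if t in ARTICLES:
                    self.class_counts[t] += 1
                    for f in window_features(toks, i, k):
                        self.feat_counts[t][f] += 1

    def predict(self, tokens, i, k=2):
        # Naive Bayes with add-one smoothing over the window features.
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c in ARTICLES:
            lp = math.log((self.class_counts[c] + 1) / (total + len(ARTICLES)))
            for f in window_features(tokens, i, k):
                lp += math.log((self.feat_counts[c][f] + 1) /
                               (self.class_counts[c] + 2))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

model = NaiveBayesArticleModel()
model.train(["he ate an apple", "she read the book"])
```

The model ignores the article token itself and conditions only on its context window, which is exactly the feature representation the project description suggests.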
Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article), or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.
== Network Analysis of Congressional Tweets ==

Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described [[File:README_twt_cgrs_data.txt|in this file]]. There are a number of interesting things that could be done with this corpus, such as:

* Link prediction: predict when new links are formed in the network
* Classification: predict some properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what else can you think of?
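To make the link-prediction idea concrete, here is a minimal common-neighbors baseline on a toy graph; the node names and edges are made up for illustration, and in the actual project the edges would be extracted from the tweet corpus (e.g. who mentions or replies to whom).

```python
# Sketch: rank currently-absent node pairs by how many neighbors they
# share; high-scoring pairs are predicted to become links. Toy graph.
from itertools import combinations

edges = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")}

# Build an undirected adjacency map from the edge set.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def common_neighbors(u, v):
    return len(adj.get(u, set()) & adj.get(v, set()))

# All pairs that are not yet linked, ranked by common-neighbor count.
candidates = [
    (u, v) for u, v in combinations(sorted(adj), 2)
    if (u, v) not in edges and (v, u) not in edges
]
ranked = sorted(candidates, key=lambda p: -common_neighbors(*p))
```

Common neighbors is only a baseline; a course project would compare it against richer features (text similarity of the two congresspeople's tweets, shared party, etc.) inside a trained classifier.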