Ideas for course projects for 10-605

== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Article Selection for English ==

''Also mentioned in the lecture on 1-23-2013.''
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language with probabilistic models has shown promising results; however, the proposed models rely on a set of strong features. Given the large amount of data available today, it becomes possible to construct a probabilistic model that does not rely on strong features.
 
 
 
Article selection in English is one of the most challenging learning tasks for second language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model for article selection. For example, you can use a sliding window of words around the article as features and train a classifier on the article selection task. The interesting question is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
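As a concrete illustration of the sliding-window idea, here is a minimal sketch in Python using scikit-learn. The toy sentences, window size, and all function names are illustrative assumptions, not part of any course-provided code or dataset.

<pre>
# Minimal sketch: sliding-window features around each article, plus a
# multiclass classifier over {a, an, the}. All names here are hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

ARTICLES = {"a", "an", "the"}

def window_features(tokens, i, k=3):
    """Features: the k tokens on each side of position i (the article slot)."""
    feats = {}
    for offset in range(1, k + 1):
        left, right = i - offset, i + offset
        feats["L%d=%s" % (offset, tokens[left] if left >= 0 else "<s>")] = 1
        feats["R%d=%s" % (offset, tokens[right] if right < len(tokens) else "</s>")] = 1
    return feats

def make_examples(sentences, k=3):
    """Turn every article occurrence into a (features, label) training pair."""
    X, y = [], []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok.lower() in ARTICLES:
                X.append(window_features(tokens, i, k))
                y.append(tok.lower())
    return X, y

# Toy corpus; in the project this would be, e.g., windows drawn from
# Google n-grams or parsed web text.
sentences = [
    "the cat sat on the mat".split(),
    "an apple fell from a tree".split(),
]
X, y = make_examples(sentences)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
</pre>

At test time the article is blanked out and the classifier predicts it from the surrounding window, so accuracy can be measured directly against the original text.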
 
 
 
Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article), or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.
 
 
 
== Network Analysis of Congressional Tweets ==
 
 
 
Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described [[File:README_twt_cgrs_data.txt|in this file]]. There are a number of interesting things that could be done with this, such as:
* Link prediction - predict when new links are formed in the network (see the sketch after this list)
* Classification - predict some properties of a congressperson (party, region, above/below median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what can you think of?
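For the link-prediction idea, a simple baseline is to score currently-unconnected node pairs by their number of common neighbors. The sketch below assumes an undirected follower/mention graph has already been built from the tweet corpus (the construction step is omitted), and the node names are placeholders.

<pre>
# Common-neighbors baseline for link prediction on a Twitter graph.
# Building the graph from the tweet data is assumed, not shown here.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("rep_a", "rep_b"), ("rep_b", "rep_c"), ("rep_a", "rep_d")])

# Score every missing edge by its number of common neighbors;
# higher-scoring pairs are predicted to link next.
scores = [
    (u, v, len(list(nx.common_neighbors(G, u, v))))
    for u, v in nx.non_edges(G)
]
scores.sort(key=lambda t: t[2], reverse=True)
print(scores[:5])
</pre>

Evaluating such a predictor on this corpus would mean holding out the links formed in a later time window and checking how highly the baseline ranks them.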
 
