== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Article Selection in English ==

''Also mentioned in the lecture on 1-23-2013.''
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language with probabilistic models has shown promising results. However, the proposed models rely on a set of strong features. Given the large amount of data available these days, it becomes possible to construct a probabilistic model that does not rely on strong features.
Article selection in English is one of the most challenging learning tasks for second language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model for article selection. For example, you can use a sliding window of words around the article as the features, and train a classifier on the article selection task. The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
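As a concrete illustration, the sliding-window idea above can be sketched as a small naive Bayes classifier. The toy sentences, window size, and feature names below are illustrative assumptions only; a real project would train on a large corpus such as Google n-grams.

```python
# Sketch: predict which article (a/an/the) fits a slot, using only the
# k words on either side of it as features. Toy data for illustration.
import math
from collections import defaultdict

ARTICLES = ("a", "an", "the")

def window_features(tokens, i, k=2):
    """The k words on each side of position i, tagged by their offset."""
    feats = []
    for off in range(1, k + 1):
        feats.append("L%d=%s" % (off, tokens[i - off] if i - off >= 0 else "<s>"))
        feats.append("R%d=%s" % (off, tokens[i + off] if i + off < len(tokens) else "</s>"))
    return feats

class NaiveBayesArticleModel:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def train(self, sentences, k=2):
        # Every occurrence of an article becomes one training example.
        for sent in sentences:
            toks = sent.lower().split()
            for i, t in enumerate(toks):
                if t in ARTICLES:
                    self.class_counts[t] += 1
                    for f in window_features(toks, i, k):
                        self.feat_counts[t][f] += 1

    def predict(self, tokens, i, k=2):
        # Naive Bayes with add-one smoothing over the window features.
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c in ARTICLES:
            lp = math.log((self.class_counts[c] + 1) / (total + len(ARTICLES)))
            for f in window_features(tokens, i, k):
                lp += math.log((self.feat_counts[c][f] + 1) /
                               (self.class_counts[c] + 2))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

model = NaiveBayesArticleModel()
model.train(["he ate an apple", "she read the book"])
```

The model ignores the article token itself and conditions only on its context window, which is exactly the feature representation the project description suggests.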
Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article), or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.
== Network Analysis of Congressional Tweets ==

Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described [[File:README_twt_cgrs_data.txt|in this file]]. There are a number of interesting things that could be done with this corpus, such as:

* Link prediction: predict when new links are formed in the network
* Classification: predict some properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what else can you think of?
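To make the link-prediction idea concrete, here is a minimal common-neighbors baseline on a toy graph; the node names and edges are made up for illustration, and in the actual project the edges would be extracted from the tweet corpus (e.g. who mentions or replies to whom).

```python
# Sketch: rank currently-absent node pairs by how many neighbors they
# share; high-scoring pairs are predicted to become links. Toy graph.
from itertools import combinations

edges = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")}

# Build an undirected adjacency map from the edge set.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def common_neighbors(u, v):
    return len(adj.get(u, set()) & adj.get(v, set()))

# All pairs that are not yet linked, ranked by common-neighbor count.
candidates = [
    (u, v) for u, v in combinations(sorted(adj), 2)
    if (u, v) not in edges and (v, u) not in edges
]
ranked = sorted(candidates, key=lambda p: -common_neighbors(*p))
```

Common neighbors is only a baseline; a course project would compare it against richer features (text similarity of the two congresspeople's tweets, shared party, etc.) inside a trained classifier.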