Ideas for course projects for 10-605
Article Selection for English
Also mentioned in the lecture on 1-23-2013.
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language with probabilistic models has shown promising results; however, the proposed models rely on a set of strong features. Given the large amount of data available these days, it becomes possible to construct a probabilistic model that does not rely on strong features.
Article selection in English is one of the most challenging learning tasks for second-language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model for article selection. For example, you can use a sliding window of words around the article as features and train a classifier on the article selection task. The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
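As a concrete starting point, here is a minimal sketch of the sliding-window classifier in Python, assuming plain tokenized sentences; the toy corpus, window size, and scikit-learn model are illustrative choices, not part of the project specification.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

ARTICLES = {"a", "an", "the"}

def window_features(tokens, i, k=2):
    # Features are the k words on each side of the article at position i;
    # the article itself is excluded, since it is the label to predict.
    feats = {}
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        j = i + offset
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats["w[%+d]=%s" % (offset, word)] = 1
    return feats

def make_examples(sentences):
    X, y = [], []
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok.lower() in ARTICLES:
                X.append(window_features(tokens, i))
                y.append(tok.lower())
    return X, y

# Toy corpus; in the project this would be millions of sentences.
sentences = ["I saw a cat on the mat .",
             "She is an engineer at the lab ."]
X, y = make_examples(sentences)
vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)

At web scale, the same window features could instead be counted from n-gram data and fed to a streaming learner, which is where the 10-605 angle comes in.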
Relevant datasets might include the Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb09 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, as well as student performance data, are also available. Nan Li is interested in working with 10-605 students on this project.
Network Analysis of Congressional Tweets
Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described below. There are a number of interesting things that could be done with this corpus, such as:
- Link prediction - predict when new links are formed in the network (a minimal baseline sketch appears after this list)
- Classification - predict properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
- ... what can you think of?
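For instance, here is a minimal common-neighbors baseline for the link-prediction idea, assuming the mention/reply network has already been extracted as an undirected edge list; the toy edges below are placeholders for the real graph.

from collections import defaultdict
from itertools import combinations

# Toy undirected edges; in practice these would come from the tweet data.
edges = {("repA", "repB"), ("repA", "repC"),
         ("repB", "repC"), ("repC", "repD")}

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Score every non-adjacent pair by its number of shared neighbors;
# high-scoring pairs are candidate future links.
scores = {}
for u, v in combinations(sorted(neighbors), 2):
    if (u, v) not in edges and (v, u) not in edges:
        scores[(u, v)] = len(neighbors[u] & neighbors[v])

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, score)

A real project would replace the common-neighbors score with learned features and evaluate against links that actually form later in the timeline.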
Congressional Tweet data set
Feb/27/2013
* 2,813,417 tweet messages (as of Feb/27/2013, and growing), collected through the Twitter search API (starting date May/2012), with the following search criteria:
- Messages sent by an official Twitter account owned by a member of the 112th Congress (an "MC account"); currently our MC account list includes 440 handles.
- Messages sent to (one or more) MC account(s).
- Messages which mention an MC's Twitter handle in the tweet text.
* Currently all the messages are stored in a MySQL database (a query sketch appears after this dataset description). Each message includes the following fields:
mysql> DESCRIBE tweets;
+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| id           | varchar(50)   | NO   | PRI | NULL    |       |
| created_at   | datetime      | YES  |     | NULL    |       |
| from_user    | varchar(50)   | YES  |     | NULL    |       |
| from_user_id | varchar(50)   | YES  |     | NULL    |       |
| to_user      | varchar(50)   | YES  |     | NULL    |       |
| to_user_id   | varchar(50)   | YES  |     | NULL    |       |
| lat          | float         | YES  |     | NULL    |       |
| lon          | float         | YES  |     | NULL    |       |
| source       | varchar(200)  | YES  |     | NULL    |       |
| text         | varchar(200)  | YES  |     | NULL    |       |
| retweeted    | tinyint(1)    | NO   |     | 0       |       |
| retweet_id   | varchar(50)   | YES  |     | NULL    |       |
| status_json  | varchar(4000) | YES  |     | NULL    |       |
| user_id      | varchar(50)   | NO   | MUL | NULL    |       |
+--------------+---------------+------+-----+---------+-------+
NOTE: "status_json" is the original json format NOTE: lat and lon is the geo-code (latitude and longitude). Not all the message have them. NOTE: For the precise definition of the fields, please check Tweet API documentation.
* I am also collecting statistics on how often each message is retweeted. This part is not yet complete.
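To get a feel for the data, here is a sketch of pulling the reply graph out of the database, assuming a local MySQL server and the MySQLdb driver; the connection parameters and database name are placeholders, not the real credentials.

import MySQLdb

# Placeholder credentials; replace with the real host/user/database.
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="congress")
cur = conn.cursor()
# Each (from_user, to_user) row is one directed reply edge.
cur.execute("SELECT from_user, to_user FROM tweets WHERE to_user IS NOT NULL")
edges = cur.fetchall()
cur.close()
conn.close()

From these edges you can build the adjacency structure used in the link-prediction sketch above, or join against the status_json field for richer features.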