Ideas for course projects for 10-605

== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Some Large Datasets That Are Available ==

Google [http://books.google.com/ngrams/datasets books n-grams data] was used for the phrase-finding project.  Some other tasks involving phrases - such as unsupervised learning of sentiment-bearing phrases - were discussed in class.  There is also a [http://web-ngram.research.microsoft.com/info/ Microsoft n-gram API] for Web text, which might also be useful for some projects.

The Google n-gram datasets generally include only the frequent n-grams.  It might be interesting to study how this truncation affects the use of these phrases.  Project Gutenberg [http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages allows bulk download] of their 70k books, which could be used to create your own n-gram corpus.
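
If you do build your own n-gram counts from the Gutenberg texts, the core step is just tokenization and counting.  Below is a minimal Python sketch; the local directory name <code>gutenberg_txt/</code> and the crude regular-expression tokenizer are assumptions, not anything shipped with the Gutenberg dump.  Comparing counts like these against the frequency-truncated Google counts is one way to get at the truncation question above.

<pre>
# Minimal sketch: counting n-grams over locally downloaded Project Gutenberg
# plain-text files.  The directory name and the tokenizer are assumptions.
import glob
import re
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-token tuples from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def count_ngrams(paths, n=2):
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            # Crude tokenization: lowercase runs of letters/apostrophes.
            tokens = re.findall(r"[a-z']+", f.read().lower())
        counts.update(ngrams(tokens, n))
    return counts

if __name__ == "__main__":
    # "gutenberg_txt/" is a hypothetical local mirror of the bulk download.
    bigrams = count_ngrams(glob.glob("gutenberg_txt/*.txt"), n=2)
    for gram, c in bigrams.most_common(20):
        print(" ".join(gram), c)
</pre>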
 
 
 
Wikipedia, especially the preprocessed versions available from [http://dbpedia.org/About DBpedia], is a useful source of data and ideas.  These include a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, and categories, as well as links from Wikipedia pages to external databases.  Some of these links are added manually - could you design a learning process to propose and evaluate links between Wikipedia and some other database?  What other tasks seem interesting to consider?
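
DBpedia's dumps are typically distributed as N-Triples files (one subject, predicate, object statement per line), so a first pass over them can be very simple.  The sketch below is one such pass; the file name <code>geo_coordinates_en.nt.bz2</code> is just a placeholder for whichever dump you download.

<pre>
# Minimal sketch of scanning a DBpedia N-Triples dump.  The file name is a
# placeholder; DBpedia distributes many such .nt files, often bzip2-compressed.
import bz2
import re
from collections import Counter

# Very loose N-Triples pattern: subject URI, predicate URI, object, final dot.
TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(.+)\s*\.\s*$')

def read_triples(path):
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if line.startswith("#"):
                continue
            m = TRIPLE.match(line)
            if m:
                yield m.group(1), m.group(2), m.group(3)

if __name__ == "__main__":
    # Count how many triples each predicate contributes.
    preds = Counter(p for _, p, _ in read_triples("geo_coordinates_en.nt.bz2"))
    for p, c in preds.most_common(10):
        print(c, p)
</pre>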
 
 
 
The NELL project distributes a number of large [http://rtw.ml.cmu.edu/wk/all-pairs-OC-2011-12-31-big2-gz/ datasets], including noun phrases and the word contexts they appear in (extracted from Web text).  The same data is also available for pairs of noun phrases.  Additionally, there is a collection of subject-verb-object triples from parsed English text; a copy of this, with some documentation, is in /afs/cs/project/bigML/nell.
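
As a starting point for distributional experiments with the noun-phrase data, the sketch below builds sparse context-count vectors and a cosine similarity over them.  The assumed record layout (noun phrase, context, count, tab-separated) is a guess about the file format - check the documentation that comes with the dump (or the copy in /afs/cs/project/bigML/nell) for the real columns.

<pre>
# Minimal sketch: sparse context vectors for noun phrases.  The assumed layout
# (noun phrase TAB context TAB count) is a guess; verify it against the docs.
import gzip
from collections import defaultdict

def load_context_vectors(path):
    vectors = defaultdict(dict)   # noun phrase -> {context: count}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            phrase, context, count = parts[0], parts[1], parts[2]
            try:
                vectors[phrase][context] = vectors[phrase].get(context, 0) + int(count)
            except ValueError:
                continue
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0
</pre>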
 
 
 
[http://www.geonames.org/ Geonames] has [http://download.geonames.org/export/dump/ complete dumps] of their geographical data, which consist of 7M names of places together with lat/long plus a minimal amount of additional information, including a "feature" code that describes what type of location it is (e.g., a city, a park, a lake, etc.).  Some questions you might think about (a rough classifier sketch follows this list):
 
* Can you predict which feature code is associated with a place from the other data in the record (the names for the place, the location)?
* Can you match the Geonames records with geolocated Wikipedia pages (warning: this is non-trivial)?  Can you match them with non-geolocated Wikipedia pages?
* Can you use the results of possible matches to Wikipedia to do a better job of predicting the type of a geographical location?
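
For the first question, a simple baseline is to predict the feature class from character n-grams of the place name.  A rough sketch is below, assuming the tab-separated column layout described in the readme that ships with the dump (name in column 1, feature class in column 6); double-check those indices against your copy.

<pre>
# Minimal sketch: predict a GeoNames feature class from a place's name using
# hashed character n-grams.  Column indices are taken from the dump's readme;
# verify them before trusting this.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

def load_records(path, limit=500000):
    names, labels = [], []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            cols = line.rstrip("\n").split("\t")
            if len(cols) > 6 and cols[1] and cols[6]:
                names.append(cols[1])
                labels.append(cols[6])   # feature class, e.g. P (city), H (lake), ...
    return names, labels

if __name__ == "__main__":
    names, labels = load_records("allCountries.txt")
    vec = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**20)
    X = vec.transform(names)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = SGDClassifier().fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
</pre>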
 
 
 
The [http://labrosa.ee.columbia.edu/millionsong/ Million Song] dataset has audio data for a lot of songs - a million, in fact - plus pointers to a number of interesting related resources, such as Last.fm ratings and tags for the songs, and lyrics.  A number of plausible large-scale learning tasks are described there (for instance, classifying songs by year of release).
 
 
 
[http://aws.amazon.com/publicdatasets/#1 Amazon publicly hosts] many large datasets.
 
 
 
The [http://zola.di.unipi.it/smalltext/datasets.html AOL query log] data is available from a number of sources: several million queries with click-through data and session information.  (This data was released and then withdrawn by AOL due to privacy issues - please be sensitive about the privacy implications of the data, especially the session-related information.)
 
 
 
Jure Leskovec's group at Stanford hosts the [http://snap.stanford.edu/data/ SNAP] repository, which includes many large graph datasets.  Some of these are also Wikipedia-related (e.g., a graph of which editors edited which Wikipedia pages, and which editors have communicated with each other), or related to other datasets listed above (e.g., check-in data from Foursquare-like geographically oriented social-network services).
 
 
 
We have at CMU a couple of large corpora that have been run through non-trivial NLP pipelines: all of Wikipedia, all of Gigaword, and all of ClueWeb (these are courtesy of the Hazy project at the University of Wisconsin).  Lots of interesting things can be done with these, including: distributional clustering; distant learning for relations defined from Wikipedia info-boxes; finding subcorpus-specific parses (instead of phrases); or seeing if [http://malt.ml.cmu.edu/mw/index.php/Turney_2006_A_Uniform_Approach_to_Analogies,_Synonyms,_Antonyms,_and_Associations,_COLING_2008 distributional methods for analogy and synonym can be improved by using parsed data].
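
For the distributional-clustering idea, one simple setup is to extract (word, context-word) pairs from one of these corpora, build sparse context vectors, and cluster them.  The sketch below assumes you have already written the pairs out to a hypothetical tab-separated file <code>pairs.tsv</code>; the extraction step itself is where the parsed corpora (and Hadoop) come in.

<pre>
# Minimal sketch of distributional clustering from (word, context-word) pairs.
# The input file name and format are assumptions about your own preprocessing.
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import MiniBatchKMeans

def load_pairs(path):
    contexts = defaultdict(lambda: defaultdict(int))  # word -> {context: count}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                word, ctx = parts
                contexts[word][ctx] += 1
    return contexts

if __name__ == "__main__":
    contexts = load_pairs("pairs.tsv")
    words = list(contexts)
    X = DictVectorizer().fit_transform([contexts[w] for w in words])
    km = MiniBatchKMeans(n_clusters=100, random_state=0).fit(X)
    for word, label in list(zip(words, km.labels_))[:20]:
        print(label, word)
</pre>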
 
 
 
== Article Selection for English ==
 
 
 
''Also mentioned in the lecture on 1-23-2013.''
 
 
 
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices.  Previous work on modeling this probabilistic aspect of human language skills with probabilistic models has shown promising results.  However, the proposed models rely on a set of strong features.  Given the large amount of data available these days, constructing a probabilistic model without relying on strong features becomes possible.
 
 
 
Article selection in English is one of the most challenging learning tasks for second-language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules.  In this project, you could build a large-scale probabilistic model for article selection.  For example, you could use a sliding window of words around the article as the features and train a classifier on the article selection task.  The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
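
A minimal sketch of that sliding-window setup is below: every occurrence of "a", "an", or "the" in a plain-text corpus becomes a training example whose label is the article and whose features are the surrounding words.  The file name <code>corpus.txt</code> and the window size are placeholders; at 10-605 scale you would replace the in-memory classifier with something streaming or distributed.

<pre>
# Minimal sketch of the sliding-window article-selection classifier.  The
# corpus file name, tokenizer, and window size are placeholders.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ARTICLES = {"a", "an", "the"}

def window_examples(tokens, k=3):
    """Yield (feature-dict, article) pairs for every article occurrence."""
    for i, tok in enumerate(tokens):
        if tok in ARTICLES:
            feats = {}
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    # Feature name encodes relative position and word.
                    feats["%+d_%s" % (j - i, tokens[j])] = 1
            yield feats, tok

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())
    pairs = list(window_examples(tokens))
    feats = [f for f, _ in pairs]
    labels = [a for _, a in pairs]
    X = DictVectorizer().fit_transform(feats)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
</pre>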
 
 
 
Relevant datasets might include the Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text; much smaller sets of examples labeled as to the "rule" they correspond to, and student performance data, are also available.  [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.
 
 
 
== Network Analysis of Congressional Tweets ==
 
 
 
Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics.  She has collected a large corpus of tweets from and between members of Congress, described here: [[File:README_twt_cgrs_data.txt]].  There are a number of interesting things that could be done with this data, such as:
 
* Link prediction - predict when new links are formed in the network (a simple common-neighbors baseline is sketched after this list)
* Classification - predict some properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what else can you think of?
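
For the link-prediction idea, a common-neighbors count is a standard baseline to compare against.  The sketch below assumes the network has been preprocessed into a hypothetical edge list <code>congress_edges.tsv</code> (two user ids per line, tab-separated); that format is an illustration, not the actual distribution format of the data.

<pre>
# Minimal sketch of a common-neighbors baseline for link prediction.  The edge
# list file name and its two-column format are assumptions about preprocessing.
from collections import defaultdict
from itertools import combinations

def load_graph(path):
    nbrs = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                u, v = parts[0], parts[1]
                nbrs[u].add(v)
                nbrs[v].add(u)
    return nbrs

def top_predicted_links(nbrs, k=20):
    """Score non-adjacent node pairs by their number of common neighbors."""
    scores = {}
    for u, v in combinations(nbrs, 2):
        if v not in nbrs[u]:
            common = len(nbrs[u] & nbrs[v])
            if common:
                scores[(u, v)] = common
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

if __name__ == "__main__":
    graph = load_graph("congress_edges.tsv")
    for (u, v), score in top_predicted_links(graph):
        print(u, v, score)
</pre>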
 
