== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Some Large Datasets That Are Available ==

Google [http://books.google.com/ngrams/datasets books n-grams data] was used for the phrase-finding project. Other tasks involving phrases, such as unsupervised learning of sentiment-bearing phrases, were also discussed in class. There is also a [http://web-ngram.research.microsoft.com/info/ Microsoft n-gram API] for Web text, which might be useful for some projects.

The n-gram datasets from Google usually include only frequent n-grams, and it might be interesting to see how this truncation affects the use of these phrases. Project Gutenberg [http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages allows bulk download] of their 70k books, which could be used to create your own n-gram corpus.
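
If you go that route, a minimal sketch of the counting step might look like the following. It assumes the books have already been bulk-downloaded as plain-text files into a local directory; the directory name and the crude tokenizer are illustrative assumptions, not part of the Gutenberg distribution.

<pre>
# A minimal sketch (an illustration, not course-provided code) of counting
# n-grams over plain-text books, e.g. a local directory of files bulk-downloaded
# from Project Gutenberg. The directory name and the tokenizer are assumptions.
import os
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of non-letters (deliberately simple)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def count_ngrams(directory, n=3):
    """Count all n-grams across every .txt file in the directory."""
    counts = Counter()
    for name in os.listdir(directory):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            tokens = tokenize(f.read())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

if __name__ == "__main__":
    trigrams = count_ngrams("gutenberg_books", n=3)  # hypothetical local directory
    for gram, count in trigrams.most_common(10):
        print(" ".join(gram), count)
</pre>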

Wikipedia, especially the preprocessed versions available from [http://dbpedia.org/About DBpedia], is a useful source of data and ideas. These include a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, and categories, as well as links from Wikipedia pages to external databases. Some of these links are added manually: could you design a learning process to propose and evaluate links between Wikipedia and some other database? What other tasks seem interesting to consider?

The NELL project distributes a number of large [http://rtw.ml.cmu.edu/wk/all-pairs-OC-2011-12-31-big2-gz/ datasets], including noun phrases and the word contexts they appear in (extracted from Web text). The same data is also available for noun-phrase pairs. Additionally, there is a collection of subject-verb-object triples from parsed English text; a copy of this, with some documentation, is in /afs/cs/project/bigML/nell.

[http://www.geonames.org/ Geonames] has [http://download.geonames.org/export/dump/ complete dumps] of their geographical data, which consist of about 7M place names together with latitude/longitude, plus a minimal amount of additional information, including a "feature" code that describes what type of location it is (e.g., a city, a park, a lake, etc.). Some questions you might think about:

* Can you predict which feature code is associated with a place from the other data in the record (the names for the place, the location)? A minimal sketch of this task appears after this list.
* Can you match the geoname records with geolocated Wikipedia pages (warning: this is non-trivial)? Can you match them with non-geolocated Wikipedia pages?
* Can you use the results of possible matches to Wikipedia to do a better job of predicting the type of a geographical location?
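
As a concrete starting point for the first question, here is a minimal sketch that predicts the feature class from the place name only. The file name allCountries.txt and the column positions are taken from the Geonames readme, so treat them as assumptions to verify; scaling this up, or adding latitude/longitude as features, is left to you.

<pre>
# A minimal sketch, not a full solution: predict the Geonames "feature class"
# (a single letter such as P for populated place or H for stream/lake) from the
# place name alone. Assumes a local tab-separated dump such as allCountries.txt;
# the column positions follow the Geonames readme and should be double-checked.
import csv
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

names, labels = [], []
with open("allCountries.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > 7 and row[6]:
            names.append(row[1])    # place name
            labels.append(row[6])   # one-letter feature class
        if len(names) >= 500000:    # subsample so the sketch fits in memory
            break

# Hashed character n-grams of the name as sparse features.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**20)
X = vectorizer.transform(names)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = SGDClassifier()   # linear model trained with SGD (hinge loss by default)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
</pre>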

The [http://labrosa.ee.columbia.edu/millionsong/ Million Song] dataset has audio features for a lot of songs (a million, in fact) and pointers to a number of interesting related resources, such as Last.fm ratings, tags for the songs, and lyrics. A number of plausible large-scale learning tasks are described there (for instance, classify songs by year of release).
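
For the year-of-release task, a small baseline might look like the sketch below. It assumes the per-song audio features have already been exported to a flat CSV with the year in the first column, which is not the dataset's native HDF5 layout, so the file name and format here are assumptions.

<pre>
# A minimal sketch: ridge regression from audio features to release year.
# Assumes a CSV with the year in column 0 and numeric features in the rest.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

data = np.loadtxt("msd_year_features.csv", delimiter=",")  # hypothetical export
y, X = data[:, 0], data[:, 1:]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("mean absolute error (years):", mean_absolute_error(y_te, model.predict(X_te)))
</pre>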

Amazon [http://aws.amazon.com/publicdatasets/#1 publicly hosts] many large datasets.

The [http://zola.di.unipi.it/smalltext/datasets.html AOL query log] data is available from a number of sources: several million queries with click-through data and session information. (This data was released and then withdrawn by AOL due to privacy issues; please be sensitive about the privacy implications of the data, especially the session-related data.)

Jure Leskovec's group at Stanford hosts the [http://snap.stanford.edu/data/ SNAP] repository, which includes many large graph datasets. Some of these are Wikipedia-related (e.g., a graph of which editors edited which Wikipedia pages, and which editors have communicated with each other), and some are related to other datasets listed above (e.g., check-in data from Foursquare-like geographically oriented social-network services).

We have at CMU a couple of large corpora that have been run through non-trivial NLP pipelines: all of Wikipedia, all of Gigaword, and all of ClueWeb (these are courtesy of the Hazy project at the University of Wisconsin). Lots of interesting things can be done with these, including: distributional clustering; distant learning for relations defined from Wikipedia infoboxes; finding subcorpus-specific parses (instead of phrases); or seeing if [http://malt.ml.cmu.edu/mw/index.php/Turney_2006_A_Uniform_Approach_to_Analogies,_Synonyms,_Antonyms,_and_Associations,_COLING_2008 distributional methods for analogy and synonym can be improved by using parsed data].

== Article Selection for English ==

''Also mentioned in the lecture on 1-23-2013.''

In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language skills with probabilistic models has shown promising results. However, the proposed models rely on a set of strong features. Given the large amount of data available these days, it becomes possible to construct a probabilistic model without relying on strong features.

Article selection in English is one of the most challenging learning tasks for second-language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model of article selection. For example, you can use a sliding window of words around the article as the features and train a classifier on the article-selection task (a minimal sketch appears below). The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such categorical rules?
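
Here is a minimal sketch of the sliding-window idea, using a toy in-memory corpus in place of a large dataset; the window size, the tiny example sentences, and the choice of classifier are all illustrative assumptions.

<pre>
# A minimal sketch: each occurrence of an article (a/an/the) becomes a training
# example whose features are the words in a small window around it and whose
# label is the article itself.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

ARTICLES = {"a", "an", "the"}

def window_examples(tokens, k=2):
    """Yield (feature dict, article) pairs for every article token."""
    for i, tok in enumerate(tokens):
        if tok in ARTICLES:
            feats = {}
            for j in range(1, k + 1):
                if i - j >= 0:
                    feats["L%d=%s" % (j, tokens[i - j])] = 1
                if i + j < len(tokens):
                    feats["R%d=%s" % (j, tokens[i + j])] = 1
            yield feats, tok

# Stand-in for a large corpus; in the real project these lines would be
# streamed from something like the n-gram or Hazy/ClueWeb data.
texts = ["The cat sat on a mat near the door .",
         "She gave an apple to the teacher ."]

X_dicts, y = [], []
for line in texts:
    for feats, article in window_examples(line.lower().split()):
        X_dicts.append(feats)
        y.append(article)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Which article does the model prefer before the word "apple"?
print(clf.predict(vectorizer.transform([{"R1=apple": 1}])))
</pre>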

Relevant datasets might include the Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, and student performance data, are also available. [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.

== Network Analysis of Congressional Tweets ==

Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described here: [[File:README_twt_cgrs_data.txt]]. There are a number of interesting things that could be done with this, such as:

* Link prediction: predict when new links are formed in the network (a simple baseline is sketched after this list)
* Classification: predict some properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what else can you think of?
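
For the link-prediction idea, a very simple baseline is to rank currently unlinked pairs by their number of common neighbors. The sketch below assumes the network has been dumped to a plain edge list, one pair of user ids per line; the file name and format are assumptions, not part of the README above.

<pre>
# A minimal sketch of a common-neighbors baseline for link prediction on the
# congressional Twitter network. The quadratic loop over node pairs is fine at
# the scale of a few hundred members of Congress.
import networkx as nx

G = nx.read_edgelist("congress_edges.txt")  # hypothetical edge-list dump

def common_neighbor_scores(graph, top=20):
    """Rank currently unlinked pairs by how many neighbors they share."""
    scores = []
    nodes = list(graph)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if not graph.has_edge(u, v):
                cn = len(list(nx.common_neighbors(graph, u, v)))
                if cn:
                    scores.append((cn, u, v))
    return sorted(scores, reverse=True)[:top]

for score, u, v in common_neighbor_scores(G):
    print(u, v, score)
</pre>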