== Projects Suggested in Previous Years ==

* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012]]
* [[Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014]]

== Some Large Datasets That Are Available ==

Google [http://books.google.com/ngrams/datasets books n-grams data] was used for the phrase-finding project. Other tasks involving phrases, such as unsupervised learning of sentiment-bearing phrases, were also discussed in class. There is also a [http://web-ngram.research.microsoft.com/info/ Microsoft n-gram API] for Web text, which might be useful for some projects.

The n-gram datasets from Google usually include only frequent n-grams, and it might be interesting to see how this truncation affects the use of these phrases. Project Gutenberg [http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages allows bulk download] of their 70k books, which could be used to create your own n-gram corpus.
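
If you go that route, a minimal sketch of the counting step might look like the following. It assumes the books have already been bulk-downloaded as plain-text files into a local directory; the directory name and the crude tokenizer are illustrative assumptions, not part of the Gutenberg distribution.

<pre>
# A minimal sketch (an illustration, not course-provided code) of counting
# n-grams over plain-text books, e.g. a local directory of files bulk-downloaded
# from Project Gutenberg. The directory name and the tokenizer are assumptions.
import os
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of non-letters (deliberately simple)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def count_ngrams(directory, n=3):
    """Count all n-grams across every .txt file in the directory."""
    counts = Counter()
    for name in os.listdir(directory):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            tokens = tokenize(f.read())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

if __name__ == "__main__":
    trigrams = count_ngrams("gutenberg_books", n=3)  # hypothetical local directory
    for gram, count in trigrams.most_common(10):
        print(" ".join(gram), count)
</pre>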

Wikipedia, especially the preprocessed versions available from [http://dbpedia.org/About DBpedia], is a useful source of data and ideas. These include a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, and categories, as well as links from Wikipedia pages to external databases. Some of these links are added manually: could you design a learning process to propose and evaluate links between Wikipedia and some other database? What other tasks seem interesting to consider?

The NELL project distributes a number of large [http://rtw.ml.cmu.edu/wk/all-pairs-OC-2011-12-31-big2-gz/ datasets], including noun phrases and the word contexts they appear in (extracted from Web text). The same data is also available for noun-phrase pairs. Additionally, there is a collection of subject-verb-object triples from parsed English text; a copy of this, with some documentation, is in /afs/cs/project/bigML/nell.

[http://www.geonames.org/ Geonames] has [http://download.geonames.org/export/dump/ complete dumps] of their geographical data, which consist of about 7M place names together with latitude/longitude, plus a minimal amount of additional information, including a "feature" code that describes what type of location it is (e.g., a city, a park, a lake, etc.). Some questions you might think about:

* Can you predict which feature code is associated with a place from the other data in the record (the names for the place, the location)? A minimal sketch of this task appears after this list.
* Can you match the geoname records with geolocated Wikipedia pages (warning: this is non-trivial)? Can you match them with non-geolocated Wikipedia pages?
* Can you use the results of possible matches to Wikipedia to do a better job of predicting the type of a geographical location?
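
As a concrete starting point for the first question, here is a minimal sketch that predicts the feature class from the place name only. The file name allCountries.txt and the column positions are taken from the Geonames readme, so treat them as assumptions to verify; scaling this up, or adding latitude/longitude as features, is left to you.

<pre>
# A minimal sketch, not a full solution: predict the Geonames "feature class"
# (a single letter such as P for populated place or H for stream/lake) from the
# place name alone. Assumes a local tab-separated dump such as allCountries.txt;
# the column positions follow the Geonames readme and should be double-checked.
import csv
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

names, labels = [], []
with open("allCountries.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > 7 and row[6]:
            names.append(row[1])    # place name
            labels.append(row[6])   # one-letter feature class
        if len(names) >= 500000:    # subsample so the sketch fits in memory
            break

# Hashed character n-grams of the name as sparse features.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**20)
X = vectorizer.transform(names)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = SGDClassifier()   # linear model trained with SGD (hinge loss by default)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
</pre>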

The [http://labrosa.ee.columbia.edu/millionsong/ Million Song] dataset has audio features for a lot of songs (a million, in fact) and pointers to a number of interesting related resources, such as Last.fm ratings, tags for the songs, and lyrics. A number of plausible large-scale learning tasks are described there (for instance, classify songs by year of release).
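
For the year-of-release task, a small baseline might look like the sketch below. It assumes the per-song audio features have already been exported to a flat CSV with the year in the first column, which is not the dataset's native HDF5 layout, so the file name and format here are assumptions.

<pre>
# A minimal sketch: ridge regression from audio features to release year.
# Assumes a CSV with the year in column 0 and numeric features in the rest.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

data = np.loadtxt("msd_year_features.csv", delimiter=",")  # hypothetical export
y, X = data[:, 0], data[:, 1:]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("mean absolute error (years):", mean_absolute_error(y_te, model.predict(X_te)))
</pre>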

Amazon [http://aws.amazon.com/publicdatasets/#1 publicly hosts] many large datasets.

The [http://zola.di.unipi.it/smalltext/datasets.html AOL query log] data is available from a number of sources: several million queries with click-through data and session information. (This data was released and then withdrawn by AOL due to privacy issues; please be sensitive about the privacy implications of the data, especially the session-related data.)

Jure Leskovec's group at Stanford hosts the [http://snap.stanford.edu/data/ SNAP] repository, which includes many large graph datasets. Some of these are Wikipedia-related (e.g., a graph of which editors edited which Wikipedia pages, and which editors have communicated with each other), and some are related to other datasets listed above (e.g., check-in data from Foursquare-like geographically oriented social-network services).

We have at CMU a couple of large corpora that have been run through non-trivial NLP pipelines: all of Wikipedia, all of Gigaword, and all of ClueWeb (these are courtesy of the Hazy project at the University of Wisconsin). Lots of interesting things can be done with these, including: distributional clustering; distant learning for relations defined from Wikipedia infoboxes; finding subcorpus-specific parses (instead of phrases); or seeing if [http://malt.ml.cmu.edu/mw/index.php/Turney_2006_A_Uniform_Approach_to_Analogies,_Synonyms,_Antonyms,_and_Associations,_COLING_2008 distributional methods for analogy and synonym can be improved by using parsed data].

== Article Selection for English ==

''Also mentioned in the lecture on 1-23-2013.''

In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language skills with probabilistic models has shown promising results. However, the proposed models rely on a set of strong features. Given the large amount of data available these days, it becomes possible to construct a probabilistic model without relying on strong features.

Article selection in English is one of the most challenging learning tasks for second-language learners: the full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model of article selection. For example, you can use a sliding window of words around the article as the features and train a classifier on the article-selection task (a minimal sketch appears below). The interesting question to ask is: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such categorical rules?
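
Here is a minimal sketch of the sliding-window idea, using a toy in-memory corpus in place of a large dataset; the window size, the tiny example sentences, and the choice of classifier are all illustrative assumptions.

<pre>
# A minimal sketch: each occurrence of an article (a/an/the) becomes a training
# example whose features are the words in a small window around it and whose
# label is the article itself.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

ARTICLES = {"a", "an", "the"}

def window_examples(tokens, k=2):
    """Yield (feature dict, article) pairs for every article token."""
    for i, tok in enumerate(tokens):
        if tok in ARTICLES:
            feats = {}
            for j in range(1, k + 1):
                if i - j >= 0:
                    feats["L%d=%s" % (j, tokens[i - j])] = 1
                if i + j < len(tokens):
                    feats["R%d=%s" % (j, tokens[i + j])] = 1
            yield feats, tok

# Stand-in for a large corpus; in the real project these lines would be
# streamed from something like the n-gram or Hazy/ClueWeb data.
texts = ["The cat sat on a mat near the door .",
         "She gave an apple to the teacher ."]

X_dicts, y = [], []
for line in texts:
    for feats, article in window_examples(line.lower().split()):
        X_dicts.append(feats)
        y.append(article)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Which article does the model prefer before the word "apple"?
print(clf.predict(vectorizer.transform([{"R1=apple": 1}])))
</pre>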

Relevant datasets might include the Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text. Much smaller sets of examples labeled with the "rule" they correspond to, and student performance data, are also available. [http://www.cs.cmu.edu/~nli1/ Nan Li] is interested in working with 10-605 students on this project.

== Network Analysis of Congressional Tweets ==

Tae Yano (taey@cs.cmu.edu) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress, described here: [[File:README_twt_cgrs_data.txt]]. There are a number of interesting things that could be done with this, such as:

* Link prediction: predict when new links are formed in the network (a simple baseline is sketched after this list)
* Classification: predict some properties of a congressperson (party, region, above/below-median funding from various industries, etc.) from network properties, text, or some combination of these
* ... what else can you think of?
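
For the link-prediction idea, a very simple baseline is to rank currently unlinked pairs by their number of common neighbors. The sketch below assumes the network has been dumped to a plain edge list, one pair of user ids per line; the file name and format are assumptions, not part of the README above.

<pre>
# A minimal sketch of a common-neighbors baseline for link prediction on the
# congressional Twitter network. The quadratic loop over node pairs is fine at
# the scale of a few hundred members of Congress.
import networkx as nx

G = nx.read_edgelist("congress_edges.txt")  # hypothetical edge-list dump

def common_neighbor_scores(graph, top=20):
    """Rank currently unlinked pairs by how many neighbors they share."""
    scores = []
    nodes = list(graph)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if not graph.has_edge(u, v):
                cn = len(list(nx.common_neighbors(graph, u, v)))
                if cn:
                    scores.append((cn, u, v))
    return sorted(scores, reverse=True)[:top]

for score, u, v in common_neighbor_scores(G):
    print(u, v, score)
</pre>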