Projects for Machine Learning with Large Datasets 10-605 in Spring 2014
Large Datasets That Are Available
Kaggle and the KDD Cup (a series of competitions) include a number of large datasets with very specific goals.
The Google Books n-gram data was used for the phrase-finding project. Some other tasks involving phrases - such as unsupervised learning of sentiment-bearing phrases - were discussed in class. There is also a Microsoft n-gram API for Web text, which might be useful for some projects.
The n-gram datasets from Google often include only the frequent n-grams. It might be interesting to see how this truncation affects the use of these phrases. Project Gutenberg allows bulk download of their 70k books, which could be used to create your own n-gram corpus.
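As a rough illustration of what "truncation" means here, the sketch below counts bigrams in a toy sentence and then drops those under a frequency cutoff. The function names, the cutoff value, and the toy text are all hypothetical, and whitespace tokenization is a simplification; a real run over Project Gutenberg would need streaming and a proper tokenizer.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def truncate(counts, min_count):
    """Keep only n-grams seen at least min_count times (hypothetical cutoff)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus, for illustration only.
text = "the cat sat on the mat and the cat sat on the rug"
tokens = text.split()

bigrams = count_ngrams(tokens, 2)
frequent = truncate(bigrams, min_count=2)

# Fraction of bigram occurrences that survive the cutoff.
coverage = sum(frequent.values()) / sum(bigrams.values())
print(len(bigrams), len(frequent), round(coverage, 3))
```

Comparing a model trained on `bigrams` with one trained on `frequent` is one simple way to measure how much the cutoff matters for a given task.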
Wikipedia, especially the preprocessed versions available in DBpedia, is a useful source of data and ideas. DBpedia includes a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, and categories, as well as links from Wikipedia pages to external databases. Some of the links are manually added - could you design a learning process to propose and evaluate links between Wikipedia and some other database? What other tasks seem interesting to consider?
The NELL project distributes a number of large datasets including noun phrases and the word contexts they appear in (extracted from Web text). This data is also available for noun-phrase pairs. Additionally there is a collection of subject-verb-object triples from parsed English text - there's a copy of this, with some documentation, in /afs/cs/project/bigML/nell.
Geonames has complete dumps of their geographical data - which consists of 7M names of places together with lat/long, plus a minimal amount of additional information, including a "feature" code that describes what type the location is (e.g., a city, a park, a lake, etc). Some questions you might think about:
- Can you predict what feature is associated with a place from the other data in the record (the names for the place, the location)?
- Can you match the Geonames records with geolocated Wikipedia pages (warning - this is non-trivial)? Can you match them with non-geolocated Wikipedia pages?
- Can you use the results of possible matches to Wikipedia to do a better job at predicting the type of a geographical location?
The Million Song Dataset has audio data for a lot of songs - actually a million - and pointers to a variety of interesting related information, like Last.fm ratings and tags for the songs, and lyrics. A number of plausible large-scale learning tasks are described there (for instance - classify songs by year of release).
Amazon publicly hosts many large datasets.
The AOL query log data is available from a number of sources - several million queries with click-through data and session information. (This data was released and then unreleased by AOL due to privacy issues - please be sensitive about the privacy implications of the data, especially the session-related data.)
Jure Leskovec's group at Stanford hosts the SNAP repository, which includes many large graph datasets. Some of these are also Wikipedia-related (e.g., a graph of which editors edited which Wikipedia pages, and which editors have communicated with each other), or related to other datasets listed above (e.g., check-in data from Foursquare-like geographically oriented social-network services).
We have at CMU several large corpora that have been run through non-trivial NLP pipelines: all of Wikipedia, all of Gigaword, and all of ClueWeb (these are courtesy of the Hazy project at the University of Wisconsin). Lots of interesting things can be done with these, including: doing distributional clustering; distant learning for relations defined from Wikipedia info-boxes; finding subcorpus-specific parses (instead of phrases); or seeing if distributional methods for analogy and synonym can be improved by using parsed data.
Some New Project Ideas
Article Selection for English
In spite of the long-standing simplifying assumption that knowledge of language is characterized by a categorical system of grammar, many previous studies have shown that language users also make probabilistic syntactic choices. Previous work on modeling this probabilistic aspect of human language skills with probabilistic models has shown promising results. However, those models rely on a set of strong features. Given the large amount of data available these days, constructing a probabilistic model that does not rely on strong features becomes possible.
Article selection in English is one of the most challenging learning tasks for second language learners. The full set of grammar rules for choosing the correct article contains more than 60 rules. In this project, you can build a large-scale probabilistic model of article selection. For example, you can use a sliding window of words around the article as the features, and train a classifier on the article selection task. The interesting question is this: although the associated grammar rules are non-probabilistic, can a large amount of data help a probabilistic model capture such non-probabilistic rules?
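The sliding-window idea can be sketched roughly as follows. This is only one possible feature encoding, not a prescribed one: the function name, the window size, the "<pad>" token, and the decision to merge "a" and "an" into one label are all assumptions made for illustration, and the sketch ignores the "no article" case entirely.

```python
ARTICLES = {"a", "an", "the"}

def article_examples(tokens, window=2):
    """Build one (features, label) pair per article occurrence.

    The article itself is hidden from the features and becomes the label.
    "a" and "an" are merged into one class (an assumption of this sketch).
    """
    examples = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word not in ARTICLES:
            continue
        label = "a/an" if word in {"a", "an"} else "the"
        feats = {}
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # skip the article position itself
            j = i + offset
            # Positions outside the sentence get a padding token.
            inside = 0 <= j < len(tokens)
            feats[f"w[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
        examples.append((feats, label))
    return examples

# Toy sentence, for illustration only.
sentence = "She bought a book at the store".split()
examples = article_examples(sentence, window=2)
for feats, label in examples:
    print(label, feats)
```

The resulting feature dictionaries could be fed to any sparse linear classifier; with n-gram data the same features can be read directly off each 5-gram whose middle word is an article.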
Relevant datasets might include Google n-grams (if you decide to use just a window of words on either side of the article) or the Hazy dataset (pre-parsed copies of Wikipedia and the ClueWeb2009 corpus) if you use parsed text; much smaller sets of examples labeled as to the "rule" they correspond to, and student performance data, are also available. Nan Li is interested in working with 10-605 students on this project.
Network Analysis of Congressional Tweets
Tae Yano (firstname.lastname@example.org) is a PhD student who works with Noah Smith on predictive modeling of text, especially in politics. She has collected a large corpus of tweets from and between members of Congress - described here: File:README twt cgrs data.txt. There are a number of interesting things that could be done with this, such as:
- Link prediction - predict when new links are formed in the network
- Classification - predict some properties of a congressperson (party, region, above/below median funding from various industries, etc) from network properties, text, or some combination of these
- ... what can you think of?
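For the link-prediction idea, one very simple baseline is to score each unlinked pair by the number of neighbors the two accounts share. The sketch below is only that baseline, on a made-up graph: the node names and edges are invented, the function name is hypothetical, and treating the network as undirected is a simplifying assumption (mentions and follows are really directed).

```python
from itertools import combinations

def common_neighbor_scores(edges):
    """Score each currently unlinked pair by its number of shared neighbors."""
    neighbors = {}
    for u, v in edges:
        # Undirected view of the graph (an assumption of this sketch).
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        shared = len(neighbors[u] & neighbors[v])
        if shared:
            scores[(u, v)] = shared
    return scores

# Made-up toy network, for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
scores = common_neighbor_scores(edges)
print(scores)
```

Enumerating all pairs is quadratic in the number of nodes, which is fine for a network the size of Congress but would need a different strategy on larger graphs. To evaluate, one could build the graph from tweets before some date and check whether high-scoring pairs become linked afterwards.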
PageRank for ClueWeb 2013
ClueWeb 2009 is a widely-used web crawl distributed by Jamie Callan. He's doing an update, ClueWeb 2013. The data has been crawled and he'd like to compute PageRank for it - but this requires a completely scalable PageRank implementation (following the algorithm we discussed in class). This would be a chance to make a contribution to a real research project that is likely to be heavily used by other researchers in the future.
Doing this will probably require also performing PageRank on other smaller datasets, as a check on correctness of the algorithm. A large project team might want to also explore PageRank variants (e.g., TrustRank) as well, or compare different implementation strategies for performance differences. Jamie Callan is the contact person for this project.
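For the correctness checks on small datasets, a plain in-memory power iteration can serve as a reference to compare a scalable implementation against. The sketch below is a textbook version with uniform teleportation and uniform redistribution of rank from dangling pages; it is not necessarily the variant discussed in class, and the function name, the damping value of 0.85, and the toy graph are assumptions.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Reference PageRank by power iteration on an in-memory graph.

    out_links maps each node to a list of nodes it links to.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        # L1 change between successive iterates, used as the stopping test.
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# Toy graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Two cheap sanity checks carry over to any implementation: the ranks should sum to 1, and a page with no in-links should end up with exactly the teleport share (1 - damping) / n when the graph has no dangling pages.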