Projects for Machine Learning with Large Datasets 10-605 in Spring 2012
You are required to do a one-month short project. The project should be relevant to the course - e.g., to compare the scalability of variant learning algorithms on datasets, or to do some significant work with a new an interesting large dataset.
Large Datasets That Are Available
Google books n-grams data was used for the phrase-finding project. Some other tasks involving phrases - such as unsupervised learning of sentiment-bearing phrases - were discussed in class. There is also a Microsoft n-gram API for Web text, which might also be useful for some projects.
N-gram datasets from Google are often only frequent n-grams. It might be interesting to see how this truncation affects use of these phrases. Project Gutenberg allows bulk download of their 70k books, which could be used to create your own n-gram corpus.
Wikipedia, especially the preprocessed versions available in DBPedia, is a useful source of data and ideas. These include a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, categories, as well as links from wikipedia pages to external databases. Some of the links are manually added - could you design a learning process to propose and evaluate links between Wikipedia and some other database? what sort of other tasks seem interesting to consider?
The NELL project distributes a number of large datasets including noun-phrases and the word context they appear in (extracted from Web text). This data is also available for noun phrase pairs. Additionally there is a collection of subject-verb-object triples from a parsed English text - there's a copy of this, with some documentation, in /afs/cs/project/bigML/nell.
Geonames has complete dumps of their geographical data - which consists of 7M names of places together with lat/long, plus a minimal amount of additional information, including a "feature" code that describes what type the location is (e.g., a city, a park, a lake, etc). Some questions you might think about:
- Can you predict what feature is associated with a place from the other data in the record (the names for the place, the location?)
- Can you match the geoname records with geolocated wikipedia pages (warning - this is non-trivial)? Can you match them with non-geolocated wikipedia pages?
- Can you use the results of possible matches to wikipedia to do a better job at predicting the type of a geographical location?
The Million song dataset has audio data for a lot of songs - actually a million - and pointers to a number of interesting related information, like List.fm ratings and tags for the songs and lyrics. A number of plausible large-scale learning tasks are described there (for instance - classify songs by year of release).
Amazon publically hosts many large datasets.
The AOL query log data is available from a number of sources - several million queries with click-through data and session information. (This data was released and then unreleased by AOL due to privacy issues - please be sensitive about the privacy implications of the data, especially the session-related data.)
Jure Leskovic's group at Stanford hosts the SNAP repository, which includes many large graph datasets. Some of these are also wikipedia-related (e.g., a graph of which editors edited which wikipedia pages, and which editors have communicated with each other, or related to other datasets listed above (e.g., checkin data from 4-square-like geographically oriented social-network services).
Problems to Think About
Nearest Neighbor based Greedy Coordinate Descent
This is a work done by I. Dhillon, P. Ravikumar, and A. Tewari in their NIPS 2011 paper.
This paper presents an interesting approach to coordinate descent learning of high dimensional linear models. For linear models, the gradient along a coordinate is the inner product of the corresponding feature’s data vector and the gradient vector. Therefore, finding the coordinate with the largest gradient can be proximate by finding the feature vector which is closest to the gradient vector, which can be approximately solved by indexing techniques such as Locality Sensitive Hashing (LSH).
This paper raises a new line of research where indexing techniques such as LSH become a critical component for learning with high dimensional problems. This technique can potentially be applied to problems such as topic modeling (e.g. LDA), or graphical model structure learning.
You can reproduce their experiment result, or apply their technique to problem of your choice.
Word context and word meaning
The advent of the WWW has given us a huge amount of text data. That data contains many words used in different contexts with different meanings. Can you use the patterns of word context to infer something about word meaning?
For example, consider all of the word co-occurrences with the noun "apple". Now consider the subset of those word co-occurrences that appear when the adjective rotten comes before apple. What does that change in co-occurrence data tell you about the adjective "rotten"? Does it imply that a rotten apple is no longer something a person would want to eat? In addition, "rotten apple" has a metaphorical meaning (the free dictionary defines it as a person with a corrupting influence). Can you detect the multiple meanings from the co-occurrence data?
Classifying into a large hierarchy
Can you use the structure of a hierarchy of labels to improve the classification of documents (or anything else) into that hierarchy? There are many approaches to this problem. One is discussed in this paper: http://users.cis.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p213.pdf You could propose a new one, or extend and existing one.
For this project you could use the Reuters news wire data and its hierarchical labels, or propose another large hierarchical data set.
Experimentally analyzing performance of large-scale learning systems
During the course we will frequently discuss possible ways of speeding up or scaling up a learning system. Some we'll talk about in detail, and some we won't. Some are plausible but haven't been explored in any depth in the literature, either because they are new ideas (like using approximate counts for Naive Bayes) or because they rely on using newer types of hardware (SSD disks or GPUs). Pick some of these methods and test them experimentally on a large dataset - see how fast you can make them!
Building a new classifier for NELL
(From Tom Mitchell) - The NELL project has produced some corpus statistics describing subject-verb-object (SVO) triples collected by dependency parsing hundreds of millions of sentences. A typical entry is subject=horses, verb=eat, object=hay, frequency=412. Separately, NELL's knowledge base contains instances of hundreds of relations, from riverFlowsThroughCity(river, city), to animalEatsFood(animal, food). (e.g., here is a list of NELL's beliefs about animalEatFood(animal, food): http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:animaleatfood).
In this project, the key idea is to train a "bag of verbs" classifier for each relation in NELL, basing the prediction on the SVO data, and using NELL's existing knowledge base as labeled data. Given a pair of noun phrases such as <cat, fish>, or <cat, house>, your classifier should predict whether the noun phrase pair satisfies the relation. NELL does not have such a classifier, and if yours works well, we will incorporate it into NELL as a new component, potentially leading to a publication as well. There are several approaches one can take to this problem. For example, a baseline approach could train the relation classifiers independently for each relation, using the raw SVO data. You can refine this by thinking of ways of perhaps preprocessing the data (e.g., use stemming so that the triples <horse eats hay> and <horses eat hay> are merged), and perhaps you can think of ways to couple the training of multiple relations (e.g., how would you couple the training of the relations CEOofCompany(person,company) and WorksAtCompany(person, company).
Parallel stochastic block models
(For those willing to read ahead, or who are interested in topic modeling.) Newman et al (JMLR 2009, "Distributed algorithms for topic models") described a map-reduce version of LDA. William and Ramnath Balasubramanyan have been experimenting with an LDA-like network model due to Parkkinen et al ("A block model suitable for sparse graphs", MLG 2009). To my knowledge there is no parallel version of this learning algorithm, although the ideas of Newman et al seem to be easily adaptable to it.