Projects for Machine Learning with Large Datasets 10-605 in Spring 2012


You are required to do a one-month short project. The project should be relevant to the course - e.g., comparing the scalability of different learning algorithms on large datasets.

== Large Datasets That Are Available ==

Google [http://books.google.com/ngrams/datasets books n-grams data].  There is also a [http://web-ngram.research.microsoft.com/info/ Microsoft n-gram API].

Wikipedia, especially the preprocessed versions available in [http://dbpedia.org/About DBPedia].  These include a number of interesting and easy-to-process datasets such as hyperlinks, geographical locations, and categories, as well as links from Wikipedia pages to external databases.

Project Gutenberg [http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages allows bulk download] of their 70k books.

The NELL project distributes a number of large datasets, including statistics on common noun phrases and the word contexts they appear in.  The same data is also available for noun-phrase pairs.  Additionally, there is a collection of subject-verb-object triples from parsed English text. '''Links to be added soon.'''

[http://www.geonames.org/ Geonames] has [http://download.geonames.org/export/dump/ complete dumps] of their geographical data, which consist of 7M names of places together with lat/long coordinates plus a minimal amount of additional information, including a "feature" code that describes what type of location it is (e.g., a city, a park, a lake, etc.).  Some questions you might think about:

* Can you predict what feature code is associated with a place from the other data in the record (the names for the place, the location)?  See the sketch after this list for one simple baseline.
* Can you match the Geonames records with geolocated Wikipedia pages (warning - this is non-trivial)?  Can you match them with non-geolocated Wikipedia pages?
* Can you use the results of possible matches to Wikipedia to do a better job of predicting the type of a geographical location?
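
As a starting point for the first question, here is a minimal sketch that trains a linear classifier to predict the feature class from character n-grams of the place name.  The file name and column positions (name in column 2, feature class in column 7 of the tab-separated dump) are assumptions taken from the dump's documentation and should be checked against its readme; this is an illustrative baseline, not a prescribed approach.

<source lang="python">
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed column positions in the tab-separated Geonames dump:
# 1 = name, 6 = feature class (P, H, L, ...).  Verify against the readme.
# For a quick experiment you may want to subsample the rows.
names, feature_classes = [], []
with open("allCountries.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        names.append(row[1])
        feature_classes.append(row[6])

# character n-grams of the place name as features; a linear model on top
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=5),
    LogisticRegression(max_iter=1000),
)
model.fit(names, feature_classes)
print(model.predict(["Lake Tahoe", "Mount Rainier National Park"]))
</source>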

== Problems to Think About ==

=== Nearest Neighbor based Greedy Coordinate Descent ===

This project idea is based on work by I. Dhillon, P. Ravikumar, and A. Tewari in their NIPS 2011 paper.

This paper presents an interesting approach to coordinate descent learning of high-dimensional linear models. For linear models, the gradient along a coordinate is the inner product of the corresponding feature's data vector and the gradient vector. Therefore, finding the coordinate with the largest gradient can be approximated by finding the feature vector that is closest to the gradient vector, which in turn can be solved approximately with indexing techniques such as Locality Sensitive Hashing (LSH).
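
To make the selection step concrete, here is a minimal sketch of greedy coordinate descent for least squares in which the coordinate is chosen by an exact maximum-inner-product search over feature columns; the paper's contribution is to replace that exact argmax with approximate nearest-neighbor indexing such as LSH.  This is an illustration of the idea under simple assumptions, not the authors' implementation.

<source lang="python">
import numpy as np

def greedy_coordinate_descent_lsq(X, y, n_iters=100):
    """Greedy coordinate descent for 0.5 * ||X w - y||^2.

    Each coordinate gradient is an inner product X[:, j]^T r between a
    feature column and the residual, so choosing the coordinate with the
    largest-magnitude gradient is a maximum-inner-product search.  Here it
    is done exactly with argmax; the paper approximates it with LSH-style
    nearest-neighbor indexing to scale to very high dimensions.
    """
    n, d = X.shape
    w = np.zeros(d)
    col_sq_norms = (X ** 2).sum(axis=0) + 1e-12   # per-coordinate curvature
    r = X @ w - y                                 # residual vector
    for _ in range(n_iters):
        grad = X.T @ r                            # one inner product per coordinate
        j = np.argmax(np.abs(grad))               # <- the step LSH would approximate
        step = grad[j] / col_sq_norms[j]          # exact minimization along coordinate j
        w[j] -= step
        r -= step * X[:, j]                       # keep the residual up to date
    return w

# toy usage: a sparse ground truth recovered after a few greedy steps
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[[3, 17]] = [2.0, -1.5]
y = X @ w_true
print(np.round(greedy_coordinate_descent_lsq(X, y, n_iters=50)[[3, 17]], 2))
</source>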

This paper opens a new line of research in which indexing techniques such as LSH become a critical component of learning in high-dimensional problems. The technique can potentially be applied to problems such as topic modeling (e.g., LDA) or graphical model structure learning.

You can reproduce their experimental results, or apply their technique to a problem of your choice.

=== Word context and word meaning ===

The advent of the WWW has given us a huge amount of text data. That data contains many words used in different contexts with different meanings. Can you use the patterns of word context to infer something about word meaning?

For example, consider all of the word co-occurrences with the noun "apple". Now consider the subset of those co-occurrences that appear when the adjective "rotten" comes before "apple". What does that change in co-occurrence data tell you about the adjective "rotten"? Does it imply that a rotten apple is no longer something a person would want to eat? In addition, "rotten apple" has a metaphorical meaning (The Free Dictionary defines it as a person with a corrupting influence). Can you detect the multiple meanings from the co-occurrence data?
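
One simple way to start is to collect windowed co-occurrence counts for a target word, optionally restricted to occurrences preceded by a given adjective, and then compare the two count distributions.  The sketch below is a hypothetical helper for illustration, assuming the corpus is already tokenized; it is not a prescribed method.

<source lang="python">
from collections import Counter

def context_counts(tokens, target, window=5, preceded_by=None):
    """Count words co-occurring with `target` within a +/- window.

    If `preceded_by` is given, only occurrences of `target` immediately
    preceded by that word are counted (e.g. "rotten apple").
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if preceded_by is not None and (i == 0 or tokens[i - 1] != preceded_by):
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# compare contexts of "apple" in general vs. "apple" preceded by "rotten"
tokens = "i ate a fresh apple then threw away a rotten apple in the trash".split()
print(context_counts(tokens, "apple"))
print(context_counts(tokens, "apple", preceded_by="rotten"))
</source>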

=== Classifying into a large hierarchy ===

Can you use the structure of a hierarchy of labels to improve the classification of documents (or anything else) into that hierarchy? There are many approaches to this problem. One is discussed in this paper: http://users.cis.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p213.pdf  You could propose a new one, or extend an existing one.

For this project you could use the Reuters news wire data and its hierarchical labels, or propose another large hierarchical data set.
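
As a concrete baseline to compare hierarchy-aware methods against, here is a minimal sketch of a top-down classifier that trains one flat classifier per internal node of the label tree and routes each example from the root to a leaf.  The class name, the toy tree encoding, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the approach of the linked paper.

<source lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

class TopDownHierarchicalClassifier:
    """Train one flat classifier per internal node of a label tree and
    classify by routing an example from the root down to a leaf."""

    def __init__(self, tree):
        # tree maps each internal node to its child labels, e.g.
        # {"root": ["news", "sports"], "news": ["politics", "economy"]}
        self.tree = tree
        self.models = {}

    def fit(self, X, paths):
        # paths[i] is the root-to-leaf label path of example i,
        # e.g. ["root", "news", "politics"]
        for node in self.tree:
            idx, y = [], []
            for i, path in enumerate(paths):
                if node in path[:-1]:
                    idx.append(i)
                    y.append(path[path.index(node) + 1])
            if len(set(y)) > 1:   # need at least two child classes to train
                self.models[node] = LogisticRegression(max_iter=1000).fit(X[idx], y)
        return self

    def predict_one(self, x):
        node, path = "root", []
        while node in self.models:
            node = self.models[node].predict(x.reshape(1, -1))[0]
            path.append(node)
        return path

# toy usage with 2-d points standing in for document vectors
tree = {"root": ["A", "B"], "A": ["A1", "A2"]}
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.9, 0.0]])
paths = [["root", "A", "A1"], ["root", "A", "A2"], ["root", "B"], ["root", "B"]]
clf = TopDownHierarchicalClassifier(tree).fit(X, paths)
print(clf.predict_one(np.array([0.05, 0.95])))
</source>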

=== Experimentally analyzing performance of large-scale learning systems ===

During the course we will frequently discuss possible ways of speeding up or scaling up a learning system. Some we'll talk about in detail, and some we won't. Some are plausible but haven't been explored in any depth in the literature, either because they are new ideas (like using approximate counts for Naive Bayes) or because they rely on using newer types of hardware (SSD disks or GPUs). Pick some of these methods and test them experimentally on a large dataset - see how fast you can make them!
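
As one example of the "approximate counts for Naive Bayes" idea, here is a minimal sketch that stores (class, word) counts in a count-min sketch, a fixed-memory structure whose estimates are never below the true count, and uses them to compute smoothed conditional probabilities.  The table sizes, hashing scheme, and helper names are illustrative assumptions, not a reference implementation.

<source lang="python">
import hashlib
import math

class CountMinSketch:
    """Fixed-memory approximate counter: estimates never under-count,
    and the over-count is bounded with high probability."""

    def __init__(self, width=100000, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # one hashed column per row, derived from a salted digest of the key
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

def log_cond_prob(sketch, word, label, vocab_size, class_total):
    # Naive Bayes log P(word | class) from sketched counts, add-one smoothed
    return math.log((sketch.estimate((label, word)) + 1) / (class_total + vocab_size))

# toy usage: count words for one class, then query a conditional probability
cms = CountMinSketch()
for w in "the rotten apple spoils the barrel".split():
    cms.add(("neg", w))
print(log_cond_prob(cms, "rotten", "neg", vocab_size=50000, class_total=6))
</source>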