Difference between revisions of "Class meeting for 10-605 Rocchio and Hadoop Workflows"

From Cohen Courses
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall_2015]].
  
 
=== Slides ===
 
  
Workflows for Hadoop:

* [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pptx Workflows for Hadoop - Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pdf PDF]
* The phrases example:
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phrases.pig PIG source code]
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/SmoothedPKL.java Java source code]
* Some other examples:
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl-naive.pig Naive Similarity Join]
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl.pig Optimized Similarity Join]

Rocchio:

* [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pptx Rocchio - Another Fast Streaming Learning Algorithm - PPT], [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pdf PDF]

Also:

* [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/tips-for-debugging-pig.txt My comments on debugging PIG.]

=== Readings ===

* Pig: none required. A good online resource for Pig is the online version of the O'Reilly book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].
  
 
=== Readings for the Class ===
 
 
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.
* Littlestone, [http://www.springerlink.com/index/X1022977778L1777.pdf Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm], MLJ 1988. Includes the mistake-bound theory.
=== Things to Remember ===

* The TFIDF representation for documents.
* The Rocchio algorithm.
* Why Rocchio is easy to parallelize.
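
The three points above can be illustrated with a minimal Python sketch. It is not taken from the lecture slides: the TF-IDF weighting (log(1 + tf) * log(N / df), normalized to unit length), the values of alpha and beta, the toy documents, and all function names are assumptions chosen for illustration. The part relevant to parallelization is that each shard of the data only produces per-class vector sums and counts, and those can be merged by simple addition.

```python
import math
from collections import Counter, defaultdict


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf(tokens, df, n_docs):
    """Unit-length TF-IDF vector (assumed weighting: log(1 + tf) * log(N / df))."""
    tf = Counter(tokens)
    # Terms never seen in training (df == 0) are skipped.
    vec = {t: math.log(1 + c) * math.log(n_docs / df[t])
           for t, c in tf.items() if df[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def partial_sums(labeled_vectors):
    """Per-label vector sums and document counts for one shard of the data."""
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter()
    for label, vec in labeled_vectors:
        counts[label] += 1
        for t, w in vec.items():
            sums[label][t] += w
    return sums, counts


def merge(a, b):
    """Combine two shards' results by addition (updates and returns a)."""
    sums_a, counts_a = a
    sums_b, counts_b = b
    for label, vec in sums_b.items():
        for t, w in vec.items():
            sums_a[label][t] += w
    counts_a.update(counts_b)
    return sums_a, counts_a


def prototypes(sums, counts, alpha=1.0, beta=0.25):
    """Rocchio prototype per label: alpha * mean(in class) - beta * mean(out of class)."""
    total = defaultdict(float)
    for vec in sums.values():
        for t, w in vec.items():
            total[t] += w
    n_all = sum(counts.values())
    protos = {}
    for label, vec in sums.items():
        n_pos, n_neg = counts[label], n_all - counts[label]
        proto = {}
        for t, w_all in total.items():
            w_pos = vec.get(t, 0.0)
            w_neg = w_all - w_pos
            proto[t] = alpha * w_pos / n_pos - (beta * w_neg / n_neg if n_neg else 0.0)
        protos[label] = proto
    return protos


def classify(vec, protos):
    """Return the label whose prototype has the largest dot product with vec."""
    return max(protos, key=lambda y: sum(w * protos[y].get(t, 0.0) for t, w in vec.items()))


# Toy data (hypothetical), split into two shards as two workers might see it.
train = [
    ("sports", "the team won the match".split()),
    ("sports", "the coach praised the team".split()),
    ("finance", "the bank raised interest rates".split()),
    ("finance", "the market and the bank fell".split()),
]
n = len(train)
df = document_frequencies(tokens for _, tokens in train)
vectors = [(label, tfidf(tokens, df, n)) for label, tokens in train]
shard1, shard2 = vectors[::2], vectors[1::2]

model = prototypes(*merge(partial_sums(shard1), partial_sums(shard2)))
prediction = classify(tfidf("the team lost the match".split(), df, n), model)
```

In this sketch the document frequencies are computed in a separate pass before the vectors are built; on a cluster that pass is itself a word-count-style job, which is one reason the whole pipeline fits a streaming or map-reduce workflow.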

Latest revision as of 16:16, 14 October 2015
