Difference between revisions of "Class meeting for 10-605 Rocchio and Hadoop Workflows"

From Cohen Courses
 
(10 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Spring_2014]].
+
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall_2015]].
  
 
=== Slides ===
 
  
* [http://www.cs.cmu.edu/~wcohen/10-605/phrases.pptx Phrase-Finding]
+
Workflows for Hadoop:
* [http://www.cs.cmu.edu/~wcohen/10-605/details.pptx Performance Details - Sorting and Unix Pipes]
+
 
* [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pptx Another Fast Streaming Learning Algorithm]
+
* [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pptx Workflows for Hadoop - Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pdf PDF]
 +
* The phrases example:
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phrases.pig PIG source code]
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/SmoothedPKL.java Java source code]
 +
* Some other examples:
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl-naive.pig Naive Similarity Join]
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl.pig Optimized Similarity Join]
 +
 
 +
Rocchio:
 +
 
 +
* [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pptx Rocchio - Another Fast Streaming Learning Algorithm - PPT], [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pdf PDF]
 +
Also:
 +
* [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/tips-for-debugging-pig.txt My comments on debugging PIG.]
 +
 
 +
=== Readings ===
 +
 
 +
* Pig: none required. A useful resource for PIG is the online version of the O'Reilly book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].
  
 
=== Readings for the Class ===
 
Line 16: Line 32:
 
* Schapire et al., [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.
 
* Littlestone, [http://www.springerlink.com/index/X1022977778L1777.pdf Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm], MLJ 1988. Includes the mistake-bound theory.
 
 +
 +
=== Things to Remember ===
 +
 +
* The TFIDF representation for documents.
 +
* The Rocchio algorithm.
 +
* Why Rocchio is easy to parallelize.

Latest revision as of 16:16, 14 October 2015

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2015.

Slides

Workflows for Hadoop:

Rocchio:

Also:

Readings

  • Pig: none required. A useful resource for PIG is the online version of the O'Reilly book Programming Pig.

Readings for the Class

Also discussed

Things to Remember

  • The TFIDF representation for documents.
  • The Rocchio algorithm.
  • Why Rocchio is easy to parallelize.
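
The three points above can be illustrated with a small sketch. It is not taken from the lecture slides: the TF-IDF weighting log(1 + tf) * log(N / df), the values alpha=1.0 and beta=0.25, and all function names (doc_frequencies, tfidf, class_sums, merge, prototypes, classify) are assumptions chosen for illustration. The sketch builds unit-length TF-IDF vectors, forms one Rocchio prototype per class from per-class sums, and shows that those sums can be computed on separate shards of the data and merged by plain addition, which is one way to see why the method parallelizes easily.

```python
import math
from collections import Counter, defaultdict


def doc_frequencies(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def tfidf(words, df, n_docs):
    """Unit-length TF-IDF vector (dict); weights are an assumed variant."""
    vec = {}
    for w, tf in Counter(words).items():
        if df[w] > 0:  # skip words never seen in the training corpus
            vec[w] = math.log(1 + tf) * math.log(n_docs / df[w])
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else {}


def class_sums(labeled_vectors):
    """Per-class vector sums and document counts for one shard of the data."""
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter()
    for label, vec in labeled_vectors:
        counts[label] += 1
        for w, x in vec.items():
            sums[label][w] += x
    return sums, counts


def merge(a, b):
    """Combine the statistics of two shards; addition is all that is needed."""
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter()
    for shard_sums, shard_counts in (a, b):
        counts.update(shard_counts)
        for label, vec in shard_sums.items():
            for w, x in vec.items():
                sums[label][w] += x
    return sums, counts


def prototypes(sums, counts, alpha=1.0, beta=0.25):
    """Prototype per class: alpha * mean(in class) - beta * mean(other classes)."""
    n = sum(counts.values())
    total = defaultdict(float)  # sum of all document vectors
    for vec in sums.values():
        for w, x in vec.items():
            total[w] += x
    protos = {}
    for label, vec in sums.items():
        n_pos, n_neg = counts[label], n - counts[label]
        proto = {}
        for w, t in total.items():
            pos = vec.get(w, 0.0)
            value = alpha * pos / n_pos
            if n_neg > 0:
                value -= beta * (t - pos) / n_neg
            proto[w] = value
        protos[label] = proto
    return protos


def classify(vec, protos):
    """Return the class whose prototype has the largest dot product with vec."""
    def score(label):
        return sum(x * protos[label].get(w, 0.0) for w, x in vec.items())
    return max(protos, key=score)


# Toy data, invented for this example.
train = [("sports", "ball game team win"), ("sports", "team score game"),
         ("politics", "vote election law"), ("politics", "law vote senate")]
docs = [text.split() for _, text in train]
df = doc_frequencies(docs)
vectors = [(label, tfidf(words, df, len(docs)))
           for (label, _), words in zip(train, docs)]

# Two "shards" processed independently, then merged.
sums, counts = merge(class_sums(vectors[0::2]), class_sums(vectors[1::2]))
protos = prototypes(sums, counts)
print(classify(tfidf("team game".split(), df, len(docs)), protos))
```

In a Hadoop-style setting, class_sums would play the role of the map or combine step over one input split and merge the role of the reduce step. The document frequencies would need a separate counting pass of the same kind.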