Difference between revisions of "Class meeting for 10-605 Rocchio and Hadoop Workflows"

From Cohen Courses
 
(2 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
Workflows for Hadoop:
 
  
* [http://www.cs.cmu.edu/~wcohen/10-605/beyond-hadoop.pptx Workflows for Hadoop]
+ * [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pptx Workflows for Hadoop - Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pdf PDF]
 
* The phrases example:
 
 
** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phrases.pig PIG source code]
 
Line 32: Line 32:
 
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.
 
 
* Littlestone, [http://www.springerlink.com/index/X1022977778L1777.pdf Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm], MLJ 1988. Includes the mistake-bound theory.
 
+
+ === Things to Remember ===
+
+ * The TFIDF representation for documents.
+ * The Rocchio algorithm.
+ * Why Rocchio is easy to parallelize.

Latest revision as of 16:16, 14 October 2015

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2015.

Slides

Workflows for Hadoop:

Rocchio:

Also:

Readings

  • Pig: none required. A good online resource for Pig is the online edition of the O'Reilly book Programming Pig.

Readings for the Class

Also discussed

Things to Remember

  • The TFIDF representation for documents.
  • The Rocchio algorithm.
  • Why Rocchio is easy to parallelize.
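
The three points above can be illustrated with a minimal Python sketch. It is not taken from the course slides: it assumes a simplified nearest-centroid form of Rocchio (the negative-example term is omitted), a log(TF + 1) * log(N / DF) weighting with unit-length normalization, and hypothetical helper names (`tfidf_vectors`, `class_sums`, `merge`, `centroids`, `classify`). The toy documents and labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn token lists into unit-length sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {w: math.log(c + 1) * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors.append({w: x / norm for w, x in vec.items()} if norm > 0 else {})
    return vectors


def class_sums(vectors, labels):
    """Per-class vector sums and document counts for one shard of the data."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for vec, y in zip(vectors, labels):
        counts[y] += 1
        for w, x in vec.items():
            sums[y][w] += x
    return sums, counts


def merge(a, b):
    """Combine the partial results of two shards by plain addition."""
    sums_a, counts_a = a
    sums_b, counts_b = b
    for y, vec in sums_b.items():
        for w, x in vec.items():
            sums_a[y][w] += x
    counts_a.update(counts_b)  # Counter.update adds counts
    return sums_a, counts_a


def centroids(sums, counts):
    """Average each class sum to get one prototype vector per class."""
    return {y: {w: x / counts[y] for w, x in vec.items()} for y, vec in sums.items()}


def classify(vec, cents):
    """Pick the class whose centroid has the largest dot product with vec."""
    def dot(u, v):
        return sum(x * v.get(w, 0.0) for w, x in u.items())
    return max(cents, key=lambda y: dot(vec, cents[y]))


# Toy data (hypothetical).
docs = [
    ["hadoop", "pig", "workflow"],
    ["pig", "latin", "script"],
    ["rocchio", "tfidf", "classifier"],
    ["tfidf", "centroid", "rocchio"],
]
labels = ["systems", "systems", "learning", "learning"]

vectors = tfidf_vectors(docs)

# Train on the whole data set, and on two shards merged afterwards.
whole_sums, whole_counts = class_sums(vectors, labels)
merged_sums, merged_counts = merge(
    class_sums(vectors[:2], labels[:2]),
    class_sums(vectors[2:], labels[2:]),
)
model = centroids(merged_sums, merged_counts)
```

Training touches each document once and only accumulates sums and counts, so shards can be processed independently and their partial results added together in any order. The document frequencies used for the IDF term are also simple counts, so in this sketch they could be gathered the same way in an earlier pass.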