Difference between revisions of "Class meeting for 10-605 Workflows For Hadoop"

From Cohen Courses
Line 3:
  === Slides ===
- * Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pdf in PDF],
+ * First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pdf in PDF].
+ * Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflow-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflow-2.pdf in PDF].
  === Quiz ===

Revision as of 16:16, 13 September 2016

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2016.

Slides

Quiz

Readings

  • Pig: none required. A nice online resource for Pig is the online version of the O'Reilly book Programming Pig.

Readings for the Class

Also discussed

Things to Remember

  • The TFIDF representation for documents.
  • The Rocchio algorithm.
  • Why Rocchio is easy to parallelize.
  • Definition of a similarity join/soft join.
  • Why inverted indices make TFIDF representations useful for similarity joins
    • e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
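
The items above can be tied together in a small single-machine sketch. It is only an illustration under stated assumptions, not the formulation used in the lecture slides: the TFIDF weighting (log(1 + tf) * log(N / df), scaled to unit length), the simplified Rocchio variant (one centroid per class, no negative-prototype term), the toy documents, and all function names are hypothetical choices made here.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: dict doc_id -> list of tokens. Returns (unit-length vectors, idf)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: count each word once per doc
    idf = {w: math.log(n / df[w]) for w in df}  # assumed IDF variant
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vec = {w: math.log(1 + c) * idf[w] for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        # Drop zero-weight words (IDF 0) so they never enter the index.
        vectors[doc_id] = (
            {w: x / norm for w, x in vec.items() if x > 0} if norm > 0 else {}
        )
    return vectors, idf


def build_inverted_index(vectors):
    """Map each word to its postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for w, x in vec.items():
            index[w].append((doc_id, x))
    return index


def similarity_self_join(vectors, threshold):
    """Return {(a, b): cosine} for a < b with cosine >= threshold.

    Only documents sharing at least one indexed word are ever compared,
    so rare (high-IDF) words cost little and contribute a lot.
    """
    index = build_inverted_index(vectors)
    pairs = {}
    for a, vec in vectors.items():
        scores = defaultdict(float)
        for w, x in vec.items():
            for b, y in index[w]:
                if a < b:
                    scores[b] += x * y
        for b, s in scores.items():
            if s >= threshold:
                pairs[(a, b)] = s
    return pairs


def rocchio_centroids(vectors, labels):
    """Simplified Rocchio: average the TFIDF vectors of each class.

    Each centroid is a sum over documents, so partial sums computed on
    separate shards can simply be added together before dividing.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for doc_id, vec in vectors.items():
        y = labels[doc_id]
        counts[y] += 1
        for w, x in vec.items():
            sums[y][w] += x
    return {y: {w: s / counts[y] for w, s in ws.items()} for y, ws in sums.items()}


def classify(vec, centroids):
    """Pick the class whose centroid has the largest dot product with vec."""
    def dot(c):
        return sum(x * c.get(w, 0.0) for w, x in vec.items())
    return max(centroids, key=lambda y: dot(centroids[y]))


# Toy data (hypothetical): "the" appears everywhere, so its IDF is 0.
docs = {
    "d1": ["the", "hadoop", "workflow"],
    "d2": ["the", "hadoop", "cluster"],
    "d3": ["the", "pig", "script"],
}
labels = {"d1": "hadoop", "d2": "hadoop", "d3": "pig"}

vectors, idf = tfidf_vectors(docs)
index = build_inverted_index(vectors)
pairs = similarity_self_join(vectors, threshold=0.1)
centroids = rocchio_centroids(vectors, labels)
```

In this toy collection the rarer word "workflow" has a higher IDF and a shorter postings list than "hadoop", while "the" gets weight zero and never enters the index, which is the trade-off the last bullet asks about.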