Difference between revisions of "Class meeting for 10-605 Rocchio and Hadoop Workflows"
From Cohen Courses
(6 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in | + | This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall_2015]]. |
=== Slides === | === Slides === | ||
− | * [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pptx Rocchio - Another Fast Streaming Learning Algorithm] | + | Workflows for Hadoop: |
+ | |||
+ | * [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pptx Workflows for Hadoop - Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/e_beyond-hadoop.pdf PDF] | ||
+ | * The phrases example: | ||
+ | ** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phrases.pig PIG source code] | ||
+ | ** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/SmoothedPKL.java Java source code] | ||
+ | * Some other examples: | ||
+ | ** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl-naive.pig Naive Similarity Join] | ||
+ | ** [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/phirl.pig Optimized Similarity Join] | ||
+ | |||
+ | Rocchio: | ||
+ | |||
+ | * [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pptx Rocchio - Another Fast Streaming Learning Algorithm - PPT], [http://www.cs.cmu.edu/~wcohen/10-605/rocchio.pdf PDF] | ||
+ | Also: | ||
+ | * [http://www.cs.cmu.edu/~wcohen/10-605/pig-example/tips-for-debugging-pig.txt My comments on debugging PIG.] | ||
+ | |||
+ | === Readings === | ||
+ | |||
+ | * Pig: none required. A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig]. | ||
=== Readings for the Class === | === Readings for the Class === | ||
Line 14: | Line 32: | ||
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98. | * Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98. | ||
* Littlestone, [http://www.springerlink.com/index/X1022977778L1777.pdf Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm], MLJ 1988. Includes the mistake-bound theory. | * Littlestone, [http://www.springerlink.com/index/X1022977778L1777.pdf Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm], MLJ 1988. Includes the mistake-bound theory. | ||
+ | |||
+ | === Things to Remember === | ||
+ | |||
+ | * The TFIDF representation for documents. | ||
+ | * The Rocchio algorithm. | ||
+ | * Why Rocchio is easy to parallelize. |
Latest revision as of 16:16, 14 October 2015
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall_2015.
Slides
Workflows for Hadoop:
- Workflows for Hadoop - Powerpoint, PDF
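- The phrases example:
  - PIG source code
  - Java source code
- Some other examples:
  - Naive Similarity Join
  - Optimized Similarity Join
Rocchio:
- Rocchio - Another Fast Streaming Learning Algorithm - PPT, PDF
Also:
- My comments on debugging PIG.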
Readings
- Pig: none required. A nice resource for Pig is the online version of the O'Reilly book Programming Pig.
Readings for the Class
- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, Relevance Feedback in Information Retrieval, in The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al, Boosting and Rocchio applied to text filtering, SIGIR 98.
- Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, MLJ 1988. Includes the mistake-bound theory.
Things to Remember
- The TFIDF representation for documents.
- The Rocchio algorithm.
- Why Rocchio is easy to parallelize.
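The three points above can be sketched together in a few lines of Python. This is a minimal illustration, not the code from the slides. It assumes one common TFIDF variant (weight = log(tf + 1) * log(N / df), scaled to unit length) and a prototype per class equal to alpha times the class centroid minus beta times the centroid of all other documents. The function names, the toy documents, and the values alpha = 1.0 and beta = 0.25 are all hypothetical. The parallelization point shows up in `partial_sums` and `merge`: the per-class vector sums are additive, so each shard can be processed independently and the results combined afterwards.

```python
import math
from collections import Counter, defaultdict

def doc_frequencies(docs):
    """For each word, count how many documents contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df

def tfidf(words, df, n_docs):
    """Unit-length TFIDF vector (assumed variant: log(tf+1) * log(N/df))."""
    tf = Counter(words)
    vec = {w: math.log(c + 1) * math.log(n_docs / df[w])
           for w, c in tf.items() if df.get(w)}  # skip unseen words
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else {}

def partial_sums(shard, df, n_docs):
    """One streaming pass over a shard: per-label vector sums and counts."""
    sums, counts = defaultdict(Counter), Counter()
    for label, words in shard:
        counts[label] += 1
        for w, x in tfidf(words, df, n_docs).items():
            sums[label][w] += x
    return sums, counts

def merge(a, b):
    """Combine two partial results; addition makes this order-independent."""
    sums, counts = defaultdict(Counter), Counter()
    for part_sums, part_counts in (a, b):
        for label, n in part_counts.items():
            counts[label] += n
        for label, vec in part_sums.items():
            for w, x in vec.items():
                sums[label][w] += x
    return sums, counts

def prototypes(sums, counts, alpha=1.0, beta=0.25):
    """Per-class prototype: alpha * own centroid - beta * others' centroid."""
    total = Counter()
    for vec in sums.values():
        for w, x in vec.items():
            total[w] += x
    n = sum(counts.values())
    protos = {}
    for label, vec in sums.items():
        n_pos, n_neg = counts[label], n - counts[label]
        protos[label] = {
            w: alpha * vec.get(w, 0.0) / n_pos
               - (beta * (total[w] - vec.get(w, 0.0)) / n_neg if n_neg else 0.0)
            for w in total
        }
    return protos

def classify(words, protos, df, n_docs):
    """Pick the label whose prototype has the largest inner product."""
    v = tfidf(words, df, n_docs)
    return max(protos, key=lambda y: sum(x * protos[y].get(w, 0.0)
                                         for w, x in v.items()))

# Toy data (hypothetical), split into two shards as a cluster might do.
shard1 = [("sports", ["ball", "goal", "team"]), ("tech", ["code", "bug", "compile"])]
shard2 = [("sports", ["team", "win", "goal"]), ("tech", ["code", "test", "bug"])]
all_docs = shard1 + shard2
df = doc_frequencies(words for _, words in all_docs)
n_docs = len(all_docs)

merged = merge(partial_sums(shard1, df, n_docs), partial_sums(shard2, df, n_docs))
protos = prototypes(*merged)
print(classify(["goal", "team"], protos, df, n_docs))
```

Note that the document frequencies need their own pass over the data before the vectors can be built. That pass is also a sum of per-shard counts, so it parallelizes the same way.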