Class meeting for 10-605 Rocchio and Hadoop Workflows
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall_2015.
Slides
Workflows for Hadoop:
- Workflows for Hadoop - Powerpoint, PDF
- The phrases example:
- Some other examples:
Rocchio:
Also:
Readings
- Pig: none required. A good online resource for Pig is the online version of the O'Reilly book Programming Pig.
Readings for the Class
- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al, Boosting and Rocchio applied to text filtering, SIGIR 98.
- Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, MLJ 1988. Includes the mistake-bound theory.
Things to Remember
- The TFIDF representation for documents.
- The Rocchio algorithm.
- Why Rocchio is easy to parallelize.
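The three points above can be illustrated with a minimal Python sketch. It is not the code from the slides. It assumes one common TFIDF variant, log(tf + 1) * log(N / df) with unit-length normalization, and a class prototype of the form alpha * (mean of in-class vectors) - beta * (mean of out-of-class vectors). The function names and the default values of `alpha` and `beta` are hypothetical. The parallelization point shows up in `partial_sums` and `merge`: each shard of the data only contributes per-class sums and counts, and those can be added together in any order.

```python
import math
from collections import Counter, defaultdict

def doc_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tfidf(tokens, df, n_docs):
    """Unit-length TFIDF vector (assumed variant: log(tf + 1) * log(N / df))."""
    tf = Counter(tokens)
    vec = {t: math.log(c + 1) * math.log(n_docs / df[t])
           for t, c in tf.items() if df.get(t)}  # skip terms unseen in training
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def partial_sums(labeled_vectors):
    """'Map' side: per-class vector sums and document counts for one shard."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for label, vec in labeled_vectors:
        counts[label] += 1
        for t, w in vec.items():
            sums[label][t] += w
    return sums, counts

def merge(a, b):
    """'Reduce' side: add shard b into shard a (a is modified in place)."""
    sums_a, counts_a = a
    sums_b, counts_b = b
    for label, vec in sums_b.items():
        for t, w in vec.items():
            sums_a[label][t] += w
    counts_a.update(counts_b)  # Counter.update adds counts
    return sums_a, counts_a

def rocchio_prototypes(sums, counts, alpha=1.0, beta=0.25):
    """Prototype per class: alpha * in-class mean - beta * out-of-class mean."""
    total = defaultdict(float)
    for vec in sums.values():
        for t, w in vec.items():
            total[t] += w
    n = sum(counts.values())
    protos = {}
    for label, vec in sums.items():
        n_in, n_out = counts[label], n - counts[label]
        protos[label] = {}
        for t in total:
            inside = vec.get(t, 0.0)
            outside = (total[t] - inside) / n_out if n_out else 0.0
            protos[label][t] = alpha * inside / n_in - beta * outside
    return protos

def classify(vec, protos):
    """Pick the class whose prototype has the largest dot product with vec."""
    def dot(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())
    return max(protos, key=lambda label: dot(vec, protos[label]))
```

In a Hadoop-style workflow, under these assumptions, document frequencies would be computed in one pass, `partial_sums` would run independently on each shard of TFIDF vectors, and `merge` would combine the shard results before `rocchio_prototypes` is applied once at the end.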