Difference between revisions of "Class meeting for 10-605 Workflows For Hadoop"
From Cohen Courses
Revision as of 16:17, 13 September 2016
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2016.
Slides
- First lecture: Slides in Powerpoint, in PDF.
- Second lecture: Slides in Powerpoint, in PDF.
Quiz
- Quiz for first lecture.
Readings
- Pig: none required. A good online resource for Pig is the online version of the O'Reilly book Programming Pig.
- Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al., Boosting and Rocchio applied to text filtering, SIGIR 98.
Things to Remember
- The TFIDF representation for documents.
- The Rocchio algorithm.
- Why Rocchio is easy to parallelize.
- Definition of a similarity join/soft join.
- Why inverted indices make TFIDF representations useful for similarity joins, e.g., whether high-IDF words have shorter or longer index entries (posting lists), and whether they have more or less impact on a similarity measure.
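The items above can be tied together in a small sketch. This is an illustration under stated assumptions, not the formulation from the lecture slides: the weighting log(1 + TF) * log(N / DF) is one common TFIDF variant, the Rocchio step uses only per-class centroids (the term that subtracts negative examples is omitted), and every function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Map tokenized docs to unit-length TFIDF vectors (dict: term -> weight).
    Assumed weighting: log(1 + tf) * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Drop zero-weight terms (those that occur in every document).
        vectors.append({t: w / norm for t, w in vec.items() if w > 0} if norm > 0 else {})
    return vectors

def dot(u, v):
    """Sparse dot product; iterate over the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rocchio_centroids(vectors, labels):
    """Simplified Rocchio: one centroid per class (mean of its vectors).
    The per-class sums are additive, so shards of the data can be summed
    independently and merged, which is what makes this easy to parallelize."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            sums[y][t] += w
    return {y: {t: w / counts[y] for t, w in s.items()} for y, s in sums.items()}

def classify(vec, centroids):
    """Pick the class whose centroid has the largest dot product with vec."""
    return max(centroids, key=lambda y: dot(vec, centroids[y]))

def build_index(vectors):
    """Inverted index: term -> posting list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for t, w in vec.items():
            index[t].append((doc_id, w))
    return index

def similarity_self_join(vectors, index, threshold):
    """Soft self-join: pairs (i, j, score) with i < j and score >= threshold.
    Only documents sharing at least one term are ever scored."""
    pairs = []
    for i, vec in enumerate(vectors):
        scores = defaultdict(float)
        for t, w in vec.items():
            for j, wj in index.get(t, ()):
                if i < j:
                    scores[j] += w * wj
        pairs.extend((i, j, s) for j, s in scores.items() if s >= threshold)
    return pairs

# Toy corpus (made up for illustration).
docs = [
    ["pig", "hadoop", "workflow"],
    ["hadoop", "pig", "latin", "script"],
    ["rocchio", "tfidf", "classifier"],
    ["tfidf", "rocchio", "text", "classifier"],
]
labels = ["systems", "systems", "ml", "ml"]

vectors = tfidf_vectors(docs)
centroids = rocchio_centroids(vectors, labels)
index = build_index(vectors)
pairs = similarity_self_join(vectors, index, threshold=0.1)
```

In this toy corpus the rare term "workflow" has a one-entry posting list while "pig" has two entries, and "workflow" also carries the larger weight in its document. That is the point of the last item: high-IDF words are cheap to look up and contribute most to the similarity score, so the join does little work on the terms that matter most.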