Difference between revisions of "Class meeting for 10-605 Workflows For Hadoop"
Latest revision as of 12:37, 19 September 2017
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2017.
Slides
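- First lecture: Slides in PowerPoint (http://www.cs.cmu.edu/~wcohen/10-605/workflows-1.pptx), in PDF (http://www.cs.cmu.edu/~wcohen/10-605/workflows-1.pdf).
- Second lecture: Slides in PowerPoint (http://www.cs.cmu.edu/~wcohen/10-605/workflows-2.pptx), in PDF (http://www.cs.cmu.edu/~wcohen/10-605/workflows-2.pdf).
- Third lecture: Slides in PowerPoint (http://www.cs.cmu.edu/~wcohen/10-605/workflows-3.pptx), in PDF (http://www.cs.cmu.edu/~wcohen/10-605/workflows-3.pdf).
- Catchup on simjoins: Slides in PowerPoint (http://www.cs.cmu.edu/~wcohen/10-605/simjoins-catchup.pptx), in PDF (http://www.cs.cmu.edu/~wcohen/10-605/simjoins-catchup.pdf).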
Quizzes
- Quiz for first lecture (https://qna.cs.cmu.edu/#/pages/view/170)
- Quiz for second lecture (https://qna.cs.cmu.edu/#/pages/view/175)
- Quiz for third lecture (https://qna.cs.cmu.edu/#/pages/view/178)
Readings
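- Pig: none required. A good online resource for Pig is the online version of the O'Reilly book Programming Pig (http://chimera.labs.oreilly.com/books/1234000001811/index.html).
- Optional: Introduction to Information Retrieval (http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html), by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model (http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html), including Rocchio's method.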
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al., Boosting and Rocchio Applied to Text Filtering. SIGIR 1998.
Things to Remember
- The TFIDF representation for documents.
- What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
- How joins are implemented in dataflow (and the difference between map-side and reduce-side joins).
- What the PageRank algorithm is.
- Common ways of representing graphs in map-reduce systems:
  - A list of edges
  - A list of nodes with outlinks
- Why iteration is often expensive in pure dataflow algorithms.
- How Spark differs from, and is similar to, other dataflow systems:
  - Actions/transformations
  - RDDs
  - Caching
- Definition of a similarity join/soft join.
- Why inverted indices make TFIDF representations useful for similarity joins.
  - e.g., whether high-IDF words have shorter or longer index entries (posting lists), and more or less impact on a similarity measure
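
The last few items above (TFIDF, similarity joins, inverted indices) can be tied together in a small single-machine sketch. This is only an illustration, not the course's implementation: the function names, the toy records, the 0.3 threshold, and the particular weighting log(tf + 1) * log(N / df) are all assumptions chosen for brevity, and a real dataflow version would express the same steps as group-by and join operations.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map each doc id to a unit-length TFIDF vector (dict: term -> weight).

    Assumed weighting (one common variant): log(tf + 1) * log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vec = {t: math.log(c + 1) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Drop zero-weight terms; the filter also guards the division.
        vectors[doc_id] = {t: w / norm for t, w in vec.items() if norm > 0 and w > 0}
    return vectors


def similarity_join(left, right, threshold):
    """Return sorted (left_id, right_id, cosine) triples scoring >= threshold."""
    # Inverted index over the right side: term -> list of (doc id, weight).
    index = defaultdict(list)
    for doc_id, vec in right.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    results = []
    for left_id, vec in left.items():
        # Only documents that share at least one term are ever touched.
        scores = defaultdict(float)
        for term, weight in vec.items():
            for right_id, right_weight in index.get(term, []):
                scores[right_id] += weight * right_weight
        results.extend((left_id, r, s) for r, s in scores.items() if s >= threshold)
    return sorted(results)


# Hypothetical toy records: two lists of institution names to be soft-joined.
docs = {
    "a1": "carnegie mellon university".split(),
    "a2": "university of pittsburgh".split(),
    "b1": "carnegie mellon univ".split(),
    "b2": "pittsburgh university".split(),
}
vectors = tfidf_vectors(docs)
left = {k: v for k, v in vectors.items() if k.startswith("a")}
right = {k: v for k, v in vectors.items() if k.startswith("b")}
for left_id, right_id, score in similarity_join(left, right, threshold=0.3):
    print(left_id, right_id, round(score, 3))
```

In this sketch a rare (high-IDF) term has a short posting list in the index and a large weight in each vector, so it is cheap to process and contributes most to the score, while a common term such as "university" has a long posting list and contributes little.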