Class meeting for 10-605 Workflows For Hadoop
From Cohen Courses
Latest revision as of 12:37, 19 September 2017
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2017.
Slides
- First lecture: Slides in Powerpoint, in PDF.
- Second lecture: Slides in Powerpoint, in PDF.
- Third lecture: Slides in Powerpoint, in PDF.
- Catchup on simjoins: Slides in Powerpoint, in PDF.
Quizzes
- Quiz for first lecture
- Quiz for second lecture
- Quiz for third lecture
Readings
- Pig: none required. A nice online resource for Pig is the online version of the O'Reilly book Programming Pig.
- Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, 1971.
- Schapire et al., Boosting and Rocchio Applied to Text Filtering, SIGIR 1998.
Things to Remember
- The TFIDF representation for documents.
- What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
- How joins are implemented in dataflow (and the difference between map-side and reduce-side joins)
- What the PageRank algorithm is
- Common ways of representing graphs in a map-reduce system
  - A list of edges
  - A list of nodes with outlinks
- Why iteration is often expensive in pure dataflow algorithms.
- How Spark differs from and/or is similar to other dataflow systems
  - Actions/transformations
  - RDDs
  - Caching
- Definition of a similarity join/soft join.
- Why inverted indices make TFIDF representations useful for similarity joins
  - e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
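
The last few items above (TFIDF, similarity joins, and inverted indices) can be illustrated with a small in-memory sketch in Python. This is not the code from the lectures: the function names, the toy strings, and the threshold of 0.3 are all hypothetical, and the weighting log(1 + tf) * log(N / df) is only one common TFIDF variant, assumed here for illustration. The idea it shows is that the join only scores pairs that share at least one term, by walking the posting lists of an inverted index.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map each doc id to a unit-length TFIDF vector (dict: term -> weight).

    Assumed weighting: log(1 + tf) * log(N / df), one variant among many.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        vec = {t: w for t, w in vec.items() if w > 0}  # drop terms found in every doc
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
    return vectors


def build_index(vectors):
    """Inverted index: term -> list of (doc id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def similarity_join(left_vecs, right_vecs, threshold):
    """Return (left id, right id, cosine) for pairs scoring at least threshold."""
    index = build_index(right_vecs)
    results = []
    for left_id, left_vec in left_vecs.items():
        scores = defaultdict(float)
        # Only right-side docs sharing a term with left_id are ever touched.
        for term, lw in left_vec.items():
            for right_id, rw in index.get(term, ()):
                scores[right_id] += lw * rw
        results.extend((left_id, r, s) for r, s in scores.items() if s >= threshold)
    return results


# Hypothetical toy data: two small "tables" of names to be soft-joined.
docs = {
    "a1": "carnegie mellon university".split(),
    "a2": "university of pittsburgh".split(),
    "b1": "carnegie mellon univ".split(),
    "b2": "pittsburgh university".split(),
}
vectors = tfidf_vectors(docs)  # IDF is computed over both tables together
left = {k: v for k, v in vectors.items() if k.startswith("a")}
right = {k: v for k, v in vectors.items() if k.startswith("b")}
pairs = similarity_join(left, right, threshold=0.3)
print(sorted(pairs))
```

In this toy data the frequent word "university" has a low IDF, so it contributes little to any score, while rarer words such as "carnegie" and "mellon" have high IDF, short posting lists, and most of the weight. That is why the index-driven join does little work on the terms that matter most. A distributed version would express the same steps as joins and group-bys in a dataflow language; this sketch only shows the logic on a single machine.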