Difference between revisions of "Class meeting for 10-405 Workflows For Hadoop"

Revision as of 17:41, 7 February 2018

Pig: none required. A good online resource for Pig is the online version of the O'Reilly book Programming Pig.
Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.

Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
Schapire et al., Boosting and Rocchio applied to text filtering, SIGIR 98.

The TFIDF representation for documents.
What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
How joins are implemented in dataflow (and the difference between map-side and reduce-side joins)
What the PageRank algorithm is
Common ways of representing graphs in map-reduce systems
- A list of edges
- A list of nodes with outlinks
Why iteration is often expensive in pure dataflow algorithms.
How Spark differs from and/or is similar to other dataflow systems
- Actions/transformations
- RDDs
- Caching
Definition of a similarity join/soft join.
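
To make the PageRank and graph-representation items in the list above more concrete, here is a minimal single-machine sketch in Python. It is not the course's reference implementation. The function names, the damping value of 0.85, and the handling of nodes without outlinks (their rank is spread evenly over all nodes) are assumptions chosen for illustration. The graph is stored as a list of nodes with outlinks, and a small helper converts from a list of edges. Each iteration is written as a map step followed by a reduce step, which also hints at why iteration is expensive in pure dataflow: every pass reprocesses the whole graph.

```python
from collections import defaultdict


def edges_to_outlinks(edges):
    """Convert a list of (src, dst) edges into a node -> outlinks mapping."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
        graph.setdefault(dst, [])  # make sure sink nodes appear as keys too
    return dict(graph)


def pagerank_step(graph, ranks, damping=0.85):
    """One PageRank iteration, phrased as a map phase and a reduce phase."""
    n = len(graph)

    # Map phase: every node sends an equal share of its rank along each outlink.
    messages = []
    dangling = 0.0  # total rank held by nodes with no outlinks
    for node, outlinks in graph.items():
        if outlinks:
            share = ranks[node] / len(outlinks)
            for target in outlinks:
                messages.append((target, share))
        else:
            dangling += ranks[node]

    # Reduce phase: sum the shares that arrive at each node.
    incoming = defaultdict(float)
    for target, share in messages:
        incoming[target] += share

    # Assumed convention: dangling rank is redistributed uniformly.
    return {
        node: (1 - damping) / n + damping * (incoming[node] + dangling / n)
        for node in graph
    }


def pagerank(graph, iterations=20, damping=0.85):
    """Run a fixed number of iterations from a uniform starting vector."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        ranks = pagerank_step(graph, ranks, damping)
    return ranks


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
graph = edges_to_outlinks(edges)
ranks = pagerank(graph, iterations=50)
```

In a real map-reduce job the messages would be shuffled by target node between the two phases, and the outlink lists would have to be passed along (or re-joined) on every iteration.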

How to implement k-means in a map-reduce setting with dataflow
- Not discussed in class, but in the slide deck!
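
As a rough illustration of the k-means item above, the sketch below writes one k-means iteration as a map phase and a reduce phase in plain Python. It is not taken from the slide deck. The function names, the use of squared Euclidean distance, and the choice to leave a centroid unchanged when no points are assigned to it are assumptions made for this example.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Emit (centroid_id, (point, 1)) for every input point."""
    for point in points:
        yield closest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Sum points and counts per centroid id, then average to get new centroids."""
    sums = {}
    for cid, (point, count) in pairs:
        if cid in sums:
            total, n = sums[cid]
            sums[cid] = ([t + p for t, p in zip(total, point)], n + count)
        else:
            sums[cid] = (list(point), count)

    # Assumed convention: a centroid with no assigned points stays where it was.
    new_centroids = list(centroids)
    for cid, (total, n) in sums.items():
        new_centroids[cid] = tuple(t / n for t in total)
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds starting from the given centroids."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = kmeans(points, [(0, 0), (10, 10)])
```

Because the reduce step only needs per-cluster sums and counts, the same partial sums could be computed in a combiner before the shuffle. The current centroids are small enough to be shipped to every mapper, much like the small table in a map-side join.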

@@ Line 39: / Line 39: @@
 ** Caching
 * Definition of a similarity join/soft join.
-* Why inverted indices make TFIDF representations useful for similarity joins
-** e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
+* How to implement k-means in a map-reduce setting with dataflow
+** Not discussed in class, but in the slide deck!