Difference between revisions of "Class meeting for 10-605 Workflows For Hadoop"

Revision as of 16:17, 13 September 2016

Pig: none required. A nice on-line resource for Pig is the on-line version of the O'Reilly book Programming Pig.
Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.

Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the International Conference on Machine Learning (ICML), 1997.
Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
Schapire et al., Boosting and Rocchio Applied to Text Filtering. SIGIR 1998.

Definition of a similarity join/soft join.
Why inverted indices make TFIDF representations useful for similarity joins.
- e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
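
As a rough illustration of the two points above, here is a minimal Python sketch of a TFIDF similarity join driven by an inverted index. It is not taken from the lecture: the function names, the whitespace tokenizer, and the weighting log(1 + tf) * log(N / df) are all assumptions chosen for brevity, and document frequencies are computed over the union of both relations.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def document_frequencies(texts):
    """Count, for each term, how many of the texts contain it."""
    return Counter(term for text in texts for term in set(tokenize(text)))


def tfidf_vector(text, df, n_docs):
    """Unit-length TFIDF vector as a dict of term -> weight.

    The weighting log(1 + tf) * log(N / df) is one common variant (assumed here).
    Terms with zero weight (present in every document) are dropped.
    """
    tf = Counter(tokenize(text))
    vec = {t: math.log(1 + c) * math.log(n_docs / df[t])
           for t, c in tf.items() if df[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in vec.items() if w > 0}


def build_inverted_index(vectors):
    """Map each term to its postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def similarity_join(left, right, threshold=0.5):
    """Return (left_id, right_id, cosine) for pairs scoring at least threshold."""
    all_texts = list(left.values()) + list(right.values())
    df = document_frequencies(all_texts)
    n_docs = len(all_texts)
    right_vecs = {rid: tfidf_vector(t, df, n_docs) for rid, t in right.items()}
    index = build_inverted_index(right_vecs)

    pairs = []
    for lid, text in left.items():
        scores = defaultdict(float)
        # Only documents sharing at least one term are ever touched.
        for term, weight in tfidf_vector(text, df, n_docs).items():
            for rid, rweight in index.get(term, ()):
                scores[rid] += weight * rweight
        pairs.extend((lid, rid, s) for rid, s in scores.items() if s >= threshold)
    return sorted(pairs, key=lambda p: -p[2])


if __name__ == "__main__":
    left = {"a1": "carnegie mellon university", "a2": "university of pittsburgh"}
    right = {"b1": "carnegie mellon univ", "b2": "pittsburgh university",
             "b3": "stanford university"}
    for lid, rid, score in similarity_join(left, right, threshold=0.4):
        print(lid, rid, round(score, 3))
```

On this toy data the common word "university" has the longest postings list but a low IDF, so it adds little to any score, while rarer words such as "carnegie" and "pittsburgh" have short postings lists and account for most of the similarity of the pairs that pass the threshold.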

Old revision: 16:17, 13 September 2016, Wcohen (→ Readings for the Class). New revision: 16:17, 13 September 2016, Wcohen (→ Readings).
Line 13:

	* Pig: none required. A nice on-line resource for Pig is the on-line version of the O'Reilly book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].
−
−
	*Optional: [http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.