Class meeting for 10-405 Streaming Naive Bayes
From Cohen Courses
Latest revision as of 11:07, 5 March 2018
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.
Slides
- Slides in PowerPoint: the stream-and-sort pattern and large-vocabulary naive Bayes
- Slides in PDF
Quiz
Readings for the Class
- Required: my notes on streaming and Naive Bayes
- Optional: If you're interested in reading more about smoothing for naive Bayes, I recommend this paper: Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. "Augmenting naive Bayes classifiers with statistical language models." Information Retrieval 7.3 (2004): 317-345.
Things to Remember
- What TFIDF weighting is and how to compute it
  - Computing DFs (document frequencies) requires an extra pass over the training set
- How TFIDF weighting is used in Rocchio
- Zipf's law and the prevalence of rare features/words
- Communication complexity
- Stream and sort
- Complexity of merge sort
- How pipes implement parallel processing
- How buffering output before a sort can improve performance
- How stream-and-sort can implement event-counting for naive Bayes
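As a reminder of why TFIDF needs two passes, here is a minimal sketch in Python. It assumes one common variant of the weighting, log(TF + 1) * log(N / DF); the exact formula used in the slides may differ. The function names and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """First pass: for each word, count the documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() so a word repeated inside one document is counted once
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Second pass: weight one document's words.

    Assumed variant: log(tf + 1) * log(n_docs / df).
    """
    tf = Counter(tokens)
    return {
        word: math.log(count + 1) * math.log(n_docs / df[word])
        for word, count in tf.items()
        if df[word] > 0  # skip words never seen in the first pass
    }


# Toy training set (hypothetical data).
docs = [
    ["the", "cat", "cat"],
    ["the", "dog"],
    ["the", "cat", "fish"],
]
df = document_frequencies(docs)        # pass 1 over the training set
vec = tfidf_vector(docs[0], df, len(docs))  # pass 2, one document at a time
```

A word that occurs in every document (here "the") gets weight zero under this variant, because log(N / DF) = log(1) = 0. The DF table has to be complete before any document can be weighted, which is where the extra pass comes from.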
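The last few items in the list above (stream and sort, buffering output before the sort, and event counting for naive Bayes) can be sketched together in Python. This is an in-memory illustration under stated assumptions, not the code from the lecture: the event-key format (`Y=label,W=word`), the function names, and the `buffer_size` parameter are all hypothetical, and `sorted()` stands in for an external sort over a stream that would not fit in memory.

```python
from collections import Counter
from itertools import groupby


def emit_events(examples, buffer_size=1000):
    """Stream over training examples and yield (event, count) messages.

    Counts are pre-aggregated in a bounded buffer and flushed when it
    fills up, so fewer messages reach the sort.
    """
    buffer = Counter()
    for label, tokens in examples:
        buffer[f"Y={label}"] += 1  # hypothetical key format
        for word in tokens:
            buffer[f"Y={label},W={word}"] += 1
        if len(buffer) >= buffer_size:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()  # flush whatever is left


def sum_sorted(messages):
    """Stream over sorted messages, summing counts of adjacent equal keys."""
    for event, group in groupby(messages, key=lambda m: m[0]):
        yield event, sum(count for _, count in group)


def count_events(examples, buffer_size=1000):
    # sorted() is a stand-in for an external sort of the message stream
    messages = sorted(emit_events(examples, buffer_size))
    return dict(sum_sorted(messages))


# Toy labeled documents (hypothetical data).
examples = [
    ("sports", ["ball", "ball", "goal"]),
    ("politics", ["vote", "ball"]),
    ("sports", ["goal"]),
]
small = count_events(examples, buffer_size=2)
large = count_events(examples, buffer_size=1000)
```

The buffer size changes how many messages are emitted but not the final counts. Only the summing step depends on sorted input: after sorting, all messages for one event are adjacent, so the counter needs memory for just one key at a time.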