Class meeting for 10-405 Streaming Naive Bayes

From Cohen Courses

Latest revision as of 11:07, 5 March 2018

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.

Slides

Quiz

Readings for the Class

  • Required: my notes on streaming and Naive Bayes
  • Optional: If you're interested in reading more about smoothing for naive Bayes, I recommend this paper: Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. "Augmenting naive Bayes classifiers with statistical language models." Information Retrieval 7.3 (2004): 317-345.

Things to Remember

  • What TFIDF weighting is and how to compute it
    • Computing DFs requires an extra pass over the training set
  • How TFIDF is used in Rocchio
  • Zipf's law and the prevalence of rare features/words
  • Communication complexity
  • Stream and sort
    • Complexity of merge sort
    • How pipes implement parallel processing
    • How buffering output before a sort can improve performance
    • How stream-and-sort can implement event-counting for naive Bayes
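
As a rough illustration of the last point, here is a minimal sketch of stream-and-sort event counting for naive Bayes. It is not taken from the course notes: the event-key format (`Y=label,W=word`), the function names, and the toy examples are all assumptions made for illustration, and the in-memory `sorted()` call stands in for an external merge sort.

```python
from itertools import groupby


def emit_events(examples):
    """Phase 1: stream over (label, tokens) pairs and emit one (event, 1) message per event.

    The key strings used here are a hypothetical format, not the course's.
    """
    for label, tokens in examples:
        yield ("Y=*", 1)                      # one more training example
        yield (f"Y={label}", 1)               # one more example with this label
        for tok in tokens:
            yield (f"Y={label},W={tok}", 1)   # this word seen under this label
            yield (f"Y={label},W=*", 1)       # any word seen under this label


def sum_sorted(messages):
    """Phase 3: messages arrive sorted by key, so only one running total is held at a time."""
    for key, group in groupby(messages, key=lambda kv: kv[0]):
        yield key, sum(n for _, n in group)


def count_events(examples):
    # Phase 2: sorted() is an in-memory stand-in for an external merge sort.
    return dict(sum_sorted(sorted(emit_events(examples))))


# Toy training set (made up for this sketch).
examples = [
    ("sports", ["ball", "goal", "ball"]),
    ("politics", ["vote", "ball"]),
]
counts = count_events(examples)
```

In a pipe-based setting the three phases would typically be separate processes connected by pipes, with the sort in the middle, so that the counting step needs only constant memory. Accumulating partial counts in a small bounded table inside the first phase, and emitting those instead of individual 1s, is one way to shrink the input to the sort.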