Class meeting for 10-405 Streaming Naive Bayes

From Cohen Courses

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.



Readings for the Class

  • Required: my notes on streaming and Naive Bayes
  • Optional: If you're interested in reading more about smoothing for naive Bayes, I recommend this paper: Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. "Augmenting naive Bayes classifiers with statistical language models." Information Retrieval 7.3 (2004): 317-345.

Things to Remember

  • What TFIDF weighting is and how to compute it
    • Computing DFs requires an extra pass over the training set
  • How TFIDF weighting is used in Rocchio
  • Zipf's law and the prevalence of rare features/words
  • Communication complexity
  • Stream and sort
    • Complexity of merge sort
    • How pipes implement parallel processing
    • How buffering output before a sort can improve performance
    • How stream-and-sort can implement event-counting for naive Bayes
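As a rough illustration of the last point, here is a minimal sketch of stream-and-sort event counting for naive Bayes. It is not taken from the course notes: the message keys (`Y=*`, `Y=label`, `Y=label,W=word`), the function names, and the toy documents are all assumptions made for the example, and the in-memory `sorted()` call stands in for an external Unix `sort` between two streaming processes.

```python
from itertools import groupby


def emit_events(docs):
    """Stream over (label, text) pairs and yield one (key, 1) message per event.

    The key format is a hypothetical convention chosen for this sketch.
    """
    for label, text in docs:
        yield ("Y=*", 1)                # total number of documents
        yield (f"Y={label}", 1)         # documents with this label
        for word in text.split():
            yield (f"Y={label},W={word}", 1)  # word occurrences under this label
            yield (f"Y={label},W=*", 1)       # total words under this label


def sum_sorted(messages):
    """Sum counts for runs of identical keys.

    Only correct when the input is sorted by key; memory use is constant
    because only the current key and its running total are held.
    """
    for key, group in groupby(messages, key=lambda kv: kv[0]):
        yield key, sum(n for _, n in group)


def count_events(docs):
    # sorted() is a stand-in for piping the messages through Unix sort.
    messages = sorted(emit_events(docs))
    return dict(sum_sorted(messages))


# Toy data, invented for illustration only.
docs = [("sports", "ball game ball"), ("politics", "vote game")]
counts = count_events(docs)
```

The design point is that neither the emitting step nor the summing step needs a hash table of all events in memory; the sort does the work of bringing equal keys together, so the counter only ever looks at adjacent messages.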