Difference between revisions of "Class meeting for 10-405 Streaming Naive Bayes"

Latest revision as of 11:07, 5 March 2018

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.

Slides

Quiz

  • Today's quiz: https://qna.cs.cmu.edu/#/pages/view/161

Readings for the Class

  • Required: my notes on streaming and Naive Bayes
  • Optional: If you're interested in reading more about smoothing for naive Bayes, I recommend this paper: Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. "Augmenting naive Bayes classifiers with statistical language models." Information Retrieval 7.3 (2004): 317-345.

Things to Remember

  • What TFIDF weighting is and how to compute it
    • Computing document frequencies (DFs) requires an extra pass over the training set
  • How TFIDF weighting is used in Rocchio
  • Zipf's law and the prevalence of rare features/words
  • Communication complexity
  • Stream and sort
    • Complexity of merge sort
    • How pipes implement parallel processing
    • How buffering output before a sort can improve performance
    • How stream-and-sort can implement event-counting for naive Bayes
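
As a reminder of why TFIDF needs two passes, here is a minimal sketch in Python. It assumes one common TFIDF variant, log(TF + 1) * log(N / DF) followed by length normalization; other variants exist and the lecture may use a different one. The function names and the toy documents are hypothetical and only for illustration.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """First pass over the training set: DF(w) = number of documents containing w."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word at most once per document
    return df

def tfidf_vector(words, df, n_docs):
    """Second pass, per document: weight each word, then scale to unit length.

    Assumed variant: log(TF + 1) * log(N / DF). Every word is assumed to have
    been seen in the first pass, so df[w] > 0.
    """
    tf = Counter(words)
    weights = {w: math.log(tf[w] + 1) * math.log(n_docs / df[w]) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return weights  # every word occurs in every document
    return {w: v / norm for w, v in weights.items()}

# Toy training set (hypothetical): each document is a list of tokens.
docs = [["cheap", "pills", "cheap"], ["meeting", "notes"], ["cheap", "meeting"]]
df = document_frequencies(docs)                           # pass 1
vectors = [tfidf_vector(d, df, len(docs)) for d in docs]  # pass 2
```

In the first toy document, "pills" ends up with a larger weight than "cheap" even though "cheap" occurs twice, because "pills" appears in fewer documents.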
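
The stream-and-sort items above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's implementation: the event-name format ("Y=label" and "Y=label,W=word"), the function names, and the buffer_size parameter are all hypothetical, and Python's in-memory sorted() stands in for an external sort over a pipe.

```python
from collections import Counter
from itertools import groupby

def emit_events(examples, buffer_size=1000):
    """Stream over examples and yield (event, count) messages.

    Counts are pre-aggregated in a small buffer so fewer messages reach the
    sort. The buffer is checked after each example, so it can briefly grow
    past buffer_size.
    """
    buffer = Counter()
    for label, words in examples:
        buffer[f"Y={label}"] += 1
        for w in words:
            buffer[f"Y={label},W={w}"] += 1
        if len(buffer) >= buffer_size:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()  # flush whatever is left

def sum_sorted(messages):
    """Stream over sorted messages, keeping only one running total in memory."""
    for event, group in groupby(messages, key=lambda m: m[0]):
        yield event, sum(count for _, count in group)

# Toy labeled examples (hypothetical).
examples = [
    ("spam", ["cheap", "pills", "cheap"]),
    ("ham", ["meeting", "notes"]),
    ("spam", ["cheap"]),
]
messages = sorted(emit_events(examples, buffer_size=2))  # stand-in for the sort step
counts = dict(sum_sorted(messages))
```

Because sorting places all messages for the same event next to each other, the summing step needs constant memory regardless of vocabulary size. The buffer only reduces the number of messages sent to the sort; the final counts are the same with or without it.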