Difference between revisions of "Class meeting for 10-605 Overview"

From Cohen Courses
#REDIRECT [[Class meeting for 10-605 in Fall 2016 Overview]]
 
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall 2016]].
 
=== Slides ===
 
* [http://www.cs.cmu.edu/~wcohen/10-605/overview.pptx Slides in Powerpoint]
* [http://www.cs.cmu.edu/~wcohen/10-605/overview.pdf Slides in PDF]
 
=== Homework ===
 
* Before the next class: review your probability! You should be familiar with the material in these lectures:
** [https://mediatech-stream.andrew.cmu.edu/Mediasite/Play/9e04feebd4bb4900a8c828388be620d91d?catalog=81e613d0-fda8-47a4-8340-86b96d5a3cbb my overview lecture from 10-601] (lecture from 1-13-2016)
** [https://mediatech-stream.andrew.cmu.edu/Mediasite/Play/e99b074dadb24a11a68b6dae418ac9a91d?catalog=81e613d0-fda8-47a4-8340-86b96d5a3cbb first 20 minutes of the second overview lecture for 10-601] (lecture from 1-16-2016, up to the 'joint distribution' section)
The slides used in these lectures are [[10-601_Introduction_to_Probability|posted here]], along with some review notes on what is covered.
 
After each lecture in this class there will be a quiz.
* Today's quiz: [https://qna-app.appspot.com/edit_new.html#/pages/view/aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgIDQqdaqCQw]
 
=== Readings for the Class ===
 
* [http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/35179.pdf The Unreasonable Effectiveness of Data] - Halevy, Pereira, Norvig
 
=== Also discussed ===
* [http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-93.ps William W. Cohen (1993): Efficient pruning methods for separate-and-conquer rule learning systems in IJCAI 1993: 988-994]
* [http://www-2.cs.cmu.edu/~wcohen/postscript/ml-95-ripper.ps William W. Cohen (1995): Fast effective rule induction in ICML 1995: 115-123.]
* [http://dl.acm.org/citation.cfm?id=1073017&bnc=1 Scaling to very very large corpora for natural language disambiguation], Banko & Brill, ACL 2001
 
=== Things to remember ===
 
* Why use big data?
** Simple learning methods with large datasets can outperform complex learners with smaller datasets
** The ordering of learning methods, best-to-worst, can be different for small datasets than for large datasets
** The best way to improve performance for a learning system is often to collect more data
** Large datasets often imply large classifiers
 
* Asymptotic analysis
** It measures the number of operations as a function of problem size
** Different operations (e.g., disk seeking, scanning, memory access) can have very different costs
** Disk access is cheapest when you scan sequentially
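
The asymptotic-analysis points above can be illustrated with a rough cost model. This is only a sketch: the constants <code>SEEK_SECONDS</code> and <code>SCAN_BYTES_PER_SECOND</code> and both function names are hypothetical, chosen as round numbers for illustration and not taken from the lecture or measured on any real disk. Both access patterns touch the same number of records, so they have the same asymptotic operation count, but the per-operation cost differs by orders of magnitude.

```python
# Illustrative cost model. The constants are assumed round numbers,
# not measurements from the course or from any particular disk.
SEEK_SECONDS = 0.01            # assumed cost of one disk seek
SCAN_BYTES_PER_SECOND = 100e6  # assumed sequential read throughput


def sequential_scan_seconds(n_records: int, record_bytes: int) -> float:
    """One seek to the start of the file, then read every record in order."""
    return SEEK_SECONDS + n_records * record_bytes / SCAN_BYTES_PER_SECOND


def random_access_seconds(n_records: int, record_bytes: int) -> float:
    """One seek per record, plus the time to transfer that record."""
    per_record = SEEK_SECONDS + record_bytes / SCAN_BYTES_PER_SECOND
    return n_records * per_record


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        seq = sequential_scan_seconds(n, record_bytes=100)
        rnd = random_access_seconds(n, record_bytes=100)
        print(f"n={n}: sequential {seq:.3f}s, random {rnd:.3f}s")
```

Under these assumed constants, both functions grow linearly in the number of records, yet the random-access estimate is dominated by seek time while the sequential estimate is dominated by transfer time.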

Revision as of 15:57, 10 August 2017
