Class meeting for 10-605 in Fall 2016 Overview
From Cohen Courses
Revision as of 16:33, 1 August 2017
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2016.
Slides
- Slides in Powerpoint: http://www.cs.cmu.edu/~wcohen/10-605/2016/overview.pptx
- Slides in PDF: http://www.cs.cmu.edu/~wcohen/10-605/2016/overview.pdf
Homework
- Before the next class: watch my overview lecture from 10-601 (lecture 1, and a little of lecture 2) if you need it.
- Today's quiz: [1]
Readings for the Class
- The Unreasonable Effectiveness of Data - Halevy, Pereira, Norvig
Also discussed
- William W. Cohen (1993): Efficient pruning methods for separate-and-conquer rule learning systems in IJCAI 1993: 988-994.
- William W. Cohen (1995): Fast effective rule induction in ICML 1995: 115-123.
- Scaling to very very large corpora for natural language disambiguation, Banko & Brill, ACL 2001
Things to remember
- Why use big data?
- Simple learning methods with large data sets can outperform complex learners with smaller datasets
- The best-to-worst ordering of learning methods can differ between small and large datasets
- The best way to improve performance for a learning system is often to collect more data
- Large datasets often imply large classifiers
- Asymptotic analysis
- It measures number of operations as function of problem size
- Different operations (e.g., disk seeks, sequential scans, memory accesses) can have very different costs
- Disk access is cheapest when you scan sequentially
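The asymptotic-analysis point above can be made concrete by counting operations directly. A minimal sketch (not from the lecture; the function names are my own) that counts comparisons for a linear scan versus a binary search, showing how the two grow very differently with problem size n:

```python
# Sketch: asymptotic analysis as literal operation counting.
# A linear scan does O(n) comparisons; binary search does O(log n).

def linear_scan_ops(xs, target):
    """Return the number of comparisons a linear scan performs."""
    ops = 0
    for x in xs:
        ops += 1
        if x == target:
            break
    return ops

def binary_search_ops(xs, target):
    """Return the number of comparisons binary search performs (xs must be sorted)."""
    lo, hi, ops = 0, len(xs) - 1, 0
    while lo <= hi:
        ops += 1
        mid = (lo + hi) // 2
        if xs[mid] == target:
            break
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return ops

if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        xs = list(range(n))
        # Worst case for the scan: target is the last element.
        print(n, linear_scan_ops(xs, n - 1), binary_search_ops(xs, n - 1))
```

At n = 1,000 the scan does 1,000 comparisons while binary search does about 10; at n = 1,000,000 the gap is a million versus about 20. Note this counts operations only; as the bullet above says, real cost also depends on *which* operations they are (a sequential scan of disk can beat scattered random accesses even when the scan does more work).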