Class meeting for 10-405 Overview
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.
- Before the next class: review your probabilities! You should be familiar with the material in these lectures:
- my overview lecture from 10-601 (lecture from 1-13-2016)
- first 20 minutes of my second overview lecture for 10-601 (lecture from 1-16-2016, up to the 'joint distribution' section)
The slides used in these lectures are posted here, along with some review notes for what is covered.
After each lecture in this class there will be a quiz.
- Today's quiz: 
Readings for the Class
- The Unreasonable Effectiveness of Data - Halevy, Pereira, Norvig
- William W. Cohen (1993): Efficient pruning methods for separate-and-conquer rule learning systems in IJCAI 1993: 988-994
- William W. Cohen (1995): Fast effective rule induction in ICML 1995: 115-123.
- Scaling to very very large corpora for natural language disambiguation, Banko & Brill, ACL 2001
Things to remember
- Why use big data?
- Simple learning methods with large data sets can outperform complex learners with smaller datasets
- The ordering of learning methods, best-to-worst, can be different for small datasets than for large datasets
- The best way to improve performance for a learning system is often to collect more data
- Large datasets often imply large classifiers
- Asymptotic analysis
- It measures the number of operations as a function of problem size
- Different operations (e.g., disk seeking, scanning, memory access) can have very different costs
- Disk access is cheapest when you scan sequentially
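The points about asymptotic analysis and disk access can be made concrete with a small sketch. The Python below first counts operations for a hypothetical all-pairs loop as a function of problem size, then compares two ways of reading the same number of records under a toy cost model. The constants `SEEK_SECONDS` and `SCAN_BYTES_PER_SECOND` and all function names are assumptions chosen for illustration, not measurements from any real disk.

```python
# Toy cost model. The two constants are illustrative assumptions,
# not measurements of any particular hardware.
SEEK_SECONDS = 0.010             # assumed cost of one disk seek
SCAN_BYTES_PER_SECOND = 100e6    # assumed sequential read throughput


def count_pair_comparisons(n):
    """Count the operations of a hypothetical all-pairs loop over n items."""
    ops = 0
    for i in range(n):
        for j in range(i + 1, n):
            ops += 1
    return ops  # equals n * (n - 1) / 2, i.e. quadratic in n


def sequential_read_seconds(num_records, record_bytes):
    """Estimated time for one seek followed by a scan of all records."""
    return SEEK_SECONDS + num_records * record_bytes / SCAN_BYTES_PER_SECOND


def random_read_seconds(num_records, record_bytes):
    """Estimated time when every record needs its own seek."""
    per_record = SEEK_SECONDS + record_bytes / SCAN_BYTES_PER_SECOND
    return num_records * per_record


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "items ->", count_pair_comparisons(n), "pair comparisons")
    for n in (1_000, 1_000_000):
        seq = sequential_read_seconds(n, record_bytes=100)
        rnd = random_read_seconds(n, record_bytes=100)
        print(f"{n} records: sequential ~{seq:.3f}s, random ~{rnd:.1f}s")
```

Both read strategies touch the same number of records, so a pure operation count treats them alike. Under the assumed constants, the per-record seek dominates the random-access estimate, which is why the cost of each kind of operation matters alongside the operation count.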