Difference between revisions of "Class meeting for 10-605 Overview"

Revision as of 15:57, 10 August 2017

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall 2016.

Why use big data?
- Simple learning methods with large datasets can outperform complex learners with smaller datasets
- The ordering of learning methods, best-to-worst, can be different for small datasets than for large datasets
- The best way to improve performance for a learning system is often to collect more data
- Large datasets often imply large classifiers

Asymptotic analysis
- It measures the number of operations as a function of problem size
- Different operations (e.g., disk seeking, scanning, memory access) can have very different costs
- Disk access is cheapest when you scan sequentially
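
The last two points can be illustrated with a rough cost model. The sketch below is not from the lecture: the seek time, the scan throughput, and the function names are all hypothetical, order-of-magnitude assumptions chosen only to show the shape of the argument. Both access patterns touch each record once, so both are O(n) in operation count, yet the estimated times differ by orders of magnitude because a seek is assumed to cost far more than reading a record sequentially.

```python
# Hypothetical, order-of-magnitude costs; assumptions for illustration only,
# not measurements and not figures taken from the course slides.
SEEK_SECONDS = 0.01              # assumed cost of one random disk seek
SCAN_BYTES_PER_SECOND = 100e6    # assumed sequential read throughput


def sequential_scan_seconds(n_records, record_bytes):
    """One seek to the start of the file, then a single sequential pass."""
    return SEEK_SECONDS + n_records * record_bytes / SCAN_BYTES_PER_SECOND


def random_access_seconds(n_records, record_bytes):
    """One seek per record, plus the time to transfer that record."""
    return n_records * (SEEK_SECONDS + record_bytes / SCAN_BYTES_PER_SECOND)


def report(sizes=(10**3, 10**6, 10**9), record_bytes=100):
    """Return (n, scan_seconds, random_seconds) for each problem size."""
    return [
        (n,
         sequential_scan_seconds(n, record_bytes),
         random_access_seconds(n, record_bytes))
        for n in sizes
    ]


if __name__ == "__main__":
    for n, scan, rand in report():
        print(f"n={n:>13,}  scan={scan:12.2f}s  random={rand:14.2f}s")
```

Under these assumed constants, the gap between the two columns grows with n even though both functions are linear, which is why counting operations alone is not enough when the operations have very different costs.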

@@ Line 1: / Line 1: @@
-#REDIRECT [[Class meeting for 10-605 in Fall 2016 Overview]]
+This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall 2016]].
+=== Slides ===
+* [http://www.cs.cmu.edu/~wcohen/10-605/overview.pptx Slides in Powerpoint]
+* [http://www.cs.cmu.edu/~wcohen/10-605/overview.pdf Slides in PDF]
+=== Homework ===
+* Before the next class: review your probabilities!  You should be familiar with the material in these lectures:
+**  [https://mediatech-stream.andrew.cmu.edu/Mediasite/Play/9e04feebd4bb4900a8c828388be620d91d?catalog=81e613d0-fda8-47a4-8340-86b96d5a3cbb my overview lecture from 10-601 ] (lecture from 1-13-2016)
+** [https://mediatech-stream.andrew.cmu.edu/Mediasite/Play/e99b074dadb24a11a68b6dae418ac9a91d?catalog=81e613d0-fda8-47a4-8340-86b96d5a3cbb first 20 minutes of second overview lecture for 10-601] (lecture from 1-16-2016, up to the 'joint distribution' section)
+The slides used in these lectures are [[10-601_Introduction_to_Probability|posted here]], along with some review notes for what is covered.
+And after each lecture in this class there will be a quiz.
+* Today's quiz: [https://qna-app.appspot.com/edit_new.html#/pages/view/aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgIDQqdaqCQw]
+=== Readings for the Class ===
+* [http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/35179.pdf The Unreasonable Effectiveness of Data] - Halevy, Pereira, Norvig
+=== Also discussed ===
+* [http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-93.ps William W. Cohen (1993): Efficient pruning methods for separate-and-conquer rule learning systems in IJCAI 1993: 988-994]
+* [http://www-2.cs.cmu.edu/~wcohen/postscript/ml-95-ripper.ps William W. Cohen (1995): Fast effective rule induction in ICML 1995: 115-123.]
+* [http://dl.acm.org/citation.cfm?id=1073017&bnc=1 Scaling to very very large corpora for natural language disambiguation], Banko & Brill, ACL 2001
+=== Things to remember ===
+* Why use big data?
+** Simple learning methods with large datasets can outperform complex learners with smaller datasets
+** The ordering of learning methods, best-to-worst, can be different for small datasets than for large datasets
+** The best way to improve performance for a learning system is often to collect more data
+** Large datasets often imply large classifiers
+* Asymptotic analysis
+** It measures the number of operations as a function of problem size
+** Different operations (e.g., disk seeking, scanning, memory access) can have very different costs
+** Disk access is cheapest when you scan sequentially
