Machine Learning with Large Datasets 10-605 in Spring 2013

From Cohen Courses

Instructor and Venue

  • Email and forum:
    • Enrolled students should all be in 10605-announce@cs.cmu.edu, which I'll use for announcements.
    • I'll also post announcements to the Google group machine-learning-with-large-datasets-10-605-in-spring-2013, which is the preferred way of contacting us with course-related questions.

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to give students practical knowledge of, and experience with, the issues involved in working with large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements of learning methods, such as feature hashing and Bloom filters; and techniques for analyzing programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
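
As a rough illustration of one of the memory-reduction techniques mentioned above, here is a minimal sketch of feature hashing in Python. The function name hashed_features, the default bucket count, and the use of MD5 as the hash function are assumptions made for this example; they are not taken from the course materials.

```python
import hashlib


def hashed_features(tokens, num_buckets=16):
    """Map a bag of tokens to a fixed-length count vector (the hashing trick).

    Memory use depends on num_buckets, not on the vocabulary size, at the
    cost of occasional collisions between unrelated tokens.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # Use a stable hash: Python's built-in hash() is salted per process
        # for strings, so it would give different buckets on each run.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_buckets
        vec[index] += 1
    return vec


if __name__ == "__main__":
    doc = "the cat sat on the mat".split()
    print(hashed_features(doc, num_buckets=8))
```

The vector length is fixed in advance, so no dictionary from feature names to indices has to be kept in memory; a larger num_buckets trades memory for fewer collisions.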

The class will include frequent programming assignments and a short, one-month project chosen by the student. The project should be relevant to the course - e.g., comparing the scalability of different learning algorithms on large datasets.

Syllabus

Previous syllabus, for the historically-minded:

Prerequisites

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently, please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Projects

This year I'm offering students two options:

  1. Do a course project in a small group, and at least six assignments.
  2. Skip the project, but do at least nine assignments. Six assignments have been posted; one is posted in draft form (from last year); and three more (for a total of 10) will be posted later in the course.

If you're going to do the project, you need to turn in an initial proposal on March 20, which we will give feedback on; a final proposal two weeks later, on April 3; and a 1-page status report on April 17. On May 3 you will give a short talk and turn in the writeup. Project groups should be 2-3 people. It's fine to advertise on the Google group that you're looking for a teammate. If you're not doing the project, you should submit a statement to that effect instead of the "initial project proposal". Here are some pages to help you get started planning.

Datasets

Some datasets will be provided by the instructors to use in the course.

  • RCV2 - text classification dataset.
  • Wikipedia links - page-page links for Wikipedia.
  • Geographical names and places - data on places from GeoNames, Wikipedia, and geo-tagged Flickr images.
  • NELL all-pairs data - NPs and the contexts they appear in on the web.
  • Google n-grams.
  • ?Million Song Database - audio signatures of songs with tags and meta-data.
  • ?KDD search-engine queries.