Course Description

In almost every field, there is a need to draw inferences from or make decisions based on data. The goal of this course is to provide an introduction to machine learning that is approachable to diverse disciplines and empowers students to become proficient in the foundational concepts and tools while working with interdisciplinary real-world data. You will learn to (a) structure a machine learning problem, (b) determine which algorithmic tools are applicable to a given problem, (c) apply those algorithmic tools to diverse, interdisciplinary data examples, (d) evaluate the performance of your solution, and (e) accurately interpret and communicate your results. This course is a fast-paced, applied introduction to machine learning that arms you with the basic skills you will need in practice to both conduct analyses and effectively communicate your results.

Teaching Assistant

Class Time and Location

Monday and Wednesday 10:05am-11:20am
Gross Hall 100C (the Generator)

Office Hours and Email

Kyle Bradbury (kyle.bradbury@duke.edu)
Office: Gross Hall 102L
Hours: Mon 2-4pm, Tues 1:30-2:30pm

Leslie Collins (leslie.collins@duke.edu)
Office: CIEMAS 3461
Hours: Mon 1-2pm, Wed 1-3pm

Bohao Huang (bohao.huang@duke.edu)
Office: CIEMAS 3433
Hours: Tues 3-5pm, Fri 10-11am

Have questions?

We welcome your questions about the course, including lectures, assignments, projects, and logistics, on Piazza. Email the TA or instructor about questions that specifically pertain to you as an individual. Questions posted within 24 hours of a due date are unlikely to get a response, so please plan accordingly.

Grading

Class participation: 5%
In-class quizzes: 10%
Assignments (5 total, each worth 7%): 35%
Kaggle competition: 20%
Final Project: 30%

Assignment and Project Details

Details on assignments and projects will be posted via links on the course syllabus throughout the semester.
For general expectations on the assignments and instructions on how to complete and submit, see the Assignment Instructions. Assignment solutions are posted here.

Textbook and Software

Textbook: An Introduction to Statistical Learning (ISL) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, 2013.
Software: We will use Python 3.x. Our recommendation is to use the Anaconda distribution (which includes packages numpy, scipy, matplotlib, and pandas).

Prerequisites

This course moves quickly, so having a firm grasp on prerequisites is important. The prerequisites are:

Detailed description

This course will begin by exploring the purpose of machine learning through a discussion of the types of questions that machine learning can answer: description of the data – “what has happened?”, prediction based on the data – “what will happen?”, and prescription – “what should happen?” Then we will discuss the tools at our disposal to answer these questions, namely supervised learning, including classification and regression; unsupervised learning, including clustering and density estimation; and lastly reinforcement learning. There will be a strong focus on how to formulate a machine learning problem. Central to that formulation will be developing an understanding of how to preprocess data for analysis (normalization, cleaning, etc.), sampling and dimensionality reduction, feature and model selection, and performance evaluation with cross-validation. The final topic of this course will be a brief overview of state-of-the-art machine learning techniques such as deep learning.
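To make the formulation steps above concrete, here is a minimal sketch of preprocessing (normalization) combined with performance evaluation by k-fold cross-validation, applied to a one-feature regression problem. It is an illustration only, not course material: the synthetic data and the helper names (k_fold_indices, fit_line, cross_validated_mse) are hypothetical, and it uses only the Python standard library so that it runs without additional packages.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_mse(x, y, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(x), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(x)) if i not in held_out_set]
        # Normalization statistics come from the training fold only,
        # so no information leaks from the held-out data.
        train_x = [x[i] for i in train]
        mu = statistics.fmean(train_x)
        sigma = statistics.pstdev(train_x)
        scaled_train_x = [(v - mu) / sigma for v in train_x]
        slope, intercept = fit_line(scaled_train_x, [y[i] for i in train])
        # Evaluate on the held-out fold using the training-fold scaling.
        errors = [
            (y[i] - (slope * (x[i] - mu) / sigma + intercept)) ** 2
            for i in held_out
        ]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)


# Synthetic data (hypothetical): y = 3x + 2 plus unit-variance Gaussian noise.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [3 * xi + 2 + rng.gauss(0, 1) for xi in x]
cv_mse = cross_validated_mse(x, y, k=5)
print(f"5-fold cross-validated MSE: {cv_mse:.3f}")
```

Because the noise in this synthetic example has unit variance, the cross-validated error should land near 1. Computing the normalization statistics inside each fold, rather than once on the full dataset, keeps the held-out data from influencing the model.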

Throughout this course, the focus will be on applying algorithms rather than diving deeply into theory. You will be asked to consider the practical issues of machine learning problem solving: challenges of applying machine learning code packages, striving for parsimony and interpretability, and ensuring model assumptions are valid for a given problem and dataset. This course will also stress the importance of team-based collaboration, the value of producing fully reproducible and validated results, and tools to help with both such as version control and code repositories.

Communicating your results. Data science solutions are only as impactful as the communicator who shares them. Throughout this course, all deliverables will include a Jupyter Notebook. This is an interactive writing and coding tool that allows you to combine formatted text, code, output from code (including plots), and mathematical equations all in one location. Demonstrating competency in data science means (a) exhibiting a working knowledge of technical concepts including programming, statistics, and mathematics and (b) being able to clearly communicate the problem you were trying to solve or question you were trying to answer, why it matters, and how well your analysis worked. You will have opportunities to practice communicating what you learn and in doing so develop a stockpile of content that you could use to establish your own data science portfolio (for more ideas on creating a data science portfolio, follow this link).

Course Policies

Academic dishonesty. Adherence to the Duke Community Standard is expected. To uphold the Duke Community Standard:
I will not lie, cheat, or steal in my academic endeavors;
I will conduct myself honorably in all my endeavors; and
I will act if the Standard is compromised.
Anyone found in violation of the Standard will be reported to the Office of Student Conduct.

Accommodations. If you need special accommodations due to physical or learning disabilities, medical needs, religious practices, or other reasons, please inform us as soon as possible so we can work to accommodate those needs.

Late Assignments. Assignments and projects are due in class on the date posted. Late deliverables will not be accepted and will receive a grade of 0.

Collaboration. While collaboration is encouraged for the Kaggle competition and final project, assignments should contain independent work. Individual deliverables for the Kaggle competition and final project should also be independent work.

Electronic devices in class. While phones or laptops will be used for participating in in-class polls and quizzes and are permissible for notetaking, using these devices for anything other than class business is not allowed. Devices may not be used to assist in finding answers for in-class quizzes. Violation of this policy will result in the loss of class participation points.