This course will begin by exploring the purpose of machine learning through a discussion of the types of questions that machine learning can answer: description of the data (“what has happened?”), prediction based on the data (“what will happen?”), or prescription (“what should happen?”). Then we will discuss the tools at our disposal to answer these questions, namely supervised learning, including classification and regression; unsupervised learning, including clustering and density estimation; and lastly reinforcement learning. There will be a strong focus on how to formulate a machine learning problem. Central to that formulation will be an understanding of data preprocessing (normalization, cleaning, etc.), sampling and dimensionality reduction, feature and model selection, and performance evaluation with cross-validation. The final topic of this course will be a brief overview of state-of-the-art machine learning techniques such as deep learning.
Throughout this course, the focus will be on applying algorithms rather than diving deeply into theory. You will be asked to consider the practical issues of machine learning problem solving: the challenges of applying machine learning code packages, striving for parsimony and interpretability, and ensuring that model assumptions are valid for a given problem and dataset. This course will also stress the importance of team-based collaboration, the value of producing fully reproducible and validated results, and tools that help with both, such as version control and code repositories.
Communicating your results. Data science solutions are only as impactful as the communicator who shares them. Throughout this course, all deliverables will include a Jupyter Notebook. This is an interactive writing and coding tool that allows you to combine formatted text, code, output from code (including plots), and mathematical equations all in one location. Demonstrating competency in data science means (a) exhibiting a working knowledge of technical concepts, including programming, statistics, and mathematics, and (b) being able to clearly communicate the problem you were trying to solve or the question you were trying to answer, why it matters, and how well your analysis worked. You will have opportunities to practice communicating what you learn and, in doing so, develop a stockpile of content that you could use to establish your own data science portfolio (for more ideas on creating a data science portfolio, follow this link).
Academic dishonesty. Adherence to the Duke Community Standard is expected. To uphold the Duke Community Standard:
I will not lie, cheat, or steal in my academic endeavors;
I will conduct myself honorably in all my endeavors; and
I will act if the Standard is compromised.
Anyone found in violation of the Standard will be reported to the Office of Student Conduct.
Accommodations. If you need special accommodations due to physical or learning disabilities, medical needs, religious practices, or other reasons, please inform us as soon as possible so we can work to accommodate those needs.
Late Submissions. Assignments and projects are due by the start of class on the date posted. Late deliverables will ONLY be accepted at the discretion of the instructor. Any late assignment will receive a reduction of at least 20 points. Course projects will not be accepted after the deadline. Quizzes are given at the beginning of each class, and students are expected to be present in class for the quiz. Quizzes cannot be made up, since the answers are discussed immediately afterward; however, each student's two lowest quiz scores will be dropped at the end of the semester to accommodate necessary absences. Please reach out to the TAs or instructor as early as possible to request any special accommodations.
Collaboration. While collaboration is encouraged for the Kaggle competition and final project, assignments should contain independent work. You are welcome to help each other, but your responses and solutions should all be your own on the assignments. No two assignments should have content that is identical, even in part.
Electronic devices in class. While phones or laptops will be used for participating in in-class polls and quizzes and are permissible for note-taking, using these devices for anything other than class business is not allowed. Violation of this policy will result in the loss of class participation points. Devices may NOT be used to assist in finding answers for in-class quizzes; this will be considered an honor code violation.