This course begins by exploring the purpose of machine learning through the types of questions it can answer: description of the data (“what has happened?”), prediction based on the data (“what will happen?”), and prescription (“what should happen?”). We will then discuss the tools at our disposal to answer these questions: supervised learning, including classification and regression; unsupervised learning, including clustering and density estimation; and reinforcement learning. There will be a strong focus on how to formulate a machine learning problem. Central to that formulation is understanding how to preprocess data for analysis (normalization, cleaning, etc.), apply sampling and dimensionality reduction, select features and models, and evaluate performance with cross-validation. The final topic of this course will be a brief overview of state-of-the-art machine learning techniques such as deep learning.
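To make the formulation steps above concrete, here is a minimal sketch of normalization combined with k-fold cross-validation, written in plain Python. It is an illustration only, not course-provided code: the function names (k_fold_indices, fit_scaler, transform, predict_1nn, cross_validate), the toy dataset, and the choice of a 1-nearest-neighbor classifier are all assumptions made for this example. The point it tries to show is that the normalization statistics are computed from the training folds only and then applied to the held-out fold.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from the given rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, scaler):
    """Standardize rows using previously fitted means and deviations."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def predict_1nn(train_X, train_y, query):
    """Return the label of the closest training row (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_X]
    return train_y[distances.index(min(distances))]


def cross_validate(X, y, k=5, seed=0):
    """Return the accuracy measured on each of the k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        # Fit the scaler on training rows only, then apply it to both splits.
        scaler = fit_scaler([X[i] for i in train_idx])
        train_X = transform([X[i] for i in train_idx], scaler)
        train_y = [y[i] for i in train_idx]
        test_X = transform([X[i] for i in held_out], scaler)
        correct = sum(
            predict_1nn(train_X, train_y, q) == y[i]
            for q, i in zip(test_X, held_out)
        )
        scores.append(correct / len(held_out))
    return scores


# Hypothetical toy data: feature 0 separates the two classes, while
# feature 1 is uninformative but sits on a much larger scale.
X = [[0.1 * i, 1000.0 * (i % 2)] for i in range(10)]
X += [[5 + 0.1 * i, 1000.0 * (i % 2)] for i in range(10)]
y = [0] * 10 + [1] * 10

scores = cross_validate(X, y, k=5)
print("fold accuracies:", scores)
print("mean accuracy:", statistics.fmean(scores))
```

Fitting the scaler inside the loop, rather than once on the full dataset, keeps information from the held-out fold out of the preprocessing step, so the reported accuracy is a fairer estimate of performance on unseen data.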
Throughout this course, the focus will be on applying algorithms rather than diving deeply into theory. You will be asked to consider the practical issues of machine learning problem solving: the challenges of applying machine learning code packages, striving for parsimony and interpretability, and ensuring model assumptions are valid for a given problem and dataset. This course will also stress the importance of team-based collaboration, the value of producing fully reproducible and validated results, and tools that help with both, such as version control and code repositories.
Communicating your results. Data science solutions are only as impactful as the communicator who shares them. Throughout this course, all deliverables will include a Jupyter Notebook. This is an interactive writing and coding tool that allows you to combine formatted text, code, output from code (including plots), and mathematical equations all in one location. Demonstrating competency in data science means (a) exhibiting a working knowledge of technical concepts including programming, statistics, and mathematics and (b) being able to clearly communicate the problem you were trying to solve or question you were trying to answer, why it matters, and how well your analysis worked. You will have opportunities to practice communicating what you learn and, in doing so, develop a stockpile of content that you could use to establish your own data science portfolio (for more ideas on creating a data science portfolio, follow this link).
Academic dishonesty. Adherence to the Duke Community Standard is expected. To uphold the Duke Community Standard:
I will not lie, cheat, or steal in my academic endeavors;
I will conduct myself honorably in all my endeavors; and
I will act if the Standard is compromised.
Anyone found in violation of the Standard will be reported to the Office of Student Conduct.
Accommodations. If you need special accommodations due to physical or learning disabilities, medical needs, religious practices, or other reasons, please inform us as soon as possible so we can work to accommodate those needs.
Late Assignments. Assignments and projects are due in class on the date posted. Late deliverables will not be accepted and will receive a grade of 0.
Collaboration. While collaboration is encouraged for the Kaggle competition and final project, assignments should contain independent work. Individual deliverables for the Kaggle competition and final project should also be independent work.
Electronic devices in class. Phones and laptops will be used for in-class polls and quizzes and are permitted for note-taking, but using these devices for anything other than class business is not allowed. Devices may not be used to assist in finding answers for in-class quizzes. Violation of this policy will result in the loss of class participation points.