Final Project

IDS:705 Principles of Machine Learning

Summary and Goals

Machine learning tools are not an end in themselves. They yield value when they help us make predictions, quantify and describe phenomena in the world around us, and, in these ways and more, make decisions that would otherwise be difficult or impossible. For this final project, you will work in teams to (1) identify a problem to solve or a question to answer, (2) apply machine learning techniques to conduct experiments to address the issues identified in (1), (3) rigorously evaluate the performance of your approach, and (4) clearly communicate your findings to a wide audience. The deliverables for this project are:

  1. Project proposal
  2. Final written report and a draft report prior to final submission
  3. Presentation. During our final class meeting we will have a project showcase and competition.
  4. GitHub repository for your project
  5. Peer evaluation

Other topics related to the project that are described in this document include:

  • Learning objectives
  • Submission, evaluation, & grading
  • Project ideas
  • Frequently asked questions

Learning Objectives

This project is an opportunity to identify and deeply explore a question or problem of your choosing, using machine learning tools. A central component of your project must be a machine learning methodology. It does not have to be one that we’ve explicitly discussed in class; you’re welcome to use the project as an opportunity to learn new topics. However, there should be a supervised learning component to your project. The objectives of this project are to…

  1. Develop deeper competency in applying machine learning methods in practical applications
  2. Gain experience learning about a topic beyond what was explicitly discussed in class, building on the foundation you have developed throughout the course, which enables you to learn about other machine learning concepts
  3. Increase your experience with collaborative data science workflows
  4. Expand your data science portfolio

In this project you will use what you’ve learned throughout this course and build on that knowledge and experience to apply the paradigms, algorithms, evaluation tools, and interpretation techniques discussed throughout the course. I strongly encourage you to pick a project that is of genuine interest to you in some way (e.g. the application, the tools, the dataset, etc.). Learning comes from stretching yourself: this requires that you push yourself into some unfamiliar territory, which is often a challenge and leads to desirable difficulty. The best learning happens through this struggle, but it requires perseverance, and perseverance is easiest to sustain when you bring intrinsic motivation to the challenge. Find a topic of interest and embrace the challenge!

For this project you will identify a problem you wish to solve using machine learning tools. Identify the experiment you would need to run to evaluate how well you solved it compared to existing approaches in the field, including which metrics you will use to evaluate performance.
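As a minimal sketch of what such an experiment can look like, the snippet below scores a model against a simple majority-class baseline on the same held-out labels. The labels and the `model_predictions` list are toy values made up for illustration, and accuracy is only one possible metric; your project should use the baselines and metrics that are standard for your problem.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n_test):
    """Predict the most common training label for every test example."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_test


# Toy labels, made up for illustration only.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1]
model_predictions = [0, 1, 0, 1, 1]  # stand-in for your model's test-set output

baseline_predictions = majority_baseline(y_train, len(y_test))
baseline_acc = accuracy(y_test, baseline_predictions)
model_acc = accuracy(y_test, model_predictions)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

Reporting both numbers side by side makes it clear how much of the performance comes from the model rather than from class imbalance in the data.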

Requirements

  • The project must involve supervised machine learning. You may include other concepts as well, including ones we were not able to cover in the course, but there should be a supervised learning component.
  • The project must be able to be completed within the course of this semester and should be scoped correctly: we encourage you to be ambitious, but please visit office hours if you have questions about project scope.
  • Every project should involve reading about both your application domain and the methods that you’re using. For example, a project on genetics should involve learning about and understanding the genetics concepts that you’re using. You’re expected to develop some domain knowledge related to your problem and demonstrate that in the report.

Proposal

Your team will submit a short project proposal. You will receive feedback that should be used to guide your project development and execution. There are no length requirements on the proposal, but 2 pages should typically be sufficient. Every proposal should have the title of the project and the list of team members at the top of the first page.

You can find the project proposal template and instructions here. You are required to use the template for your proposal so that we can provide comments in Google Docs. Please read through and discuss the different points mentioned in the template prior to submission.

Additionally, content from your proposal may be reused in your draft/final report and so you’re encouraged to invest in it with that in mind.

If you are looking for ideas about datasets, etc., please see the Ideas section below. Please stop by office hours if you would like to discuss specific project ideas or for any other help in selecting your project idea.

Final Report

You will submit the project report in two stages: (1) a draft project report and (2) a final report. The draft project report is your main opportunity to get detailed feedback on your report. While the draft report won’t be graded, we will provide written feedback and suggestions in the form of Google Docs comments, which we strongly recommend addressing in your final report.

Please find the instructions and template for the final report here.

Presentation

You will also make a 3-minute presentation (strictly enforced) summarizing your project. This presentation should be visually compelling and should not miss the “forest for the trees” – don’t get lost in technical details. Imagine your aunt and uncle watching your presentation – would they know what is going on? Would they find it approachable and engaging? For inspiration for what makes an approachable discussion of a machine learning project, watch videos from the following series:

  • Two Minute Papers by Károly Zsolnai-Fehér. Concise 1-4 minute summaries of cutting edge research papers.
  • 3Blue1Brown by Grant Sanderson. Mathematical concepts conveyed clearly, intuitively, and visually.

Be sure to practice your presentation and ask your friends (especially those who may not be as technically inclined) for feedback. Do they think it was engaging and easy to follow? Ask them for their takeaways: did they get the message you were trying to communicate? Address their feedback to help ensure the quality of your presentation. You must create your presentation in Google Slides and submit the link to it by 5pm the night before the showcase.

GitHub Repository

Your GitHub repository should (a) contain a descriptive README.md file that explains what the repo is for and how to use the code to reproduce your work (including how to set it up to run), (b) be well commented throughout all files, (c) list all dependencies in a requirements.txt file, (d) tell the user how to get the data and include all preprocessing code, and (e) actually run (i.e. we can successfully test it) and do what it says.

Also include a copy of your final report in the repository, and link to your project video from the README.md file.
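Before submitting, a quick automated check can catch missing files. The sketch below is only an illustration: the helper name `missing_repo_files` is hypothetical, and it checks just for the two files named above (README.md and requirements.txt). It does not verify that the code is commented or that it runs.

```python
from pathlib import Path

# File names taken from the repository requirements described above.
REQUIRED_FILES = ("README.md", "requirements.txt")


def missing_repo_files(repo_root, required=REQUIRED_FILES):
    """Return the required file names that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_file()]


# Check the current directory; an empty list means both files are present.
print(missing_repo_files("."))
```

The most reliable test of item (e) is still to clone the repository into a fresh environment, install from requirements.txt, and follow your own README.md step by step.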

Peer Evaluation

Since this is a team project, you will also receive feedback from your teammates AND reflect on your own performance in a self-evaluation. You will be evaluating your fellow team members on the following criteria:

  1. Was dependable in attending meetings to work on the project
  2. Did work accurately and completely
  3. Completed work on time
  4. Contributed positively to team discussions
  5. Helped others when needed
  6. Responded to communications in a timely manner
  7. Treated other team members respectfully
  8. Demonstrated a positive attitude about the team and its work

Your grade for this component is NOT based directly on the scores that you receive in the feedback. Instead, a satisfactory peer and self-evaluation is assessed based on how constructive the feedback you provide is. More detailed, constructive feedback is more useful in helping your peers understand their strengths and areas for growth. Doing so respectfully and compassionately is a requirement. Your peers will receive anonymized versions of the feedback that you share.

Submission, Evaluation, & Grading

You should submit each deliverable from your project through Gradescope. You will submit a link to each team deliverable. Each deliverable should be submitted AS A TEAM, not through individual submissions (points will be deducted if this is not followed). The project proposal and draft final report should be submitted through Gradescope as links to Google Docs (so that we can attach easy-to-respond-to comments) using the templates provided. The links to the presentation slides and GitHub repo should also be submitted via Gradescope. The final project report, however, should be submitted as a PDF document in Gradescope.

The grading for this project will be assigned as follows:

| Component | Evaluation / Feedback Plan |
|---|---|
| Presentation | 5 points, graded |
| Final Report | 20 points, graded |
| Team Proposal | Written feedback will be provided to help guide your project design.** |
| Draft Final Report | Written feedback will be provided to help guide your final report writing.** |
| GitHub Repository | Required for project submission to be considered complete.** |
| Peer Evaluation | Required for project submission to be considered complete.** |
| Total | 25 points |

** No points will be directly assigned. One point will be deducted from your overall final project score for each day late; up to 2 points may be deducted from the overall project score (out of 25 possible points) if the deliverable is unsatisfactory, i.e. if it does not represent a serious effort towards the deliverable.

Ideas

  • Reproduce the work of a published study and build on it. Reproducing the results of a journal article can be a great way to dive into advanced material. The goal for a project like this would be to reproduce the study and build on it in some way: test a new hypothesis, adjust the methodology, or try it on other data that may present new and interesting challenges. Reproducing papers can be hard, so you’ll want to choose wisely and make clear what your innovation will be. As a starting place, you can explore Papers with Code, which typically lists papers for which both the code and the data are shared, often making it simpler to reproduce the work. This is the recommended project type for teams in doubt.
  • Participate in an active machine learning competition. Online machine learning competitions are sponsored by organizations with enough interest in a problem that they are investing prize money in finding a solution. Examples of competition platforms include Kaggle, DrivenData, Zindi, AIcrowd, etc. If you choose to participate in a competition, it must be an active competition where your team can compete; it cannot be a “sample” competition that is only for learning to use the platform (e.g. the Kaggle Titanic competition). You will want to learn about the application domain. Note: these competitions often do the hard work of data preprocessing for you, so you are expected to generate a competitive submission. It doesn’t need to top the leaderboard, but you should be in the upper quartile of competitors.
  • Design your own project based on a question, e.g. how well can buildings be detected in satellite imagery across diverse geographies? Satellite imagery is enabling us to create functional maps of the world based on the content in the images. Automating building identification could help map global population and analyze global population growth in real time. However, different parts of the world look different: forests, deserts, plains, etc. This may impact the ability to train an algorithm on one location and test it on another. This project uses the INRIA building dataset to investigate the impact of different geographies on the performance of building detection and segmentation techniques using satellite imagery.
  • Build your own tool. Great value can come from making a tool available for use, but building the infrastructure is a challenge. You may want to create a chatbot that creates poetry based on themes that you feed in, or design a search tool that scans satellite data of the Earth for signs of natural disasters. The key here is that your tool will need to be functional and usable by your target audience.
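For the building-detection idea above, the core experimental design is to train on some geographies and test on a geography the model has never seen. A minimal sketch of that split is shown below; the `regions` labels and the helper name `leave_one_region_out` are hypothetical, and scikit-learn's `LeaveOneGroupOut` offers the same kind of grouped split if you prefer a library implementation.

```python
def leave_one_region_out(regions):
    """Yield (held_out_region, train_indices, test_indices) for each region.

    Each region is held out once as the test set while all other
    regions form the training set.
    """
    for held_out in sorted(set(regions)):
        train_idx = [i for i, r in enumerate(regions) if r != held_out]
        test_idx = [i for i, r in enumerate(regions) if r == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical region label for each image tile, for illustration only.
regions = ["forest", "desert", "forest", "plains", "desert", "plains"]

splits = list(leave_one_region_out(regions))
for held_out, train_idx, test_idx in splits:
    print(f"test on {held_out}: train={train_idx} test={test_idx}")
```

Comparing the held-out-region score with the score from an ordinary random split gives a direct estimate of how much performance is lost when the model is moved to a new geography.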