Final Project

Summary and Goals

For this final project, you will work in teams to (1) identify a problem to solve or a question to answer, (2) apply machine learning techniques to conduct experiments to address the issues identified in (1), (3) rigorously evaluate the performance of your approach, and (4) clearly communicate your findings to a wide audience. The deliverables for this project are a final report, a presentation (in the form of a video), and a GitHub repository for your project. Your team will submit a project proposal and one progress report at the midpoint of your project. During our final class meeting we will have a project video showcase and competition.

Every project must have a supervised learning component to it. You are encouraged to also have unsupervised learning components for exploring your datasets.

Contents

Learning objectives
Proposal
Progress Report
Final Report
Video
Roles & Responsibilities
Evaluation & Grading
Peer Evaluation
Ideas for datasets

Learning Objectives

This project is an opportunity to identify and deeply explore a question or problem of your choosing using machine learning tools, and to push yourself and your team to develop innovative applications of those tools. The objectives of this project are to:

  1. Develop deeper competency in applying machine learning methods in practical applications
  2. Increase your experience with collaborative data science workflows
  3. Expand your data science portfolio

First, this project is an opportunity to use what you've learned throughout this course and apply the paradigms, algorithms, evaluation tools, and interpretation techniques discussed to a meaningful problem, with you and your team guiding the project development. Second, data science often occurs in a team setting, so this project gives you experience working on a team and developing collaborative workflows you can speak to in your next interview. Third, this is meant to serve as an entry in your professional data science portfolio to help ready you for career opportunities ahead.

Proposal

Your team will submit a short project proposal. Your project proposal will be up to 3 pages and will include the following:

  1. Title for the project, assigned team number, and the names of each team member
  2. What is the problem you're trying to solve or question you're trying to answer?
  3. Why is this interesting or worth pursuing?
  4. Which dataset(s) do you plan to use for this project? (please include links)
  5. What is your proposed machine learning approach and how will you evaluate your performance?
  6. What roles will each team member be taking on for the three categories listed (report writing, project management, and machine learning contributions)? Each team member should have 3 roles articulated.
  7. Include a list of references that you plan to read to further your knowledge of this problem
  8. Weekly meeting time for your project: what day and time will you meet? You will be asked to report a list of dates and times that you have met in your project timeline, so please keep track of these meetings.

If you’re looking for inspiration on how to select a project idea, please see the Ideas section. Please stop by office hours if you’d like to discuss specific project ideas or for any other help in selecting your project idea.

Progress Report

Halfway through the allotted time for your project, your team will submit a progress report to provide an update on the status of the project and describe any obstacles encountered that may have adjusted the project trajectory. This document should be between 1000 and 2000 words and may be no longer than 5 pages. This report should include the following content:

  1. Title for the project, assigned team number, and the names of each team member
  2. Briefly describe the goals of the project (include a short reminder of the dataset, methods, and performance evaluation you're proposing - this should cover all of the proposal questions, but more succinctly and updated with the knowledge you've gained working on the project so far)
  3. Describe the results of your initial work. This should include relevant figures, tables, and/or other methods of conveying performance evaluation. ROC and PR curves should be present if these were proposed methods of performance evaluation.
  4. Are there any major challenges you are facing in this project? Please mention those challenges and how you plan to overcome them or rescope the goals of the project to accommodate them.
  5. A list of references that you have read and those you plan to read to deepen your knowledge of this problem
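
For teams that proposed ROC and PR curves, the following is a minimal sketch of how the curve points and their summary metrics might be computed. It assumes scikit-learn is available; the labels and scores are hypothetical stand-ins for your own held-out labels and predicted scores.

```python
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

# Hypothetical held-out labels and predicted scores for the positive class.
# Replace these with the outputs of your own validation procedure.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points along each curve, ready to pass to a plotting library.
fpr, tpr, _ = roc_curve(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)

# Single-number summaries of each curve.
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

The `(fpr, tpr)` and `(recall, precision)` arrays are what you would plot; reporting the summary metric alongside each curve makes it easier to compare experimental conditions.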

Final Report

The final project report that you submit will consist of two parts: (1) a written project report and (2) a 5 minute video communicating the key takeaways from your project.

  1. Header. Include your title, team number, and the names of each team member
  2. Abstract. [150 words maximum] This should be the one paragraph that captures the significance of what you did and why you did it.
  3. Introduction. Provide a description of the problem and the value in finding a solution, and motivate your reader as to why they should care about this question. The idea is to get your reader excited about the solution you are about to present.
  4. Background. This section should cite problems that have been previously addressed that relate to your work, and the key takeaways of the studies that explored that work. The idea here is to place the problem you’re working on in context and to let the reader know that you’re not working in a knowledge vacuum. For finding relevant literature, a good starting point is Google Scholar.
  5. Data. Describe and visualize your data. Make sure every caption fully describes the figure. You may want to visualize the raw data and/or extracted features. What challenges are inherent to this problem? How might they be overcome? What take away messages can you get simply from visualizing your data?
  6. Methods. Present your machine learning solution (a description of any preprocessing, feature extraction, classification/regression techniques) and why you made each of the choices you did. Discuss any methods that you didn’t create yourself and please cite relevant literature to support your claims. Also include a flow chart of your methodology so the reader can easily conceptualize your solution. The flow chart of the overall experimental design should clearly articulate your process (example). Additionally, if there are multiple experimental conditions or applications, each should be represented in your flowchart. Describe your approach to measuring generalization performance, what metric(s) you used, and why. Imagine that you are writing this section so that someone could recreate your results.
  7. Results. Include a complete performance assessment that includes your validation approach (cross validation, train/validate/test split, etc.) and the key metrics of performance for the problem (ROC curves, PR curves, confusion matrices if applicable, etc.). You should also compare your outcomes to at least one baseline model (a simple model to exhibit your improvements) in addition to comparison against random chance guessing in the classification setting. This section should be supported with visualizations including examples where your method worked well, examples where it failed, and hypotheses supported by evidence as to why in each case.
  8. Conclusions. It’s critical to have a strong ending and not just let the energy fizzle out of the report. Many readers, if pressed for time, will simply read your abstract and your conclusions. In fact, you may want to start by writing your conclusions. Very succinctly recap the problem you were studying and what was your approach to the solution. Focus on explaining the key takeaways from your work - these should not be merely a set of bullet points, but fleshed out conclusions. As you're writing your conclusions think about if the reader took nothing else away from reading your report, what would you want them to know most? Did you identify one particular approach that worked well? Was there a challenge that you faced that opens the door to working on solving a new problem? What avenues of research would you pursue next?
  9. Roles. Since this is a team project, we want to know what your specific contribution was to this project. Provide detail on your individual role and how it contributed to the competition. Each team member should clearly articulate an individual role.
  10. Timeline of activity. This timeline articulates the various milestones for your project and demonstrates consistent effort over the course of the project from the time of the proposal through submission. This can be a bulleted list with milestones and completion dates and should simply be a compiled list from the individual in the project coordinator role collected from weekly meetings.
  11. References [no word limits]. An alphabetical list of references cited in this work. A minimum of 15 are required. Consider using the Zotero citation manager for collecting and compiling your references. These should primarily be research papers and technical reports.
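
The comparison against a baseline and against chance described under Results could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses scikit-learn, a synthetic dataset from `make_classification` in place of your data, a class-prior `DummyClassifier` as the chance reference, and logistic regression as a hypothetical simple baseline. Your own model would be added as another entry in `models`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration only).
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=0,
)

# The same folds are reused for every model so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # Always predicts the class prior: the chance-level reference.
    "chance": DummyClassifier(strategy="prior"),
    # A simple model that a proposed approach should improve on.
    "baseline": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

mean_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting preprocessing inside the pipeline means the scaler is fit only on the training folds, which avoids leaking information from the validation folds into the reported metric.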

You will submit your report via Gradescope and should meet the following requirements

Video

You will also submit an up-to-5 minute video summarizing your project. This video should be visually compelling and should not miss the “forest for the trees” – don’t get lost in technical details. Imagine your aunt and uncle watching this video – would they know what’s going on? Would they find it approachable and engaging? For inspiration for what makes a good explanatory video, watch videos from the following series:

Once you're working on producing your video, ask your friends (especially those who may not be as technically inclined) for feedback. Do they think it was engaging/easy to follow? Ask them their takeaways: did they get the message you were trying to communicate? Address their feedback to help you ensure the quality of your video. You're encouraged to use the audio-visual medium to the fullest to clearly present your project.

You'll submit your video as a .mp4 file to the instructional team (please test your file to make sure it plays before submitting).

Roles & Responsibilities

Each member of your team will articulate clearly specified, complementary roles. Each team member will have four responsibilities:

  1. Individual report contributions
  2. Individual project management contributions
  3. Machine learning experimental condition contribution
  4. Team support

For the individual contributions of roles 1-3, each team member selects one of the roles listed below in each section (only one per team member).

We'll walk through each of these four roles.

Role #1: Individual report contribution

The team will select one member of the team to take on each of the following technical roles (a fifth role only being required for teams of 5).

Please note: these recommended roles are meant to generally guide your assignments of roles within the team. You can modify these to meet the needs of your team, but each role should have a clearly distinguishable output that they are responsible for and this should be articulated in your project proposal.

Role ID | Primary Responsibility | Description
1-A | Abstract, Introduction, and Conclusions | These sections are critical for effectively communicating a project. The abstract gives the summary in brief, including key conclusions, and the introduction details the motivation for the work. The conclusions are the heart of any project that is well done. For any activity to have impact, the team needs to be able to successfully communicate what was done and why it was important and/or what was learned. This is more than a restatement of the results: it interprets the results and places them in the context of other work or potential impact on application domains. Conclusions must be grounded in the results from the work, attuned to the contributions of the project, and based on the background literature. The conclusions are expected to give meaning and significance to those results.
1-B | Background and References | You're in charge of putting the project in context: what is the problem or question, and why does it matter? You'll do this by reading up on similar work and describing what has been done in this space. You'll dive deeply into interesting projects that have been conducted and synthesize their key takeaways, connecting that to the presentation of results and ensuring that the results are placed in context. You'll also ensure that the references section of the report is compliant and that other team members are contributing references and incorporating citations.
1-C | Data: Preprocessing and Exploration | As part of writing up this section, you'll also be leading an exploratory data analysis of the dataset, data cleaning and preparation, and any relevant feature extraction. You're in charge of ensuring the data are prepared for the experimental conditions and applications of your project. There should be visualizations that describe your data included in this section.
1-D | Methods | Your job is to articulate the approach the team takes to solve the problem or answer the question at the core of the project. In this section, any experimental conditions or applications should be clearly articulated, including a flowchart. Whil
1-E | Results | You're in charge of collecting, compiling, and presenting performance metrics, and making sure they make sense across the results from the different workstreams. You will also be leading the creation of any ROC curves, PR curves, tables of results, and any other visualizations of outputs. Multiple team members may produce these, so you are not expected to make all the figures yourself; your role is to help bring them together into a coherent discussion of the results, combine plots when appropriate for comparison, etc. You may need to guide the team to what is needed to make this section compelling. You will lead the results section and be responsible for providing a coherent assessment of the results of your work.

Role #2: Project management role

Role ID | Primary Responsibility | Description
2-A | Project coordinator | The many moving parts to this project require coordination, and you're it. Your responsibilities include coordinating and convening weekly meetings on the project as well as facilitating an agenda for each meeting. You work with the team to create timelines for all of the key deliverables and hold the team accountable to them. As an appendix to the final report you will present a timeline of project activity articulating progress across the semester.
2-B | Proposal and progress report lead | You will lead the work on the proposal and progress report. This DOES NOT mean that you are the only person to contribute content, although you will likely contribute the most to these documents. Every team member is required to contribute to these documents and it's your job to make sure all the content comes together coherently. You will guide the team through the development of the project proposal and the progress report. You are responsible for the final review, quality assurance, and submission.
2-C | Report integrator lead | You're in charge of putting the pieces together in the report and ensuring that the report sounds like it was written in one unified voice. You will need to review the content, make sure the pieces flow reasonably well, and work with the authors to improve the content and help with any rewriting that may be necessary. In the end, you are responsible for ensuring the report is logical in flow.
2-D | Video lead | Consider yourself the director of the video. Your role is to lead the team in effectively communicating your results in an accurate and engaging manner. Well before you have your final results from the project, you can begin assembling the project motivation and the problem description, then add the findings of your project as the work is completed.
2-E | GitHub lead | Great GitHub repositories enable other data scientists to build on your work and have the added benefit of bringing significant attention to your work and abilities. You are responsible for ensuring that the final project content is well organized, documented, and commented, and could be easily used by other data scientists outside your team to reproduce your work. You will ensure the rest of the team contributes their content and ensure consistent quality across all components of the repository. What the GitHub repo looks like on final submission is ultimately your responsibility.

Role #3: Executing a machine learning experimental condition

Every member of the team is required to train and apply a machine learning algorithm to some of the data for the project. This will typically be an experimental condition for the project, and each member is responsible for writing up their assessment of the results of that experiment. Every member of the team will have a personal piece to work on related to applying the techniques from the course to the project.

Role #4: Supporting your teammates on achieving project success

While the above roles are detailed to provide your team with structure (much like you are likely to receive as work assignments on a job), these roles are meant to empower you to accomplish more together. Not every possible scenario or requirement of this project is covered by one of the above roles (although the coverage is quite high). The most important aspect of a successful team project is to be there to support one another. Be generous with your time as you're able and always be respectful.

Evaluation & Grading

The expectation of this project is to apply the techniques that you learned from this course to your application. Following the methodologies we discussed carefully, exercising rigor in ensuring the correct interpretation of your results, and clearly and accurately communicating those results are the keys to success. It's OK if the results and your algorithm's performance are not exceptional, as long as you followed the procedures we outlined carefully to ensure that you have produced results, that your results are trustworthy, and that you properly interpreted your findings. You're encouraged to reach out to the instructional team regularly with questions about your project as they arise.

The grading for this project will be assigned as follows:

Component | Weight | Description
Team Proposal | 5% | The two criteria that will be assessed are whether all of the requested content was included and whether the proposal seems well thought out and includes a reasonable plan. If these criteria are met, you will receive full credit. You may receive feedback on the proposal suggesting adjustments to your project plan. Early discussions on that feedback with the instructor and TAs are encouraged.
Progress Report | 5% | As long as you included the content requested in the progress report and your team demonstrates that progress proportional to the halfway point of the project has been reached, your team will receive full credit.
Final Report | 60% | The 60% for the final report is assigned as follows: (1) 15% for individual team roles (5% each) and (2) 45% team grade for the overall report.
Video Presentation | 15% | The goal of the video is to quickly present your motivation, methodology, and findings to a general audience. The video should tell a clear story and should not miss the “forest for the trees”: don’t get lost in technical details. Is the content clear, accurate, and engaging?
GitHub Repository | 5% | Your GitHub repo will be evaluated on whether it (a) contains a descriptive README.md file that explains what the repo is for and how to use the code to reproduce your work (including how to set it up to run), (b) is well commented throughout all files, (c) lists all dependencies in a requirements.txt file, (d) informs the user how to get the data and includes all preprocessing code, and (e) actually runs (i.e., we can successfully test it) and does what it says.
Peer Evaluation | 10% | All criteria in the peer evaluation are equally weighted, and the evaluation is averaged across team members.
Total | 100% |
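
As a worked example of how the weights above combine, here is a small sketch. The weights are taken from the grading table; the function name `project_grade`, the component keys, and the example scores are hypothetical, and each score is assumed to be the fraction of credit earned (between 0 and 1) for that component.

```python
# Weights (in percentage points) from the grading table above.
WEIGHTS = {
    "proposal": 5,
    "progress_report": 5,
    "final_report_individual_roles": 15,
    "final_report_team": 45,
    "video": 15,
    "github": 5,
    "peer_evaluation": 10,
}


def project_grade(scores):
    """Combine per-component scores (fractions in [0, 1]) into a percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example: full credit everywhere except the video (80%).
example_scores = {name: 1.0 for name in WEIGHTS}
example_scores["video"] = 0.8

print(f"Total weight: {sum(WEIGHTS.values())}%")
print(f"Project grade: {project_grade(example_scores):.1f}%")
```

In this example the only lost credit is 20% of the 15-point video component, so the total drops by 3 points.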

Peer Evaluation

Since this is a team project, you will also be evaluated by your teammates (and yourself). This is a chance to offer each other feedback and reflect on your own performance. You will be rating your fellow team members on the following criteria:

  1. Was dependable in attending meetings to work on the project
  2. Did work accurately and completely
  3. Completed work on time
  4. Contributed positively to team discussions
  5. Helped others when needed
  6. Responded to communications in a timely manner
  7. Treated other team members respectfully
  8. Demonstrated a positive attitude about the team and its work
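
Because the eight criteria are equally weighted and the evaluation is averaged across team members, the calculation might look like the sketch below. The 1-to-5 rating scale, the function name `peer_score`, and the example ratings are assumptions for illustration; the actual rating scale is not specified here.

```python
from statistics import mean

N_CRITERIA = 8  # number of criteria in the list above


def peer_score(ratings_by_rater):
    """Average one person's ratings: first across criteria, then across raters."""
    per_rater = []
    for rater, ratings in ratings_by_rater.items():
        if len(ratings) != N_CRITERIA:
            raise ValueError(f"{rater} must rate all {N_CRITERIA} criteria")
        # Equal weighting of criteria is a plain mean.
        per_rater.append(mean(ratings))
    # Each rater (including the self-evaluation) counts equally.
    return mean(per_rater)


# Hypothetical ratings received by one team member on an assumed 1-5 scale.
ratings = {
    "self": [5, 5, 5, 5, 5, 5, 5, 5],
    "teammate_a": [4, 4, 4, 4, 4, 4, 4, 4],
    "teammate_b": [4, 4, 4, 4, 5, 5, 5, 5],
}

print(f"Peer evaluation score: {peer_score(ratings):.2f} out of 5")
```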

Project Ideas

As you're developing ideas for your project, explore active competitions on AICrowd, Zindi, Kaggle, DrivenData, and other machine learning competition pages. You can use these competitions as a starting point for a project. You may also want to draw inspiration from projects in the community; for example, the AI for Good repository has a number of projects to explore.

Dataset Ideas

What makes for an interesting dataset to explore? The dataset generally needs to have enough samples, features, and labels to enable a meaningful analysis. This rules out options like the Iris, Titanic, and all other "introductory" datasets for which you can find dozens of tutorials walking through the analysis. You want to be able to journey into the unknown of the data: be bold and pick a dataset and application that excites you!

Potential sources for datasets: