Assignment 1
Probability, Linear Algebra, and Computational Programming
Version 1: Updated 12/11/24
Instructions
Instructions for all assignments can be found here. Note: this assignment falls under collaboration Mode 2: Individual Assignment – Collaboration Permitted. Please refer to the syllabus for additional information. Please be sure to list the names of any students that you worked with on this assignment. Total points in the assignment add up to 90; an additional 10 points are allocated to professionalism and presentation quality.
Note: for all other assignments in this course, write out equations and math using markdown and LaTeX. I recommend that you complete the work on paper before typing up the final version. For this assignment only, you may alternatively hand-write your work for any parts of questions that require mathematical analysis. You will need to digitize them in and place them in the proper order for your final PDF. Either way, show your math including any intermediate steps necessary to understand the logic of your solution. If we are not able to interpret your meaning, no credit will be given.
Learning Objectives
The purpose of this assignment is to provide a refresher on fundamental concepts that we will use throughout this course and provide an opportunity to develop skills in any of the related skills that may be unfamiliar to you. Through the course of completing this assignment, you will…
- Refresh you knowledge of probability theory including properties of random variables, probability density functions, cumulative distribution functions, and key statistics such as mean and variance.
- Revisit common linear algebra and matrix operations and concepts such as matrix multiplication, inner and outer products, inverses, the Hadamard (element-wise) product, eigenvalues and eigenvectors, orthogonality, and symmetry.
- Practice numerical programming, core to machine learning, by applying it to scenarios of probabilistic modeling, linear algebra computations, loading and filtering data and plotting data.
- Apply your skills altogether through an exploratory data analysis to practice data cleaning, data manipulation, interpretation, and communication.
We will build on these concepts throughout the course, so use this assignment as a catalyst to deepen your knowledge and seek help with anything unfamiliar.
For references on the topics in this assignment, please check out the resources page on the course website for online materials such as books and courses to support your learning.
Note: don’t worry if you don’t understand everything in the references above - some of these books dive into significant minutia of each of these topics.
Exercise 1 - Probability
1A. Probabilistic Reasoning I. You are handed three fair dice and roll them sequentially. What’s the probability of the sum of the dice is 10 if the result of the roll of the first die is a 1?
1B. Probabilistic Computation I. Replicate the scenario in part A and simulate the the scenario by creating 1 million synthetic rolls of the three dice and determine what fraction of outcomes has an outcome of 1 from the first die and a sum of 10 across the three die.
1C. Probabilistic Reasoning II. A test for a rare disease has a 95% chance of detecting the disease if a person has it (true positive rate) and a 3% chance of wrongly detecting it if a person does not have it (false positive rate). If 1 in 1,000 people actually have the disease, what is the probability that a randomly chosen person who tests positive actually has the disease?
1D. Discrete Probability Theory. A discrete random variable \(X\) is distributed as follows (probability mass function):
\(P(X = x) = \begin{cases} 0.2 & x = -1 \\ 0.5 & x = 0 \\ 0.3 & x = 1 \end{cases}\)
What is the expected value, \(E_X[X]\) and variance, \(Var_X(X)\) of the random variable \(X\)?
Exercise 2 - Probability Distributions and Modeling
You’ve been asked to create a model of wait time for customers at the local Registry of Motor Vehicles. They’re open 8 hours a day and the maximum wait time is 8 hours (we won’t assume it’s possible not to be seen, assuming you’re willing to wait). Define the continuous random variable \(X = [\text{wait time for service as a fraction of 8 hours}]\) . This means that \(x=1\) represents a full day’s wait, or 8 hours, \(x=0.5\) represents half a day wait or 4 hours. Additionally, the valid values of \(X\) are between 0 and 1 (\(0\leq x \leq 1\)).
2A. Probability Density Functions (PDFs). Compute the value of \(\alpha\) that makes \(f_X(x)\) a valid probability density function:
\(f_X(x) = \begin{cases} \alpha e^{-x} & 0 \leq x \leq 1 \\ 0 & \text{else} \end{cases}\)
2B. Cumulative Distribution Functions (CDFs). Compute the cumulative distribution function (CDF) of \(X\), \(F_X(x)\), where \(F_X(x)=P(X<x)\) (here, \(P(\cdot)\) represents the probability of the event within the brackets). Be sure to indicate the value of the CDF for all values of \(x\in(-\infty,\infty)\).
2C. Expected Value. Compute the expected value of \(X\), \(E_X[X]\) Provide this value exactly (no approximations) and a numerical approximation to 3 decimal places. Also provide the approximate number of hours waiting (to 3 decimal places).
2D. Variance. Compute the variance of \(X\), \(Var(X)\) exactly (no approximation) and a numerical approximation to 3 decimal places. Using the variance, calculate the standard deviation. Express this standard deviation in both the original units and units of hours.
2E. Plotting your functions. Create plots of the PDF and CDF above as a subplot with one row and two columns.
2F. Computational probability for synthetic data. Create a numerical simulation of this process. EXPLAIN HOW THIS CAN BE DONE USING RANDOM VARIABLES AND CDF USING A UNIFORM DISTRIBUTION AND THE INVERSE OF THE CDF
Once you have your function to create synthetic samples, use it to generate 10,000 samples from the distribution. Using those samples, compute the mean and standard deviation of the values. How do they compare to the values you previously calculated analytically?
2G. Computing your simulated PDF and CDF. Repeat part E, but this time use your synthetic data only - do not assume a functional form for your data. Plot your empirical PDF and CDF. Plot these on the same axes as your original analytic functions and see how they compare.
2H. Computing probabilities. Having a way of generating synthetic data can allow us to easily compute probabilities. Compute the probabiities of the following events using the synthetic data samples that you generated:
- Wait time is less than 1 hour
- Wait time is more than 6 hours
- Wait time is between 3 and 5 hours
2I. Computing empirical PDF and CDF. You’re given some sample data of actual wait times experienced over a 1-year period. Using these data, create an empirical PDF and CDF and plot those as compared to the original model PDF and CDF above from A and B above. What is the mean and standard deviation of these empirical distributions? Discuss how similar the distributions appear to be and comment on whether or not you would recommend using the model from A and B for these data.
Exercise 3 - Linear Algebra Operations and Theory
3A. Matrix manipulations and multiplication. Machine learning involves working with many matrices and understanding what their products represent, so this exercise will provide you with the opportunity to practice those skills.
Let \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\), \(\mathbf{b} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\), \(\mathbf{c} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\)
Compute the following by hand or indicate that it cannot be computed. Also note for each case whether the operation is an inner product, outer product, element-wise product, another type of operation. For any cases where an operation is invalid and cannot be computed, explain why it is invalid.
- \(\mathbf{A}\mathbf{A}\)
- \(\mathbf{A}\mathbf{A}^{\top}\)
- \(\mathbf{A}\mathbf{b}\)
- \(\mathbf{A}\mathbf{b}^{\top}\)
- \(\mathbf{b}\mathbf{A}\)
- \(\mathbf{b}^{\top}\mathbf{A}\)
- \(\mathbf{b}\mathbf{b}\)
- \(\mathbf{b}^{\top}\mathbf{b}\)
- \(\mathbf{b}\mathbf{b}^{\top}\)
- \(\mathbf{A}\circ\mathbf{A}\)
- \(\mathbf{b}\circ\mathbf{c}\)
- \(\mathbf{b}^{\top}\mathbf{b}^{\top}\)
- \(\mathbf{b} + \mathbf{c}^{\top}\)
- \(\mathbf{A}^{-1}\mathbf{b}\)
- \(\mathbf{b}^{{\top}}\mathbf{A}\mathbf{b}\)
- \(\mathbf{b}\mathbf{A}\mathbf{b}^{{\top}}\)
Note: The element-wise (or Hadamard) product is the product of each element in one matrix with the corresponding element in another matrix, and is represented by the symbol “\(\circ\)”.
3B. Matrix manipulations and multiplication using Python. Repeat the above, but this time using Python. If you are using a vector, make sure the dimensions of the vector match what you’d expect, for example, matrix \(\mathbf{b}\) is a \([2 \times 1]\) vector. In NumPy, unless you’re specify, you’ll like create a one-dimensional array of length 2 rather than a \([2 \times 1]\) vector if you don’t specify - be careful of this potential pitfall. Refer to NumPy’s tools for handling matrices. There may be circumstances when Python will produce an output, but based on the dimensions of the matrices involved, the linear algebra operation is not possible. Note these cases and explain why they occur. Please provide both the Python code AND the output of that code showing your result.
3C. Vector Norms. Norms are the effective lengths of vectors. For example, the Euclidean norm, or \(L_2\) norm, is one of several common norms and is calculated for a vector \[\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\] by \[||\mathbf{x}||_2 = \sqrt{\displaystyle \sum_{k=1}^n x_k^2}\]
What is the \(\ell_2\) norm of vectors \(\mathbf{b}\) and \(\mathbf{c}\) from part A?
3D. Orthogonality and unit vectors. Find all vectors that are orthonormal to \(\mathbf{d} = \begin{bmatrix} 3 \\ -5 \\ 0 \end{bmatrix}\). Vector \(\mathbf{x}_1\) is orthonormal to vector \(\mathbf{x}_2\) if (a) \(\mathbf{x}_1\) is orthogonal to vector \(\mathbf{x}_2\) AND \(\mathbf{x}_1\) is a unit vector. Orthogonal vectors are perpendicular, which implies that their inner product is zero. A unit vector is of length 1 (meaning its \(L_2\) norm is 1). We frequently encounter orthogonal and orthonormal vectors when dealing with dimensionality reduction, for example in Principal Components Analysis.
3E. Eigenvectors and eigenvalues. Eigenvectors and eigenvalues are useful for some machine learning algorithms, but the concepts take time to solidly grasp. They are used extensively in machine learning including in Principal Components Analysis (PCA) and clustering algorithms. For an intuitive review of these concepts, explore this interactive website at Setosa.io. Also, the series of linear algebra videos by Grant Sanderson of 3Brown1Blue are excellent and can be viewed on youtube here. For these questions, numpy may once again be helpful.
- Calculate the eigenvalues and corresponding eigenvectors of matrix \(\mathbf{B}= \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}\)
- Choose one of the eigenvector/eigenvalue pairs, \(\mathbf{v}\) and \(\lambda\), and show that \(\mathbf{B} \mathbf{v} = \lambda \mathbf{v}\). This relationship extends to higher orders: \(\mathbf{B} \mathbf{B} \mathbf{v} = \lambda^2 \mathbf{v}\)
- Show that the eigenvectors are orthogonal to one another (e.g. their inner product is zero - just compute the inner product and show it is approximately 0). This is true for eigenvectors from real, symmetric matrices. In three dimensions or less, this means that the eigenvectors are perpendicular to each other. Typically we use the orthogonal basis of our standard x, y, and z, Cartesian coordinates, which allows us, if we combine them linearly, to represent any point in a 3D space. But any three orthogonal vectors can do the same. This property is used, for example, in PCA to identify the dimensions of greatest variation.
Exercise 4 - Numerical Programming with Data
Loading data and gathering insights from a real dataset. In data science, we often need to have a sense of the idiosyncrasies of the data, how they relate to the questions we are trying to answer, and to use that information to help us to determine what approach, such as machine learning, we may need to apply to achieve our goal. This exercise provides practice in exploring a dataset and answering question that might arise from applications related to the data.
Your objective. For this dataset, your goal is to answer questions A-E about electricity generation in the United States.
Data. The data for this problem can be found in the data
subfolder in the assignments
folder on github. The filename is a1_egrid2016.xlsx
. This dataset is the Environmental Protection Agency’s (EPA) Emissions & Generation Resource Integrated Database (eGRID) containing information about all power plants in the United States, the amount of generation they produce, what fuel they use, the location of the plant, and many more quantities. We’ll be using a subset of those data.
The fields we’ll be using include:
field | description |
---|---|
SEQPLT16 | eGRID2016 Plant file sequence number (the index) |
PSTATABB | Plant state abbreviation |
PNAME | Plant name |
LAT | Plant latitude |
LON | Plant longitude |
PLPRMFL | Plant primary fuel |
CAPFAC | Plant capacity factor |
NAMEPCAP | Plant nameplate capacity (Megawatts MW) |
PLNGENAN | Plant annual net generation (Megawatt-hours MWh) |
PLCO2EQA | Plant annual CO2 equivalent emissions (tons) |
For more details on the data, you can refer to the eGrid technical documents. For example, you may want to review page 45 and the section “Plant Primary Fuel (PLPRMFL)”, which gives the full names of the fuel types including WND for wind, NG for natural gas, BIT for Bituminous coal, etc.
There also are a couple of “gotchas” to watch out for with this dataset: - The headers are on the second row and you’ll want to ignore the first row (they’re more detailed descriptions of the headers). - NaN values represent blanks in the data. These will appear regularly in real-world data, so getting experience working with these sorts of missing values will be important.
Questions to answer:
4A. Which power plant generated the most energy in 2016 (measured in MWh)?
Which power plant produced the most CO2 emissions (measured in tons)?
Which power plant produced the most CO2 emissions (measured in tons)?
4B. What is the name of the northern-most power plant in the United States?
4C. What is the state where the northern-most power plant in the United States is located?
4D. Plot a bar plot showing the amount of energy produced by each fuel type across all plants.
4E. From the plot in (D), which fuel for generation produces the most energy (MWh) in the United States?
Which state has the largest number of hydroelectric plants? In this case, each power plant counts once so regardless of how large the power plant is, we want to determine which state has the most of them. Note the primary fuel for hydroelectric plants is listed as water in the documentation.
Which state(s) has generated the most energy (MWh) using coal? If there are more than one, list the state abbreviations in alphabetical order, separated with commas (but no spaces). You may also want to explore the documentation for the isin()
method for pandas. Note: in the eGrid documentation, there are multiple types of coal listed; be sure to factor in each type of coal.
Appendix: List of helpful identities
Below is a list of potentially helpful identities and equations for reference.
Identities and equations | Description |
---|---|
\(E_X[X] = \displaystyle \int_{-\infty}^{\infty} x f_X(x) dx\) | Expected value of continuous random variable \(X\) |
\(Var_X(X) = E_X[X^2]-E_X[X]^2\) | Variance of random variable \(X\) |
\(\displaystyle P( X \vert Y)= \frac{P(X \cap Y)}{P(Y)}\) | Conditional probability of event \(X\) given event \(Y\) has occurred |
\(\displaystyle P(Y \vert X)= \frac{P(X \vert Y)P(Y)}{P(X)}\) | Bayes’ Rule |
\(\displaystyle F_X(x) = \int_{-\infty}^{x} f_X(x) dx\) | CDF as a function of PDF |
\(\displaystyle f_X(x) = \frac{dF_X(x)}{dx}\) | PDF as a function of CDF |
\(P(X \leq x) = F_X(x)\) | Probabilistic definition of the CDF |
\(P(a < X \leq b) = F_X(b) - F_X(a)\) | Probability the \(X\) lies between \(a\) and \(b\) |
\(P(A) + P(\bar{A}) = 1\) | The sum of the probability of an event and its complement is 1 |
$P(Y) = P(Y X)P(X) + P(Y {X})P({X}) $ | Law of Total Probability |
\(\displaystyle \int e^{-x} dx = -e^{-x}\) | Indefinite integral |
\(\displaystyle \int x e^{-x} dx = -e^{-x}(x+1)\) | Indefinite integral |
\(\displaystyle \int x^2 e^{-x} dx = -e^{-x}(x^2+2x+2)\) | Indefinite integral |
\(\displaystyle \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad-cb} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\) | \(2 \times 2\) matrix inversion formula |