In this homework, we are going to implement a kNN classifier and get some practice with the statistical estimation concepts we went over in lecture. The assignment will help you understand how to apply the concepts from lecture to real applications.
How to Do This Assignment.
• Each question that you need to respond to is clearly marked in a blue “Task Box” with its corresponding
point-value listed.
• We prefer typeset solutions (LATEX / Word) but will accept scanned written work if it is legible. If a TA can’t
read your work, they can’t give you credit. Submit your solutions to Canvas as a zip including your report and
any code the assignment asks you to develop.
• Programming should be done in Python and numpy. If you don’t have Python installed on your machine, you can install it from https://www.python.org/downloads/. Instructions for installing numpy are at https://numpy.org/install/. You can also search the internet for numpy tutorials if you haven’t used it before. Google is your friend!
You are NOT allowed to…
• Use machine learning packages such as sklearn.
• Use data analysis packages such as pandas or seaborn.
• Discuss low-level details or share code / solutions with other students.
Advice. Start early. Start early. Start early. There are two sections to this assignment – one involving working with math (20% of the grade) and another focused more on programming (80% of the grade). Read the whole document before deciding where to start.
How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw1-report.pdf.
You will also submit test set predictions to a class-wide Kaggle competition. This is required to receive credit for Q10.
1 Statistical Estimation [10pts]
The Poisson distribution is a discrete probability distribution over the natural numbers starting from zero (i.e. 0,1,2,3,…).
Specifically, it models how many occurrences of an event will happen within a fixed interval. For example – how many
people will call into a call center within a given hour. Importantly, it assumes that individual events (a person calling the center) are independent and occur with a fixed probability. Say there is a 2% chance of someone calling in any given minute; then on average we would expect 1.2 calls per hour – but sometimes there would be more and sometimes
fewer. The Poisson distribution models that uncertainty over the total number of events in an interval.
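In this example, the rate is simply the per-minute chance scaled to the length of the interval:

    λ = 0.02 calls per minute × 60 minutes = 1.2 calls per hour.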
The probability mass function for the Poisson with parameter λ is given as:
Pois(X = x; λ) = λ^x e^(−λ) / x!,    ∀x ∈ {0, 1, 2, …},  λ ≥ 0        (1)
where the rate λ can be interpreted as the average number of occurrences in an interval. In this section, we’ll derive
maximum likelihood estimates for λ and show that the Gamma distribution is a conjugate prior to the Poisson.
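As a quick numerical sanity check while working through the derivations, Eq. (1) can be evaluated directly with plain Python and numpy. This is only a minimal sketch – the function name poisson_pmf is ours for illustration, not something the assignment requires:

    import numpy as np
    from math import factorial

    def poisson_pmf(x, lam):
        # Pois(X = x; lambda) = lambda^x * exp(-lambda) / x!
        return lam**x * np.exp(-lam) / factorial(x)

    # With lambda = 1.2 (the call-center example), P(X = 0) = e^(-1.2) ~ 0.301
    print(poisson_pmf(0, 1.2))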
I Q1 Maximum Likelihood Estimation of λ [4pts]. Assume we observe a dataset of occurrence counts
D = {x1, x2, …, xN } coming from N i.i.d random variables distributed according to Pois(X = x; λ). Derive
the maximum likelihood estimate of the rate parameter λ. To help guide you, consider the following steps:
1. Write out the log-likelihood function logP(D|λ).
2. Take the derivative of the log-likelihood with respect to the parameter λ.
3. Set the derivative equal to zero and solve for λ – call this maximizing value λ̂_MLE.
Your answer should make intuitive sense given our interpretation of λ as the average number of occurrences.
In lecture, we discussed how to use priors to encode beliefs about parameters before any data is seen. We showed
the Beta distribution was a conjugate prior to the Bernoulli – that is to say that the posterior of a Bernoulli likelihood
and Beta prior is itself a Beta distribution (i.e. Beta ∝ Bernoulli * Beta). The Poisson distribution also has a conjugate prior
– the Gamma distribution. Gamma is a continuous distribution over the range [0, ∞) with probability density function:
Gamma(Λ = λ; α, β) = β^α λ^(α−1) e^(−βλ) / Γ(α),    ∀λ ≥ 0,  α, β > 0        (2)
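Similarly, the density in Eq. (2) can be evaluated numerically if you want to build intuition for how α and β shape the prior. A minimal sketch (gamma_pdf is an illustrative name; the grid integral below is only a rough check that the density normalizes to about 1):

    import numpy as np
    from math import gamma as gamma_func

    def gamma_pdf(lam, alpha, beta):
        # Gamma(lambda; alpha, beta) = beta^alpha * lambda^(alpha-1) * exp(-beta*lambda) / Gamma(alpha)
        return beta**alpha * lam**(alpha - 1) * np.exp(-beta * lam) / gamma_func(alpha)

    # Rough check that the density integrates to ~1 for alpha = 3, beta = 2
    grid = np.linspace(1e-8, 50, 200_000)
    print(np.trapz(gamma_pdf(grid, 3.0, 2.0), grid))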
I Q2 Maximum A Posteriori Estimate of λ with a Gamma Prior [4pts]. As before, assume we observe
a dataset of occurrence counts D = {x1, x2, …, xN } coming from N i.i.d random variables distributed according to Pois(X = x; λ). Further, assume that λ is distributed according to a Gamma(Λ = λ; α, β). Derive
the MAP estimate of λ. To help guide you, consider the following steps:
1. Write out the log-posterior logP(λ|D) ∝ logP(D|λ) + logP(λ).
2. Take the derivative of logP(D|λ) + logP(λ) with respect to the parameter λ.
3. Set the derivative equal to zero and solve for λ – call this maximizing value λ̂_MAP.
In the previous question we found the MAP estimate for λ under a Poisson-Gamma model; however, we didn’t
demonstrate that Gamma was a conjugate prior to Poisson – i.e. we did not show that the product of a Poisson
likelihood and Gamma prior results in another Gamma distribution (or at least is proportional to one).
I Q3 Deriving the Posterior of A Poisson-Gamma Model [2pt]. Show that the Gamma distribution is
a conjugate prior to the Poisson by deriving expressions for parameters αP , βP of a Gamma distribution such
that P(λ|D) ∝ Gamma(λ; αP , βP ).
[Hint: Consider P(D|λ)P(λ) and group like terms/exponents. Try to massage the equation into looking like
the numerator of a Gamma distribution. The denominator can be mostly ignored if it is constant with respect to λ as we are only trying to show a proportionality (∝).]
2 k-Nearest Neighbor (kNN) [50pts]
In this section, we’ll implement our first machine learning algorithm of the course – k Nearest Neighbors. We are
considering a binary classification problem where the goal is to classify whether a person has an annual income more
or less than $50,000 given census information. Test results will be submitted to a class-wide Kaggle competition. As
no validation split is provided, you’ll need to perform cross-validation to select good hyperparameters.
Data Analysis. We’ve done the data processing for you for this assignment; however, getting familiar with the data
before applying any algorithm is a very important part of applying machine learning in practice. Quirks of the dataset
could significantly impact your model’s performance or how you interpret the outcome of your experiments. For
instance, if I have an algorithm that achieved 70% accuracy on this task – how good or bad is that? We’ll see!
We’ve split the data into two subsets – a training set with 8,000 labelled rows, and a test set with 2,000 unlabelled
rows. These can be downloaded from the data tab of the HW1 Kaggle as “train.csv” and “test_pub.csv” respectively.
Both files will come with a header row, so you can see which column belongs to which feature.
Below you will find a table listing the attributes available in the dataset. We note that categorizing some of these
attributes into two or a few categories is reductive (e.g. only 14 occupations) or might reinforce a particular set of social
norms (e.g. categorizing sex or race in particular ways). For this homework, we reproduced this dataset from its source
without modifying these attributes; however, it is useful to consider these issues as machine learning practitioners.
attribute name: type. list of values
id: numerical. Unique for each point. Don’t use this as a feature (it will hurt, badly).
age: numerical.
workclass: categorical. Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay
education-num: ordinal. 1:Preschool, 2:1st-4th, 3:5th-6th, 4:7th-8th, 5:9th, 6:10th, 7:11th, 8:12th, 9:HS-grad,
10:Some-college, 11:Assoc-voc, 12:Assoc-acdm, 13:Bachelors, 14:Masters, 15:Prof-school, 16:Doctorate
marital-status: categorical. Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent,
Married-AF-spouse
occupation: categorical. Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: categorical. Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: categorical. White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: categorical. 0:Male, 1:Female
capital-gain: numerical.
capital-loss: numerical.
hours-per-week: numerical.
native-country: categorical. United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico,
Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua,
Scotland, Thailand, Yugoslavia, El-Salvador, Trinada&Tobago, Peru, Hong, Holand-Netherlands
income: ordinal. 0: <=50K, 1: >50K. This is the class label. Don’t use this as a feature.
Our dataset has three types of attributes – numerical, ordinal, and nominal. Numerical attributes represent
continuous numbers (e.g. hours-per-week worked). Ordinal attributes are a discrete set with a natural ordering, for
instance different levels of education. Nominal attributes are also discrete sets of possible values; however, there is no
clear ordering between them (e.g. native-country). These different attribute types require different preprocessing. As
discussed in class, numerical fields have been normalized.
Nominal variables like workclass, marital-status, occupation, relationship, race, and native-country have been transformed into one column for each possible value, containing either a 0 or a 1. For example, the first instance in the training
set reads: [0, 0.136, 0.533, 0.0, 0.659, 0.397, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0] where all the zeros and ones correspond to these binarized variables. The following questions guide you through
exploring the dataset and help you understand some of the steps we took when preprocessing the data.
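For reference, here is one minimal way to load the files with numpy. This sketch assumes train.csv is in your working directory and that, as in the attribute table above, id is the first column and income is the last:

    import numpy as np

    # Load the training data, skipping the header row
    train = np.genfromtxt("train.csv", delimiter=",", skip_header=1)

    ids = train[:, 0]         # id column -- don't use as a feature
    X_train = train[:, 1:-1]  # feature columns
    y_train = train[:, -1]    # income label (0 or 1)

    print(X_train.shape, y_train.shape)  # expect (8000, d) and (8000,)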
I Q4 Encodings and Distance [2pt] To represent nominal attributes, we apply a one-hot encoding technique – transforming each possible value into its own binary attribute. For example, if we have an attribute
workclass with three possible values Private, State-gov, Never-worked – we would binarize the workclass attribute as shown below (each row is a single example data point):
The original field:

    workclass
    ------------
    Private
    State-gov
    Never-worked
    State-gov

Fields after binarization:

    workclass=Private    workclass=State-gov    workclass=Never-worked
    1                    0                      0
    0                    1                      0
    0                    0                      1
    0                    1                      0
A common naive preprocessing is to treat all categorical variables as ordinal – assigning increasing integers to each possible value. For example, such an encoding would say 1=Private, 2=State-gov, and 3=Never-worked. Contrast these two encodings. Focus on how each choice affects Euclidean distance in kNN.
I Q5 Looking at Data [3pt] What percent of the training data has an income >50k? Explain how
this might affect your model and how you interpret the results. For instance, would you say a model that
achieved 70% accuracy is a good or poor model? How many dimensions does each data point have (ignoring
the id attribute and class label)? [Hint: check the data, one-hot encodings increased dimensionality]
2.1 k-Nearest Neighbors
In this part, you will need to implement the kNN classifier algorithm with Euclidean distance. (If you are not familiar with Euclidean distance, refer back to slide 18 of L1.2 – kNN.)
The kNN algorithm is simple as you already have the preprocessed data. To make a prediction for a new data
point, there are three main steps:
1. Calculate the distance of that data point to every training example.
2. Get the k nearest points (i.e. the k with the lowest distance).
3. Return the majority class of these neighbors as the prediction.
Implementing this using native Python operations will be quite slow, but lucky for us there is the numpy package. numpy makes matrix operations quite efficient and easy to code. How will matrix operations help us? The next question walks you through figuring out the answer. Some useful things to check out in numpy: broadcasting, np.linalg.norm, np.argsort() or np.argpartition(), and array slicing.
I Q6 Norms and Distances [2pt] Distances and vector norms are closely related concepts. For instance,
the L2 norm of a vector x (defined below) can be interpreted as the Euclidean distance between x and the
zero vector:
||x||_2 = √( Σ_{i=1}^{d} x_i^2 )        (3)
Given a new vector z, show that the Euclidean distance between x and z can be written as an L2 norm.
[kNN implementation note for later: you can compute norms efficiently in numpy using np.linalg.norm.]
In kNN, we need to compute distance between every training example xi and a new point z in order to make a
prediction. As you showed in the previous question, computing this for one xi can be done by applying an arithmetic
operation between xi and z, then taking a norm. In numpy, arithmetic operations between matrices and vectors are
sometimes defined by “broadcasting”, even if standard linear algebra doesn’t allow for them. For example, given a
matrix X of size n × d and a vector z of size d, numpy will happily compute Y = X − z such that Y is the result of
the vector z being subtracted from each row of X. Combining this with your answer from the previous question can
make computing distances to every training point quite efficient and easy to code.
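Here is a minimal sketch of that pattern on toy data (the variable names and the random stand-in for the training matrix are ours for illustration; the same two lines apply to the real training matrix and a real query point):

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.random((8, 3))  # toy stand-in for the n x d training matrix
    z = rng.random(3)             # a single query point
    k = 3

    # Broadcasting: z is subtracted from every row of X_train at once,
    # then the L2 norm along axis=1 gives one Euclidean distance per row.
    dists = np.linalg.norm(X_train - z, axis=1)

    # Indices of the k nearest training points (argpartition avoids a full sort)
    nearest = np.argpartition(dists, k)[:k]
    print(nearest, dists[nearest])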
I Q7 Implement kNN [10pts] Okay, enough math and talk about numpy. It is time to get our hands dirty.
Let’s write some code! Implement k-Nearest Neighbors using Euclidean distance for this dataset in Python in
a file named knn.py. Conceptually, you will need to load the training and test data and have some function
to find the nearest neighbors for each test point to make a prediction.
2.2 K-Fold Cross Validation
We do not provide class labels for the test set or a validation set, so you’ll need to implement K-fold Cross Validation [1] to check that your implementation is correct and to select hyperparameters. As discussed in class, K-fold Cross Validation
divides the training set into K segments and then trains K models – leaving out one of the segments each time and
training on the others. Then each trained model is evaluated on the left-out fold to estimate performance. Overall
performance is estimated as the mean of these K fold performances.
[1] Note that K here has nothing to do with k in kNN – just overloaded notation.
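For example, one simple way to form the folds is to shuffle the training indices and cut them into K pieces (a minimal sketch; make_folds is an illustrative name and N = 8000 matches the training set size):

    import numpy as np

    def make_folds(N, K, seed=0):
        # Shuffle the example indices, then cut them into K roughly equal folds
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(N), K)

    folds = make_folds(N=8000, K=4)
    # For fold i: evaluate on folds[i], train on the concatenation of the rest
    val_idx = folds[0]
    train_idx = np.concatenate(folds[1:])
    print(len(val_idx), len(train_idx))  # 2000 6000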
I Q8 Implement 4-fold Cross Validation [8pts] Inside of your knn.py file, implement 4-fold cross validation for your kNN algorithm. Your implementation of cross validation should return the observed accuracies for each fold. We’ll use this in the next part.
I Q9 Hyperparameter Search [15pt] To search for the best hyperparameters, we will use cross-validation to estimate our accuracy on new data. For each k in 1, 3, 5, 7, 9, 99, 999, and 8000, report:
• accuracy on the training set when using the entire training set for kNN (call this training accuracy),
• the mean and variance of the 4-fold cross validation accuracies (call this validation accuracy).
What is the best number of neighbors (k) you observe? When k = 1, is training error 0%? Why or why not?
What trends (training and cross-validation accuracy) do you observe with increasing k? Do they relate to
underfitting and overfitting?
2.3 Kaggle Submission
Great work getting here. In this section, you’ll submit the predictions of your best model to the class-wide Kaggle
competition. You are free to make any modification to your kNN algorithm to improve performance; however, it must
remain a kNN algorithm! You can change distances, weightings, select subsets of features, etc.
I Q10 Kaggle Submission [10pt]. Submit a set of predictions to Kaggle that outperforms the baseline on
the public leaderboard. To make a valid submission, use the train set to build your kNN classifier and then
apply it to the test instances in test_pub.csv available from Kaggle’s Data tab. Format your output as a
two-column CSV as below:
id,income
0,0
1,0
2,0
3,0
4,0
5,1
6,0
...
You may submit up to 5 times a day. In your report, tell us what modifications you made to kNN for your
final submission.
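For reference, a minimal sketch of writing predictions in this format with numpy (test_ids and predictions are placeholder arrays here; in your code they would come from test_pub.csv and your classifier):

    import numpy as np

    # Placeholder data -- replace with the real test ids and your kNN predictions
    test_ids = np.arange(2000)
    predictions = np.zeros(2000, dtype=int)

    rows = np.column_stack([test_ids, predictions])
    np.savetxt("submission.csv", rows, fmt="%d", delimiter=",",
               header="id,income", comments="")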
Extra Credit and Bragging Rights [2.5pt Extra Credit]. The TA has made a submission to the leaderboard. Any
submission outperforming the TA on the private leaderboard at the end of the homework period will receive 2.5 extra
credit points on this assignment. Further, the top 5 ranked submissions will “win HW1” and receive bragging rights.
3 Debriefing (required in your report)
1. Approximately how many hours did you spend on this assignment?
2. Would you rate it as easy, moderate, or difficult?
3. Did you work on it mostly alone or did you discuss the problems with others?
4. How deeply do you feel you understand the material it covers (0%–100%)?
5. Any other comments?