CS/ECE/STAT-861: Theoretical Foundations of Machine Learning
University of Wisconsin–Madison, Fall 2023
Homework 0. Due 9/15/2023, 11:00 am
Instructions:
- Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.
- I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting hand-written homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
- You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
- One of the objectives of Homework 0 is to test if you are familiar with the necessary background topics to take this class. While I do not expect you to know the solutions right away, you should be able to solve most of the questions with reasonable effort after looking up any necessary references. If you find this homework exceedingly difficult (especially problems 1 and 2), please talk to me.
- Collaboration: You are allowed to collaborate on problem 3 of this homework in groups of size up to 3. If you do so, please indicate your collaborators at the top of your solution. You are not allowed to collaborate on problems 1 and 2.
Problem 1: Estimating the mean of a normal distribution

You are given $n$ independent samples $X_1, \ldots, X_n$ sampled from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with unknown mean $\mu$, but known variance $\sigma^2$. You wish to estimate the mean $\mu$. One straightforward option is to estimate $\mu$ using the sample mean $\hat{\mu}_{\mathrm{SM}}$, defined below:
\[
\hat{\mu}_{\mathrm{SM}} = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]
1. [2 pts] To quantify the performance of this estimator, we define the risk $R$, which is simply the expected squared error of an estimator,
\[
R(\hat{\mu}_{\mathrm{SM}}, \mu) = \mathbb{E}\big[(\hat{\mu}_{\mathrm{SM}} - \mu)^2\big].
\]
Here, the expectation $\mathbb{E}$ is with respect to the randomness in the data. Show that the risk is $\sigma^2/n$ for the sample mean estimator.
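This is not required for the proof, but the claim is easy to sanity-check by simulation. The sketch below uses only the Python standard library; the helper `empirical_risk` and the parameter values are illustrative choices, not part of the assignment.

```python
import random

def empirical_risk(mu, sigma, n, trials=20000, seed=0):
    """Monte Carlo estimate of E[(mu_hat_SM - mu)^2] for the sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # draw n samples from N(mu, sigma^2) and form the sample mean
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (mu_hat - mu) ** 2
    return total / trials

mu, sigma, n = 0.7, 2.0, 50
risk_mc = empirical_risk(mu, sigma, n)
risk_theory = sigma ** 2 / n   # sigma^2 / n = 0.08 for these values
```

For these settings the Monte Carlo estimate should land within a few percent of $\sigma^2/n$, and the agreement holds for any choice of $\mu$, since the sample mean is unbiased.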
2. [4 pts] (Concentration) While the risk measures how well an estimator does in expectation, sometimes we also wish to know that $\hat{\mu}_{\mathrm{SM}}$ is within some margin of error of the true mean with high probability. Prove the following result for any $\epsilon > 0$:
\[
\mathbb{P}\big(|\hat{\mu}_{\mathrm{SM}} - \mu| > \epsilon\big) \leq 2 \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right),
\]
where the probability $\mathbb{P}$ is with respect to the randomness in the data.
You may use the following facts about normal random variables:
- If $X_1, \ldots, X_n$ are normal, then so is $\sum_{i=1}^{n} X_i$. (You will need to compute the mean and variance.)
- If $X$ is normal, then so is $aX$ for any $a \in \mathbb{R}$. (You will need to compute the mean and variance.)
- If $Z \sim \mathcal{N}(0, 1)$ is a standard normal random variable, then $\mathbb{P}(|Z| > \epsilon) \leq 2e^{-\epsilon^2/2}$.
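As an illustrative (non-required) check of the inequality to be proved, one can compare the empirical tail probability of the sample mean against the stated bound; the parameter values below are arbitrary.

```python
import math
import random

def tail_probability(mu, sigma, n, eps, trials=20000, seed=1):
    """Monte Carlo estimate of P(|mu_hat_SM - mu| > eps)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        hits += abs(mu_hat - mu) > eps
    return hits / trials

mu, sigma, n, eps = 0.0, 1.0, 25, 0.3
empirical = tail_probability(mu, sigma, n, eps)          # roughly 0.13
bound = 2 * math.exp(-n * eps ** 2 / (2 * sigma ** 2))   # roughly 0.65
```

The bound is loose at these settings (the exact tail is $2\Phi(-\epsilon\sqrt{n}/\sigma)$), but it decays at the correct exponential rate in $n$.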
3. [2 pts] (Sample complexity) Suppose you are given some $\epsilon > 0$ and $\delta \in (0, 1)$. You wish to collect enough samples so that your estimator is within an $\epsilon$ margin of error with probability at least $1 - \delta$. Show that if
\[
n \geq \frac{2\sigma^2}{\epsilon^2} \log(2/\delta),
\]
we will have the following guarantee: $\mathbb{P}(|\hat{\mu}_{\mathrm{SM}} - \mu| > \epsilon) \leq \delta$.
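The sample-size condition translates directly into code. A minimal sketch (the helper name and the example values are illustrative):

```python
import math

def sample_complexity(sigma, eps, delta):
    """Smallest integer n satisfying n >= (2 sigma^2 / eps^2) * log(2 / delta)."""
    return math.ceil(2 * sigma ** 2 / eps ** 2 * math.log(2 / delta))

# e.g. a margin of 0.1 with 95% confidence when sigma = 1
n_needed = sample_complexity(sigma=1.0, eps=0.1, delta=0.05)   # 738
```

Note the mild $\log(2/\delta)$ dependence on the failure probability, versus the much stronger $1/\epsilon^2$ dependence on the accuracy.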
N.B: In the first part of the class, we will study a procedure called empirical risk minimization for learning, which essentially chooses a model that performs well on the observed data. After learning, we need to demonstrate that our learned model will do well on future unobserved data. Hence, we need to find conditions under which a model's performance on the observed dataset will translate to its future generalization performance. Concentration will be an important tool in such proofs.
Problem 2: Which estimator is better?

The sample mean is just one of several possible estimators for $\mu$. Student A proposes the following alternative estimator $\hat{\mu}_\alpha$ with some parameter $\alpha \in (0, 1)$,
\[
\hat{\mu}_\alpha = \frac{\alpha}{n} \sum_{i=1}^{n} X_i.
\]
In this question, we will explore if and when $\hat{\mu}_\alpha$ could be a better estimator than $\hat{\mu}_{\mathrm{SM}}$.
1. [2 pts] (Bias–variance decomposition) First, show that the following holds for any estimator $\hat{\theta}$,
\[
R(\hat{\theta}, \mu) = \mathbb{E}\big[(\hat{\theta} - \mu)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \mu\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}.
\]
2. [2 pts] Using the result from part 1, compute the risk of the estimator $\hat{\mu}_\alpha$. Note that, unlike the sample mean, the risk of $\hat{\mu}_\alpha$ depends on the true mean $\mu$.
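Once you have derived the risk of $\hat{\mu}_\alpha$ via the bias–variance decomposition, a simulation check like the one below can catch algebra mistakes. It uses $\mathbb{E}[\hat{\mu}_\alpha] = \alpha\mu$ and $\mathrm{Var}(\hat{\mu}_\alpha) = \alpha^2\sigma^2/n$, which follow from the facts listed in problem 1; all numeric values are illustrative.

```python
import random

def risk_alpha_mc(alpha, mu, sigma, n, trials=20000, seed=2):
    """Monte Carlo risk of mu_hat_alpha = (alpha / n) * sum(X_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        est = alpha * sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (est - mu) ** 2
    return total / trials

alpha, mu, sigma, n = 0.8, 1.0, 1.0, 20
bias_sq = (alpha * mu - mu) ** 2          # squared bias: 0.04
variance = alpha ** 2 * sigma ** 2 / n    # variance: 0.032
risk_pred = bias_sq + variance            # predicted risk: 0.072
risk_mc = risk_alpha_mc(alpha, mu, sigma, n)
```

The Monte Carlo estimate should agree closely with the decomposition's prediction, and repeating this with different $\mu$ makes the $\mu$-dependence of the risk visible.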
3. [2 pts] Show that there exists at least one value for $\mu$ such that $\hat{\mu}_\alpha$ is a strictly better estimator than $\hat{\mu}_{\mathrm{SM}}$. That is, show that there exists $\mu \in \mathbb{R}$ such that, for all $\alpha \in (0, 1)$, we have $R(\hat{\mu}_\alpha, \mu) < R(\hat{\mu}_{\mathrm{SM}}, \mu)$.
4. [4 pts] (Maximum risk) Despite the result from part 3, Student B is not satisfied with Student A's proposition, as an estimator should perform well for all values of $\mu$, and not just for one value of $\mu$. In particular, they argue that the worst-case risk over all $\mu$ should be small. They propose the following criterion, the maximum risk $\bar{R}$, as a way to measure how well an estimator performs:
\[
\bar{R}(\hat{\theta}) = \sup_{\mu \in \mathbb{R}} R(\hat{\theta}, \mu) = \sup_{\mu \in \mathbb{R}} \mathbb{E}\big[(\hat{\theta} - \mu)^2\big].
\]
(a) Compute $\bar{R}(\hat{\mu}_{\mathrm{SM}})$ and $\bar{R}(\hat{\mu}_\alpha)$.
(b) Based on the above answers, which estimator would you choose?
5. [5 pts] (Maximum risk over a bounded domain) Suppose we had prior knowledge that $\mu \in [0, 1]$. While Student A agrees with Student B's criterion, they argue that we should modify the definition of the maximum risk to incorporate this prior knowledge. They propose the following definition:
\[
\bar{R}(\hat{\theta}) = \sup_{\mu \in [0, 1]} R(\hat{\theta}, \mu) = \sup_{\mu \in [0, 1]} \mathbb{E}\big[(\hat{\theta} - \mu)^2\big].
\]
(a) Compute $\bar{R}(\hat{\mu}_{\mathrm{SM}})$ and $\bar{R}(\hat{\mu}_\alpha)$, the maximum risk for the two estimators discussed above.
(b) Is there any particular value of $\alpha$ (possibly dependent on $n$ and $\sigma$) for which $\bar{R}(\hat{\mu}_\alpha) < \bar{R}(\hat{\mu}_{\mathrm{SM}})$?
(c) Based on the above answer, which estimator would you choose? Intuitively, explain the discrepancy in the conclusions in part 4 and part 5.
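Before tackling part (b) analytically, it can help to probe the bounded-domain maximum risk numerically. The sketch below grid-approximates the supremum over $\mu \in [0, 1]$ using the closed-form risk of $\hat{\mu}_\alpha$ that you compute in part 2; the parameter values are illustrative, not part of the problem.

```python
def risk_alpha(alpha, mu, sigma, n):
    """Risk of mu_hat_alpha from the bias-variance decomposition."""
    return (1 - alpha) ** 2 * mu ** 2 + alpha ** 2 * sigma ** 2 / n

def max_risk_alpha(alpha, sigma, n, grid=1001):
    """Grid approximation of sup over mu in [0, 1] of R(mu_hat_alpha, mu)."""
    return max(risk_alpha(alpha, i / (grid - 1), sigma, n) for i in range(grid))

sigma, n = 1.0, 10
max_risk_sm = sigma ** 2 / n   # the risk of the sample mean does not depend on mu
sanity = max_risk_alpha(1.0, sigma, n)  # alpha = 1 recovers the sample mean
better_exists = max_risk_alpha(0.9, sigma, n) < max_risk_sm  # try alpha = 0.9
```

Whether such a numerical advantage exists, and for which $\alpha$, is exactly what part (b) asks you to characterize.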
N.B: In the second part of the class, we will explore the concept of minimax optimality. Here, we will design algorithms that have small maximum risk over a class of distributions, and establish lower bounds to prove that no other algorithm can have a significantly smaller maximum risk. We will start with some simple estimation problems, and extend the ideas to regression, classification, density estimation, online learning, and bandits.
Problem 3: Understanding exploration–exploitation trade-offs

This question is more difficult than the two previous questions. You are allowed to collaborate on this question with up to two classmates.

Consider the following game, which proceeds over $T$ rounds. You have access to two normal distributions $\nu_1 = \mathcal{N}(\mu_1, \sigma^2)$ and $\nu_2 = \mathcal{N}(\mu_2, \sigma^2)$, where $\sigma^2$ is known but $\mu_1, \mu_2 \in [0, 1]$ are not. On each round $t$, you get to choose one distribution $I_t \in \{1, 2\}$, where $I_t = i$ corresponds to choosing $\nu_i = \mathcal{N}(\mu_i, \sigma^2)$, and draw a sample from it. Once you draw the sample, you earn a monetary reward that is equal to the value of the sample. If the value of the sample is negative, you should pay that amount instead. Your total cumulative reward over $T$ rounds is $\sum_{t=1}^{T} X_t$, where $X_t \sim \nu_{I_t}$. We will measure how well we perform via our average regret, defined below:
\[
R_T = \max\{\mu_1, \mu_2\} - \frac{1}{T} \sum_{t=1}^{T} X_t.
\]
We wish to design an algorithm whose average regret vanishes¹ with $T$ in expectation, i.e., $\mathbb{E}[R_T] \to 0$ as $T \to \infty$.

Algorithm: A student proposes the following simple algorithm. First sample each of the distributions $N$ times (where $N < T/2$). Then, for the remaining $T - 2N$ rounds, sample the distribution with the highest observed sample mean using the $N$ samples. That is, $I_t$ is chosen as follows:
\[
I_t =
\begin{cases}
1 & \text{if } t \leq N, \\
2 & \text{if } N + 1 \leq t \leq 2N, \\
1 & \text{if } t > 2N \text{ and } \hat{\mu}_1 \geq \hat{\mu}_2, \\
2 & \text{if } t > 2N \text{ and } \hat{\mu}_1 < \hat{\mu}_2,
\end{cases}
\quad \text{where} \quad
\hat{\mu}_1 = \frac{1}{N} \sum_{t=1}^{N} X_t, \qquad
\hat{\mu}_2 = \frac{1}{N} \sum_{t=N+1}^{2N} X_t.
\]

¹ Intuitively, if we knew a priori which mean was larger, we would always pull the arm with the highest mean and have $\mathbb{E}[R_T] = 0$, as $\frac{1}{T} \mathbb{E}[\sum_t X_t] = \max\{\mu_1, \mu_2\}$. If $\mathbb{E}[R_T] \to 0$, this means we are able to learn which of the two distributions has a larger mean and converge towards the correct answer as we collect more samples.
For what follows, let $\Delta = |\mu_1 - \mu_2|$ denote the gap between the two means.
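The explore-then-commit procedure above is short to implement, and simulating it is a good way to build intuition for the parts below. This is an illustrative sketch; the function names, parameter values, and averaging over seeds are my own choices, not part of the problem.

```python
import random

def explore_then_commit(mu1, mu2, sigma, T, N, seed=0):
    """Run the two-phase algorithm once and return the average regret R_T."""
    rng = random.Random(seed)
    # exploration phase: N samples from each arm
    s1 = [rng.gauss(mu1, sigma) for _ in range(N)]
    s2 = [rng.gauss(mu2, sigma) for _ in range(N)]
    # commit phase: play the arm with the higher sample mean for T - 2N rounds
    mu_commit = mu1 if sum(s1) / N >= sum(s2) / N else mu2
    rest = [rng.gauss(mu_commit, sigma) for _ in range(T - 2 * N)]
    total_reward = sum(s1) + sum(s2) + sum(rest)
    return max(mu1, mu2) - total_reward / T

# average over independent runs to approximate E[R_T]
runs = 50
avg_regret = sum(explore_then_commit(0.9, 0.4, 1.0, T=2000, N=100, seed=s)
                 for s in range(runs)) / runs
```

With these settings the exploration phase alone contributes $N\Delta/T = 0.025$ to the expected regret, and the chance of committing to the wrong arm is tiny, so the measured regret is dominated by exploration; part 1 makes this decomposition exact.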
1. [5 pts] (Regret decomposition) Establish the following identity for the expected average regret:
\[
\mathbb{E}[R_T] = \frac{N\Delta}{T} + \frac{(T - 2N)\Delta}{T} \, \Phi\!\left(-\sqrt{\frac{N\Delta^2}{2\sigma^2}}\right).
\]
Here, $\Phi(x) = \mathbb{P}_{Z \sim \mathcal{N}(0,1)}(Z < x)$ is the CDF of the standard normal distribution.
2. [2 pts] Using the result from part 1 and the fact that $\mu_1, \mu_2 \in [0, 1]$, show the following upper bound on the expected average regret:
\[
\mathbb{E}[R_T] \leq \frac{N}{T} + \Delta \exp\left(-\frac{N\Delta^2}{4\sigma^2}\right).
\]
You may use the following property of standard normals, which is a one-sided version of the inequality given in problem 1: if $Z \sim \mathcal{N}(0, 1)$, then $\mathbb{P}(Z < -\epsilon) \leq e^{-\epsilon^2/2}$.
3. [3 pts] Use the result in part 2 to show the following upper bound:
\[
\mathbb{E}[R_T] \leq \frac{N}{T} + \frac{C}{\sqrt{N}}, \quad \text{where } C = \sqrt{2}\,\sigma\, e^{-1/2}.
\]
Hint: Consider the function $f(x) = \log(x) - \beta x^2$, where $\beta > 0$. What is the maximizer of $f$?
4. [2 pts] (An optimal choice of $N$) Specify a choice for $N$, depending only on $\sigma$ and $T$, so that the upper bound in part 3 is minimized. Are you able to achieve $\mathbb{E}[R_T] \to 0$ as $T \to \infty$? If so, at what rate does it go to zero?
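One way to sanity-check your choice of $N$ (a numerical sketch, not the requested derivation): brute-force the best integer $N$ for the part-3 bound at several horizons $T$, then inspect how the minimizer and the minimized bound scale with $T$ and compare against your closed-form answer.

```python
import math

def bound(N, T, C):
    """Upper bound from part 3: N / T + C / sqrt(N)."""
    return N / T + C / math.sqrt(N)

def minimize_bound(T, C):
    """Brute-force the best integer N in [1, T // 2) and the resulting bound."""
    best = min(range(1, T // 2), key=lambda N: bound(N, T, C))
    return best, bound(best, T, C)

C = math.sqrt(2) * math.exp(-0.5)   # the constant from part 3 with sigma = 1
results = {T: minimize_bound(T, C) for T in (1000, 8000, 64000)}
```

Each step multiplies $T$ by 8, so comparing the optimizing $N$ and the minimized bound across the three entries of `results` reveals the growth of $N^\star$ and the rate of convergence you derive analytically.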
5. [2 pts] (Exploration–exploitation trade-off) Let $N^\star$ denote the optimal choice in part 4. In words, explain what would happen had we chosen $N \ll N^\star$ or $N \gg N^\star$.
N.B: In the third part of the class, we will study several models for adaptive decision-making. The model discussed in this question is an example of a stochastic bandit, which is one paradigm for decision-making. In bandit settings, we often have to trade off between exploration (learning about the environment) and exploitation (leveraging what we have learned to maximize rewards). The above algorithm is a simple, albeit sub-optimal, procedure with an explicit exploration phase (first $2N$ rounds) and exploitation phase (last $T - 2N$ rounds). In class, we will look at better algorithms that manage this trade-off and have faster rates of convergence.