CS/ECE/STAT-861: Theoretical Foundations of Machine Learning
University of Wisconsin-Madison, Fall 2023
Homework 2. Due 10/27/2023, 11:00 am
Instructions:
- Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.
- I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting hand-written homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
- You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
- Unless otherwise specified, you may use any result we have already proved in class. You do not need to prove it from scratch, but clearly state which result you are using.
- Solutions to some of the problems may be found as examples or exercises in the suggested textbooks or other resources. You are encouraged to try the problems on your own first before searching for the solution. If you find an existing solution, first read and understand the proof, and then write it in your own words. Please indicate any references you have used at the beginning of your solution when you turn in your homework.
- Collaboration: You are allowed to collaborate in groups of up to 3 on each problem. If you do so, please indicate your collaborators at the beginning of your solution.
1. Lower bounds with mixtures
In this question, you will prove a variant of the framework for minimax lower bounds developed in class, one that involves mixtures of distributions.
1. [5 pts] We observe data $S$ drawn from some distribution $P$ belonging to a family of distributions $\mathcal{P}$. We wish to estimate a parameter $\theta(P)$ of interest via a loss $\Phi \circ \rho$, where $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a non-decreasing function and $\rho : \Theta \times \Theta \to \mathbb{R}_+$ is a metric. Let $\mathcal{P}_1, \ldots, \mathcal{P}_N$ be subsets of $\mathcal{P}$, and let $\Lambda_j$ denote a prior on $\mathcal{P}_j$. Let $\bar{P}_j$ denote the mixture,
$$ \bar{P}_j(S \in A) = \mathbb{E}_{P \sim \Lambda_j}\big[ \mathbb{E}_{S \sim P}[\mathbf{1}(S \in A)] \big]. $$
Let $\delta = \min_{j \neq k} \inf_{P \in \mathcal{P}_j,\, P' \in \mathcal{P}_k} \rho(\theta(P), \theta(P'))$. Let $\Psi$ be a function which maps the data to $[N]$ and $\hat{\theta}$ be an estimator which maps the data to $\Theta$. Then, prove that
$$ R^* = \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_S\Big[ \Phi\big(\rho(\theta(P), \hat{\theta}(S))\big) \Big] \;\geq\; \Phi\!\left(\tfrac{\delta}{2}\right) \inf_{\Psi} \max_{j \in [N]} \bar{P}_j\big(\Psi(S) \neq j\big). $$
(Correction: the RHS previously said $\inf_{\hat{\theta}}$ instead of $\inf_{\Psi}$. KK)
A small sampling sketch of the mixture $\bar{P}_j$ is given after this problem, for intuition.
2. [3 pts] Suppose we observe $n$ i.i.d datapoints $S = \{X_1, \ldots, X_n\}$ drawn from some $P \in \mathcal{P}$. Let $\{P_0, P_1, \ldots, P_N\} \subset \mathcal{P}$. Let $\bar{P} = \frac{1}{N} \sum_{j=1}^{N} P_j^n$. Show that,
$$ R^* \;\geq\; \frac{\Phi(\delta/2)}{4} \exp\big( -\mathrm{KL}(P_0^n, \bar{P}) \big). $$
3. [2 pts] Using the result from part 2, briefly explain why using mixtures in the alternatives can (i) lead to tighter lower bounds, but (ii) be difficult to apply.
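For intuition on part 1: the mixture $\bar{P}_j$ is exactly the distribution of the data $S$ obtained by first drawing $P \sim \Lambda_j$ and then drawing $S \sim P$. The minimal sketch below illustrates this two-stage sampling with a hypothetical prior over three unit-variance Gaussians; the family, the prior, and the sample size are illustrative assumptions and not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_mixture(prior_means, n, rng):
    """Draw one dataset S ~ P_bar_j: first pick P ~ Lambda_j (here, a uniform
    prior over a finite set of N(mu, 1) distributions), then draw S ~ P^n."""
    mu = rng.choice(prior_means)                   # P ~ Lambda_j
    return rng.normal(loc=mu, scale=1.0, size=n)   # S = (X_1, ..., X_n) ~ P^n

# Hypothetical prior Lambda_j supported on three unit-variance Gaussians.
S = sample_from_mixture(prior_means=[-1.0, 0.0, 1.0], n=5, rng=rng)
print(S)
```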
2. Density estimation in a Hölder class
Let $H(2, L, B)$, defined below, denote the bounded second order Hölder class on $[0, 1]$. It consists of functions whose derivatives are $L$-Lipschitz.
$$ H(2, L, B) = \big\{ f : [0, 1] \to [0, B] \;;\; |f'(x_1) - f'(x_2)| \leq L |x_1 - x_2| \text{ for all } x_1, x_2 \in [0, 1] \big\} $$
Let $\mathcal{P}$ denote the set of distributions whose densities are in $H(2, L, B)$. We observe $n$ samples $S = \{X_1, \ldots, X_n\}$ drawn i.i.d from some $P \in \mathcal{P}$ and wish to estimate its density $p$ in the $L_2$ loss $\rho(p_1, p_2) = \|p_1 - p_2\|_2^2$. The minimax risk is
$$ R_n^* = \inf_{\hat{p}} \sup_{p \in H(2, L, B)} \mathbb{E}_S\big[ \|p - \hat{p}\|_2^2 \big]. $$
In this question, you will show that the minimax rate for this problem is $\Theta(n^{-4/5})$. (Recall from class that the minimax rate for a Hölder class of order $\beta$ is $O\big(n^{-2\beta/(2\beta+d)}\big)$ in $\mathbb{R}^d$.)
1. [15 pts] (Lower bound) Using Fano's method, or otherwise, show that $R_n^* \in \Omega(n^{-4/5})$.
2. [15 pts] (Upper bound) Design an estimator $\hat{p}$ for $p$ and bound its risk by $O(n^{-4/5})$.
Hint: If you choose to use a kernel density estimator, consider the first order Taylor expansion of $p$ and then apply the Hölder property. (A minimal kernel density estimator sketch is given after this problem.)
3. [4 pts] (High dimensional setting) In words, briefly explain how you can extend both the upper and lower bounds for density estimation in $d$ dimensions. The $d$ dimensional second-order Hölder class, defined below, consists of functions whose partial derivatives are Lipschitz.
$$ H(2, L, B) = \left\{ f : [0, 1]^d \to [0, B] \;;\; \frac{\partial f}{\partial x_i} \text{ is } L\text{-Lipschitz for all } i \in [d] \right\} $$
You can focus only on the key differences. A detailed proof is not necessary.
4. [4 pts] (Lipschitz second derivatives) In words, briefly explain how you can extend both the upper and lower bounds if the densities belonged to the third order Hölder class in one dimension, defined below:
$$ H(3, L, B) = \big\{ f : [0, 1] \to [0, B] \;;\; |f''(x_1) - f''(x_2)| \leq L |x_1 - x_2| \text{ for all } x_1, x_2 \in [0, 1] \big\} $$
Please focus only on the key differences. A detailed proof is not necessary.
Hint: For the upper bound, if you choose to use a kernel density estimator, you may consider a kernel of the form $K(u) = \mathbf{1}(|u| \leq 1/2)\,(\alpha - \beta u^2)$ for appropriately chosen $\alpha, \beta$.
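As a companion to the hint in part 2, here is a minimal sketch of a kernel density estimator $\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\big(\frac{x - X_i}{h}\big)$. The box kernel, the bandwidth scaling $h \propto n^{-1/5}$, and the Beta-distributed data are illustrative assumptions only; choosing and analyzing the kernel and bandwidth is the content of the exercise.

```python
import numpy as np

def box_kernel(u):
    """Box kernel K(u) = 1(|u| <= 1/2); integrates to 1."""
    return (np.abs(u) <= 0.5).astype(float)

def kde(x_grid, samples, bandwidth, kernel=box_kernel):
    """Kernel density estimate p_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    n = len(samples)
    u = (x_grid[:, None] - samples[None, :]) / bandwidth
    return kernel(u).sum(axis=1) / (n * bandwidth)

# Illustrative use: n samples from an (unknown) density supported on [0, 1].
rng = np.random.default_rng(0)
n = 2000
samples = rng.beta(2.0, 3.0, size=n)   # stand-in for X_1, ..., X_n ~ p
h = 0.5 * n ** (-1 / 5)                # illustrative bandwidth scaling, h ~ n^(-1/5)
x_grid = np.linspace(0.0, 1.0, 101)
p_hat = kde(x_grid, samples, h)
```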
3. Lower bounds on the excess risk for prediction problems
In this question, we will develop a framework for lower bounding the excess risk of prediction problems. We will use this to establish a lower bound on the estimation error for binary classification in a VC class.
Let $\mathcal{Z}$ be a data space and $\mathcal{H}$ be a hypothesis space. Let $f : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$ be the instance loss, where $f(h, Z)$ is the loss of hypothesis $h$ on instance $Z$. Let $F(h, P) = \mathbb{E}_{Z \sim P}[f(h, Z)]$ be the population loss of hypothesis $h$ on distribution $P$, and let $L(h, P) = F(h, P) - \inf_{h' \in \mathcal{H}} F(h', P)$ denote the excess population loss. Let $\hat{h}$ be an estimator which maps a dataset to a hypothesis in $\mathcal{H}$. The risk of the estimator is
$$ R(\hat{h}, P) = \mathbb{E}\big[ L(\hat{h}, P) \big] = \mathbb{E}\big[ F(\hat{h}, P) \big] - \inf_{h \in \mathcal{H}} F(h, P). $$
Here, the expectation is taken with respect to the data. The minimax risk is $R^* = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} R(\hat{h}, P)$.
Example: In classification, $f(h, (X, Y)) = \mathbf{1}(h(X) \neq Y)$ is the 0-1 loss, and $F(h, P)$ is usually called the risk of hypothesis $h$. The infimum $\inf_h F(h, P)$ is attained by the Bayes optimal classifier, and $L(h, P)$ is the excess risk of hypothesis $h$. This framework can be used to lower bound the expected excess risk $R(\hat{h}, P)$ of classification (and regression) problems. When $\hat{h}$ chooses a hypothesis in some hypothesis class $\mathcal{H}$, then $R(\hat{h}, P)$ is the estimation error.
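Side remark (a standard fact, stated here only for intuition and not needed for the problems below): when the infimum is over all measurable classifiers, it is attained by the Bayes classifier $h^*(x) = \mathbf{1}(\eta(x) \geq 1/2)$, where $\eta(x) = P(Y = 1 \mid X = x)$, and the excess risk of any classifier $h$ can be written as
$$ F(h, P) - F(h^*, P) = \mathbb{E}_X\big[\, |2\eta(X) - 1| \; \mathbf{1}\big(h(X) \neq h^*(X)\big) \,\big]. $$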
1. [6 pts] (Reduction to testing) For two distributions $P, Q$, we define the separation $\Delta(P, Q)$ as,
$$ \Delta(P, Q) = \sup\Big\{ \delta \geq 0 \;;\; L(h, P) \leq \delta \implies L(h, Q) \geq \delta \;\; \forall h \in \mathcal{H}, \;\text{ and }\; L(h, Q) \leq \delta \implies L(h, P) \geq \delta \;\; \forall h \in \mathcal{H} \Big\}. $$
A dataset $S$ is drawn from some distribution $P \in \mathcal{P}$. Let $\{P_1, \ldots, P_N\} \subset \mathcal{P}$ such that $\Delta(P_j, P_k) \geq \delta$ for all $j \neq k$. Let $\Psi$ be any function which maps $S$ to $[N]$. Show that,
$$ R^* \;\geq\; \delta \, \inf_{\Psi} \max_{j \in [N]} P_j(\Psi \neq j). $$
We can establish the following statements from the above result when $S$ consists of $n$ i.i.d data points. You do not need to prove them for the homework, but are encouraged to verify that they are true.
Le Cam's method: Let $\{P_0, P_1\} \subset \mathcal{P}$ such that $\Delta(P_0, P_1) \geq \delta$ and $\mathrm{KL}(P_0, P_1) \leq \log(2)/n$. Then, $R^* \geq \delta/8$.
Local Fano method: Let $\{P_1, \ldots, P_N\} \subset \mathcal{P}$ such that $\Delta(P_j, P_k) \geq \delta$ and $\mathrm{KL}(P_j, P_k) \leq \log(N)/(4n)$ for all $j \neq k$. Suppose $N \geq 16$. Then, $R^* \geq \delta/2$.
You may use these results when solving the problems below.
2. (One-sided threshold classifiers) Consider a binary classification problem with input in $\mathcal{X} = [0, 1]$ and label in $\{0, 1\}$. We observe $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ drawn i.i.d from some distribution $P \in \mathcal{P}$, where $\mathcal{P}$ consists of distributions whose marginal $p(x)$ is the uniform distribution on $[0, 1]$.
Let $\mathcal{H} = \{h_t(\cdot) = \mathbf{1}(\cdot \geq t);\; t \in [0, 1]\}$ be the class of one-sided threshold classifiers. For any $h \in \mathcal{H}$, let $f(h, (X, Y)) = \mathbf{1}(h(X) \neq Y)$ be the 0-1 loss and let $F(h, P) = \mathbb{E}_{X,Y \sim P}[\mathbf{1}(h(X) \neq Y)]$.
(a) [8 pts] Using Le Cam's method, show that for any estimator $\hat{h}$ which maps the dataset to a hypothesis in $\mathcal{H}$, there exists some distribution $P \in \mathcal{P}$ such that
$$ \mathbb{E}\big[ F(\hat{h}, P) \big] \;\geq\; \inf_{h \in \mathcal{H}} F(h, P) + \Omega\!\left( \sqrt{\frac{1}{n}} \right). $$
(b) [2 pts] Note that we have not assumed that $h(\cdot) = P(Y = 1 \mid X = \cdot)$ belongs to $\mathcal{H}$ for all $P \in \mathcal{P}$. We have, however, assumed that the estimator $\hat{h}$ always chooses some hypothesis in $\mathcal{H}$. Briefly explain why this assumption is necessary and where the proof breaks without it.
3. [15 pts] (Classification in a VC class) Let $\mathcal{X}$ be a given input space and let $\mathcal{P}$ be all distributions supported on $\mathcal{X} \times \{0, 1\}$. We observe $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ drawn i.i.d from some distribution $P \in \mathcal{P}$. Let $\mathcal{H} \subset \{h : \mathcal{X} \to \{0, 1\}\}$ be a hypothesis class with finite VC dimension $d$. For any $h \in \mathcal{H}$, let $f(h, (X, Y)) = \mathbf{1}(h(X) \neq Y)$ be the 0-1 loss and let $F(h, P) = \mathbb{E}_{X,Y \sim P}[\mathbf{1}(h(X) \neq Y)]$.
In Homework 1, you showed the following upper bound on the estimation error of the ERM estimator $\hat{h}_{\mathrm{ERM}}$,
$$ \mathbb{E}\big[ F(\hat{h}_{\mathrm{ERM}}, P) \big] \;\leq\; \inf_{h \in \mathcal{H}} F(h, P) + O\!\left( \sqrt{\frac{d \log(n)}{n}} \right). $$
Here, the expectation is with respect to the dataset $S$. Using the local Fano method, show that this rate is essentially unimprovable. That is, show that for any estimator $\hat{h}$ which maps the dataset to a hypothesis in $\mathcal{H}$, there exists some distribution $P \in \mathcal{P}$ such that, for sufficiently large $d$,
$$ \mathbb{E}\big[ F(\hat{h}, P) \big] \;\geq\; \inf_{h \in \mathcal{H}} F(h, P) + \Omega\!\left( \sqrt{\frac{d}{n}} \right). $$
4. Explore-then-commit for K-armed bandits
In this question, we will upper and lower bound the regret for the explore-then-commit algorithm, described below.
Algorithm 1 Explore-then-Commit
Given: time horizon $T$, number of exploration rounds $m$ ($< T/K$)
Pull each arm $i \in [K]$ $m$ times in the first $mK$ rounds.
Set $A = \operatorname{argmax}_{i \in [K]} \frac{1}{m} \sum_{t=1}^{mK} \mathbf{1}(A_t = i) X_t$, where $A_t$ denotes the arm pulled in round $t$ and $X_t$ the reward observed.
Pull arm $A$ for the remaining $T - mK$ rounds.
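To make the pseudocode concrete, here is a minimal simulation sketch of Explore-then-Commit on a Gaussian bandit. The Gaussian rewards, the specific arm means, the tie-breaking via np.argmax, and the particular scaling of $m$ are illustrative assumptions and not part of the problem.

```python
import numpy as np

def explore_then_commit(means, T, m, rng):
    """Run ETC on a K-armed Gaussian bandit with unit-variance arms.

    Explore: pull each arm m times. Commit: pull the arm with the highest
    empirical mean for the remaining T - m*K rounds. Returns one random
    realization of T * max(means) minus the total collected reward.
    """
    K = len(means)
    rewards = []
    # Exploration phase: m pulls of each arm, mK rounds in total.
    samples = rng.normal(loc=np.repeat(means, m), scale=1.0)
    rewards.extend(samples)
    # Commit to the arm with the best empirical mean (ties broken by argmax).
    empirical_means = samples.reshape(K, m).mean(axis=1)
    A = int(np.argmax(empirical_means))
    # Exploitation phase: pull arm A for the remaining rounds.
    rewards.extend(rng.normal(loc=means[A], scale=1.0, size=T - m * K))
    return T * max(means) - float(np.sum(rewards))

rng = np.random.default_rng(0)
means = [0.9, 0.5, 0.5]              # illustrative arm means in [0, 1]
T, K = 10_000, len(means)
m = int((T / K) ** (2 / 3))          # one illustrative scaling of m with T and K
print(explore_then_commit(means, T, m, rng))
```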
Let $\nu = \{\nu_i\}_{i \in [K]}$ be a sub-Gaussian $K$-armed bandit model, i.e., each $\nu_i$ is a $\sigma$-sub-Gaussian distribution. Let $\mu_i = \mathbb{E}_{X \sim \nu_i}[X]$ denote the mean of arm $i$, $\mu^* = \max_{j \in [K]} \mu_j$ be the highest mean, and let $\Delta_i = \mu^* - \mu_i$ be the gap between the optimal arm and the $i$th arm. Assume, without loss of generality, that $\mu_i \in [0, 1]$ for all $i$. Let $R_T(m, \nu)$ denote the regret when we execute Algorithm 1 on $\nu$ with $m$ exploration rounds,
$$ R_T(m, \nu) = T\mu^* - \mathbb{E}\left[ \sum_{t=1}^{T} X_t \right]. $$
1. [5 pts] (Gap-dependent bound) Show that there exist global constants $C_1, C_2$ such that
$$ R_T(m, \nu) \;\leq\; m \sum_{i : \Delta_i > 0} \Delta_i \;+\; C_1 (T - mK) \sum_{i : \Delta_i > 0} \Delta_i \exp\!\left( - \frac{m \Delta_i^2}{C_2 \sigma^2} \right). $$
2. [6 pts] (Gap-independent bound) Let $\mathcal{P}$ denote the class of all sub-Gaussian bandits whose means are bounded between 0 and 1. Show that for a suitable choice of $m$, say $m^*$ (possibly dependent on $T$ and $K$), we have
$$ \sup_{\nu \in \mathcal{P}} R_T(m^*, \nu) \;\in\; O\!\left(K^{1/3} T^{2/3}\right). $$
3. [10 pts] (Lower bound) Show that the result in part 2 cannot be improved (say, via a tighter upper bound analysis) for the explore-then-commit algorithm. That is, show
$$ \inf_{m \in \mathbb{N}} \sup_{\nu \in \mathcal{P}} R_T(m, \nu) \;\in\; \Omega\!\left(K^{1/3} T^{2/3}\right). $$
Hint: One approach is to adopt a similar technique to the proof of the general lower bound for K-armed bandits, but adapt it to the structure of the explore-then-commit algorithm. Your alternatives will need to depend on the specific choice of $m$ to get a tight lower bound. To do so, you should carefully consider the failure cases if $m$ is picked to be too large or too small.