Machine Learning Theory (CSC 482A/581A)
Problem Set 2
Instructions:
- You must write up your solutions individually.
- You may have high-level discussions with up to 2 other students registered in the course. If you discuss problems with others, include at the top of your submission: their names, V#s, and the problems discussed.
- Your solutions should be clear and concise (and legible!). You are strongly encouraged to type up your solutions in LaTeX. For any problems where you only have a partial solution, be clear about any parts of your solution for which you have low confidence.
- If submitted through conneX, the due date is Friday, February 23rd, 7pm. This is a hard deadline. If submitted in writing, submit at the beginning of class (12:30pm) Friday, February 23rd. You are thus incentivized to submit online (using LaTeX) as you get more time!
Questions:
1. Let $X = \mathbb{R}^2$ and take $F$ to be the set of all convex polygons; the classifier corresponding to a convex polygon labels as positive all points inside the polygon (including the boundary) and labels all other points as negative. Prove that $\mathrm{VCdim}(F) = \infty$.
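(The usual construction for this problem places the points on a circle, so that every subset of them is exactly the vertex set of some convex polygon. The short numpy script below is only a numerical sanity check of that geometric fact for a handful of points; it is not a proof, and the helper name inside_convex_polygon is an arbitrary choice of ours.)

import itertools
import numpy as np

# Sanity check (not a proof): points on the unit circle are in convex position,
# so for every subset S the convex polygon with vertex set S contains exactly
# the points of S and none of the others.
def inside_convex_polygon(p, verts, eps=1e-9):
    """True if p is inside (or on) the convex polygon whose vertices `verts`
    are given in counter-clockwise order."""
    for a, b in zip(verts, np.roll(verts, -1, axis=0)):
        edge, to_p = b - a, p - a
        if edge[0] * to_p[1] - edge[1] * to_p[0] < -eps:  # p strictly on the wrong side
            return False
    return True

n = 7
angles = 2 * np.pi * np.arange(n) / n
pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)

for k in range(3, n + 1):                           # subsets of size < 3 are handled
    for S in itertools.combinations(range(n), k):   # by degenerate / tiny polygons
        verts = pts[list(S)]                        # already in counter-clockwise order
        labels = [inside_convex_polygon(p, verts) for p in pts]
        assert labels == [i in S for i in range(n)]
print("all subsets of size >= 3 of", n, "circle points are realized by convex polygons")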
2. Let $F$ be the class of linear separators in $d$ dimensions, so that
$$F = \{ f_{w,b} : w \in \mathbb{R}^d,\ b \in \mathbb{R} \}$$
with $f_{w,b}(x) = \mathbf{1}[\langle w, x \rangle + b \geq 0]$.
(a) Prove that $\mathrm{VCdim}(F) \geq d + 1$.
(b) Radon's Theorem states that any set of $d + 2$ points in $\mathbb{R}^d$ can be partitioned into two sets $A$ and $B$ such that the convex hulls of $A$ and $B$ intersect. Using Radon's Theorem, prove that $\mathrm{VCdim}(F) \leq d + 1$ (and hence, combined with part (a), we may conclude that $\mathrm{VCdim}(F) = d + 1$).
(c) Next, prove Radon's Theorem. Any valid proof is allowed. Here is the start of one potential proof. Recall from linear algebra that any $d + 1$ points $x_1, \dots, x_{d+1} \in \mathbb{R}^d$ must be linearly dependent, i.e., there exists a vector $\lambda \in \mathbb{R}^{d+1}$ not equal to the zero vector such that
$$\sum_{j=1}^{d+1} \lambda_j x_j = 0.$$
The hint is to first prove that any set of $d + 2$ points $x_1, \dots, x_{d+2} \in \mathbb{R}^d$ must be affinely dependent, meaning that there exists a vector $\lambda \in \mathbb{R}^{d+2}$ not equal to the zero vector such that
$$\sum_{j=1}^{d+2} \lambda_j x_j = 0 \quad \text{and} \quad \sum_{j=1}^{d+2} \lambda_j = 0.$$
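(Illustrative sketch only, not a substitute for the proof: the affine dependence in the hint can be found numerically as a null-space vector of the $(d+1) \times (d+2)$ matrix whose columns are $(x_j, 1)$, and splitting the indices by the sign of $\lambda_j$ then yields a Radon partition. The numpy snippet below, with the function name radon_partition of our own choosing, makes this concrete.)

import numpy as np

def radon_partition(X):
    """X: (d+2, d) array of points in R^d.  Returns index sets A, B whose convex
    hulls intersect, plus a common point, following the affine-dependence argument:
    find a nonzero lambda with sum_j lambda_j x_j = 0 and sum_j lambda_j = 0,
    then split the indices by the sign of lambda_j."""
    m, d = X.shape                        # m = d + 2
    M = np.vstack([X.T, np.ones(m)])      # (d+1) x (d+2), so its null space is nontrivial
    _, _, Vt = np.linalg.svd(M)
    lam = Vt[-1]                          # a (nonzero) null-space vector of M
    A = np.where(lam > 0)[0]
    B = np.where(lam <= 0)[0]
    s = lam[A].sum()                      # positive weights sum to minus the negative ones
    common = (lam[A] / s) @ X[A]          # convex combination of A's points ...
    # ... which equals a convex combination of B's points, so it lies in both hulls
    return A, B, common

rng = np.random.default_rng(0)
d = 3
X = rng.normal(size=(d + 2, d))
A, B, p = radon_partition(X)
print("A =", A, " B =", B, " common point =", p)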
3. Suppose that $P$ is a probability distribution over $\mathbb{R}^d$, and let the training sample $X_1, \dots, X_n, X_{n+1}$ be i.i.d. samples with distribution $P$. We say that $(X_1, \dots, X_n)$ is $s$-sparse if
$$\|X_j\|_0 \leq s \quad \text{for all } j \in [n],$$
where, for any vector $x$, the $\ell_0$ norm $\|x\|_0$ is defined as the number of non-zero components of $x$.
Let $\hat{s}$ be the minimum value of $s \in \{0, 1, \dots, d\}$ such that $(X_1, \dots, X_n)$ is $s$-sparse.
(a) Derive an upper bound (which holds with high probability over $X_1, \dots, X_n$) on the probability that $X_{n+1}$ is not $\hat{s}$-sparse. Specifically, your bound should be of the form: with probability at least $1 - \delta$,
$$\Pr\left( \|X_{n+1}\|_0 > \hat{s} \right) = O\!\left( \frac{\log \frac{1}{\delta}}{n} \right).$$
The bound can also depend on the dimension, but the rate with respect to $n$ cannot be worse than $O\!\left(\frac{1}{n}\right)$ (so $O\!\left(\frac{\log n}{n}\right)$ is not allowed).
(b) If your upper bound from part (a) depended on the dimension $d$, it degrades severely as $d \to \infty$. Derive an upper bound that is dimension-free. Unlike part (a), the rate with respect to $n$ now can be $O\!\left(\frac{\log n}{n}\right)$.
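(Illustrative sketch only, not a solution: the quantity $\hat{s}$ and the event $\|X_{n+1}\|_0 > \hat{s}$ can be simulated directly. In the snippet below the sparse distribution $P$, with each coordinate nonzero independently with probability 0.1, is an arbitrary choice of ours.)

import numpy as np

# Monte Carlo illustration: estimate Pr(||X_{n+1}||_0 > s_hat) for one
# particular sparse distribution P of our own choosing.
rng = np.random.default_rng(1)
d, n, trials = 20, 200, 2000

def sample(size):
    """Draw `size` vectors in R^d; each coordinate is nonzero independently
    with probability 0.1, with a standard normal value when nonzero."""
    mask = rng.random((size, d)) < 0.1
    return mask * rng.normal(size=(size, d))

exceed = 0
for _ in range(trials):
    X = sample(n + 1)
    s_hat = np.count_nonzero(X[:n], axis=1).max()   # minimal s making X_1..X_n s-sparse
    exceed += np.count_nonzero(X[n]) > s_hat        # did the fresh point exceed it?
print(f"estimated Pr(||X_(n+1)||_0 > s_hat) ~ {exceed / trials:.4f}  (n = {n})")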
4. Suppose that $P$ is a probability distribution over the unit Euclidean ball in $\mathbb{R}^d$, and let $X_1, \dots, X_n$ be i.i.d. samples with distribution $P$.
Using tools from class, prove that the average distance over all pairs of the $n$ points is tightly concentrated around its expectation. That is, show that
$$\binom{n}{2}^{-1} \sum_{1 \leq i < j \leq n} \|X_i - X_j\|_2$$
is tightly concentrated around $\mathbb{E}_{X, Y \sim P} \|X - Y\|_2$. Specifically, you should show that with probability at least $1 - \delta$,
$$\left| \binom{n}{2}^{-1} \sum_{1 \leq i < j \leq n} \|X_i - X_j\|_2 \;-\; \mathbb{E}_{X, Y \sim P} \|X - Y\|_2 \right| = O\!\left( \sqrt{ \frac{\log \frac{1}{\delta}}{n} } \right).$$
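(Illustrative sketch only, not the requested proof: a quick simulation shows the empirical spread of the average pairwise distance shrinking roughly like $1/\sqrt{n}$, consistent with the rate above. The distribution, here uniform on the unit ball, and the helper names are arbitrary choices of ours.)

import numpy as np

rng = np.random.default_rng(2)
d = 5

def sample_ball(n):
    """n i.i.d. points uniform in the unit Euclidean ball in R^d (one choice of P)."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.random(n) ** (1 / d)                    # radius with density proportional to r^(d-1)
    return g * r[:, None]

def avg_pairwise_dist(X):
    """Average of ||X_i - X_j||_2 over all pairs i < j."""
    diffs = X[:, None, :] - X[None, :, :]
    D = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return D[iu].mean()

for n in [50, 200, 800]:
    vals = np.array([avg_pairwise_dist(sample_ball(n)) for _ in range(200)])
    print(f"n={n:4d}  mean={vals.mean():.4f}  std={vals.std():.4f}")
# The standard deviation should shrink roughly like 1/sqrt(n), matching the
# O(sqrt(log(1/delta)/n)) concentration the problem asks you to prove.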