
CS/ECE/STAT-861: Theoretical Foundations of Machine Learning

University of Wisconsin-Madison, Fall 2023

Homework 1. Due 10/06/2023, 11:00 am

Instructions:

  1. Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.

  2. I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting hand-written homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.

  3. You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.

  4. Unless otherwise specified, you may use any result we have already proved in class. You do not need to prove them from scratch, but clearly state which result you are using.

  5. Solutions to some of the problems may be found as examples or exercises in the suggested textbooks or other resources. You are encouraged to try the problems on your own first before searching for the solution. If you find an existing solution, first read and understand the proof, and then write it in your own words. Please indicate any references you have used at the beginning of your solution when you turn in your homework.

  6. Collaboration: You are allowed to collaborate in groups of size up to 3 on each problem. If you do so, please indicate your collaborators at the beginning of your solution.

  1. PAC Learning and Empirical Risk Minimization

    1. [4 pts] (What is wrong with this proof?) We perform empirical risk minimization (ERM) in a finite hypothesis class $\mathcal{H}$ using an i.i.d. dataset $S$ of $n$ points. Let $h^\star \in \operatorname{argmin}_{h \in \mathcal{H}} R(h)$ be an optimal classifier in the class, and let $\hat{h} \in \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}(h)$ minimize the empirical risk on the dataset $S$. A student offers the following proof and claims that it is possible to bound the estimation error without any dependence on $|\mathcal{H}|$.

      (i) Let $B_1 = \{\hat{R}(h^\star) - R(h^\star) > \epsilon\}$ denote the bad event that the empirical risk of $h^\star$ is larger than its true risk. By Hoeffding's inequality we have $P(B_1) \leq e^{-2n\epsilon^2}$.

      (ii) Similarly, let $B_2 = \{R(\hat{h}) - \hat{R}(\hat{h}) > \epsilon\}$ denote the bad event that the empirical risk of $\hat{h}$ is smaller than its true risk. By Hoeffding's inequality we have $P(B_2) \leq e^{-2n\epsilon^2}$. (correction: This previously said $2e^{-2n\epsilon^2}$. Thanks to Zhihao for pointing this out. KK)

      As $\hat{R}(\hat{h}) \leq \hat{R}(h^\star)$, we have
      $$ R(\hat{h}) - R(h^\star) \;\leq\; \big(R(\hat{h}) - \hat{R}(\hat{h})\big) + \big(\hat{R}(h^\star) - R(h^\star)\big) \;\leq\; 2\epsilon $$
      under the good event $G = B_1^c \cap B_2^c$, which holds with probability at least $1 - 2e^{-2n\epsilon^2}$. This result does not depend on $|\mathcal{H}|$ and even applies to infinite hypothesis classes, provided there exists an $h^\star$ which minimizes the risk.

      Which sentence below best describes the mistake (if any) with this proof? State your answer with an explanation. If you believe there is a mistake, be as specific as possible as to what the mistake is. (A small simulation sketch of this setup appears after the answer choices below.)

        1. Both statement (i) and statement (ii) are incorrect.

        2. Only statement (i) is incorrect. Statement (ii) is correct.

        3. Only statement (ii) is incorrect. Statement (i) is correct.

        4. Both statements are correct. There is nothing wrong with this proof.
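The following simulation is only a sanity-check aid for this part, not part of the required answer. It uses a hypothetical setup (a finite class of 21 threshold classifiers on $\mathbb{R}$, uniformly distributed inputs with 10% label noise, $n = 50$, $\epsilon = 0.1$) to estimate the probabilities of the two bad events $B_1$ and $B_2$ above and print them next to the Hoeffding bound $e^{-2n\epsilon^2}$; all concrete choices are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical setup: X ~ Uniform[0, 1], true label = 1(x >= 0.5) with 10% label noise,
# and a finite class of threshold classifiers h_t(x) = 1(x >= t).
rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 21)          # |H| = 21
n, eps, n_trials = 50, 0.1, 2000

def true_risk(t):
    # Risk of h_t under the model above, computed analytically for this specific setup:
    # P(h_t(X) != Y) = 0.1 + 0.8 * |t - 0.5|.
    return 0.1 + 0.8 * abs(t - 0.5)

R = np.array([true_risk(t) for t in thresholds])
h_star = np.argmin(R)                            # fixed, data-independent risk minimizer

b1_count = b2_count = 0
for _ in range(n_trials):
    x = rng.uniform(0.0, 1.0, size=n)
    y = (x >= 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1             # 10% label noise
    y = np.where(flip, 1 - y, y)

    # Empirical risks of every classifier in H on this sample.
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    R_hat = (preds != y[None, :]).mean(axis=1)
    h_hat = np.argmin(R_hat)                     # ERM: chosen AFTER seeing the data

    b1_count += (R_hat[h_star] - R[h_star] > eps)   # bad event in statement (i)
    b2_count += (R[h_hat] - R_hat[h_hat] > eps)     # bad event in statement (ii)

hoeffding = np.exp(-2 * n * eps**2)
print(f"Hoeffding bound exp(-2 n eps^2) = {hoeffding:.4f}")
print(f"Estimated P(B1) = {b1_count / n_trials:.4f}")
print(f"Estimated P(B2) = {b2_count / n_trials:.4f}")
```

Comparing the two estimated probabilities against the bound may help you decide which answer choice to defend.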

    2. [6 pts] (PAC bound) Prove the following result, which was presented but not proved in class.

      Let $\mathcal{H}$ be a hypothesis class with finite $\mathrm{Rad}_n(\mathcal{H})$. Let $\hat{h}$ be obtained via ERM using $n$ i.i.d. samples. Let $\epsilon > 0$. Then, there exist universal constants $C_1, C_2$ such that with probability at least $1 - 2e^{-2n\epsilon^2}$, we have
      $$ R(\hat{h}) \;\leq\; \inf_{h \in \mathcal{H}} R(h) + C_1 \mathrm{Rad}_n(\mathcal{H}) + C_2 \epsilon. $$

    3. [3 pts] (Sample complexity based on VC dimension) Say $\mathcal{H}$ has a finite VC dimension $d$. Let $\delta \in (0, 1)$. Using the result/proof in part 2 or otherwise, show that there exist universal constants $C_3, C_4$ such that when $n \geq d$, the following bound holds with probability at least $1 - \delta$:
      $$ R(\hat{h}) \;\leq\; \inf_{h \in \mathcal{H}} R(h) + C_3 \sqrt{\frac{d \log(n/d) + d}{n}} + C_4 \sqrt{\frac{1}{n} \log\frac{2}{\delta}}. $$

    4. [3 pts] (Bound on the expected risk) The above results show that $R(\hat{h})$ is small with high probability. Using the results/proofs in parts 2 and 3 or otherwise, show that it is also small in expectation. Specifically, show that there exist universal constants $C_5, C_6$ such that the following bound holds:
      $$ \mathbb{E}[R(\hat{h})] \;\leq\; \inf_{h \in \mathcal{H}} R(h) + C_5 \sqrt{\frac{d \log(n/d) + d}{n}} + C_6 \sqrt{\frac{\log(4n)}{n}} + \frac{1}{n}. $$
      Here, the expectation is with respect to the dataset $S$.

      For parts 2, 3, and 4 of this question, if you can prove a bound that has similar higher-order terms but differs in additive/multiplicative constants or poly-logarithmic factors, you will still receive full credit.

  2. Rademacher Complexity & VC dimension

    1. [5 pts] (Empirical Rademacher complexity) Consider a binary classification problem with the 0-1 loss $\ell(y_1, y_2) = \mathbf{1}(y_1 \neq y_2)$ and where $\mathcal{X} = \mathbb{R}$. Consider the following dataset $S = \{(x_1 = 0, y_1 = 0),\ (x_2 = 1, y_2 = 1)\}$. (A brute-force sanity-check sketch follows after the three parts below.)

      1. Let $\mathcal{H}_1 = \{h_a(x) = \mathbf{1}(x \geq a);\ a \in \mathbb{R}\}$ be the hypothesis class of one-sided threshold functions. Compute the empirical Rademacher complexity $\mathrm{Rad}(S, \mathcal{H}_1)$.

      2. Let $\mathcal{H}_2 = \{h_a(x) = \mathbf{1}(x \geq a);\ a \in \mathbb{R}\} \cup \{h_a(x) = \mathbf{1}(x \leq a);\ a \in \mathbb{R}\}$ be the class of two-sided threshold functions. Compute the empirical Rademacher complexity $\mathrm{Rad}(S, \mathcal{H}_2)$.

      3. Are the values computed above consistent with the fact that $\mathcal{H}_1 \subseteq \mathcal{H}_2$?
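As a sanity check on your hand computation, the brute-force sketch below enumerates all sign vectors $\sigma \in \{-1, +1\}^2$ and all prediction patterns achievable on $S$ by $\mathcal{H}_1$ and $\mathcal{H}_2$. It assumes the convention $\mathrm{Rad}(S, \mathcal{H}) = \mathbb{E}_\sigma\big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i} \sigma_i h(x_i)\big]$ with $h(x_i) \in \{0, 1\}$; if your course uses a different convention (e.g., the Rademacher complexity of the loss class, or $\pm 1$-valued predictions), adjust the pattern sets and the inner sum accordingly.

```python
from itertools import product
import numpy as np

# Data points from the problem.
xs = [0.0, 1.0]
n = len(xs)

# Achievable prediction patterns on (x1, x2). For one- and two-sided thresholds on R,
# sweeping the threshold a over a fine grid (covering both sides of the data) recovers every pattern.
grid = np.linspace(-2.0, 3.0, 501)

def patterns_H1():
    # h_a(x) = 1(x >= a): one-sided thresholds (assumed direction).
    return {tuple(int(x >= a) for x in xs) for a in grid}

def patterns_H2():
    # Two-sided thresholds: 1(x >= a) together with 1(x <= a).
    return patterns_H1() | {tuple(int(x <= a) for x in xs) for a in grid}

def empirical_rademacher(patterns):
    # Rad(S, H) = E_sigma [ sup_h (1/n) sum_i sigma_i * h(x_i) ],
    # averaging over all 2^n sign vectors (assumed convention: h(x_i) in {0, 1}).
    total = 0.0
    for sigma in product([-1, 1], repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, pat)) / n for pat in patterns)
    return total / 2**n

print("Rad(S, H1) ~", empirical_rademacher(patterns_H1()))
print("Rad(S, H2) ~", empirical_rademacher(patterns_H2()))
```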

    2. [6 pts] (VC dimension of linear classifiers) Consider a binary classification problem where $\mathcal{X} = \mathbb{R}^D$ is the $D$-dimensional Euclidean space. The class of linear classifiers is given by $\mathcal{H} = \{h_{w,b}(x) = \mathbf{1}[w^\top x + b \geq 0];\ w \in \mathbb{R}^D,\ b \in \mathbb{R}\}$. Prove that the VC dimension of this class is $d_{\mathcal{H}} = D + 1$. (correction: Previously this said $\mathcal{H} = \{h_{w,b}(x) = w^\top x + b \geq 0;\ w \in \mathbb{R}^d, b \in \mathbb{R}\}$. KK)
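For the lower-bound direction, it can help to test whether a candidate set of $D + 1$ points is shattered before writing the argument. The sketch below is only an exploratory aid: it assumes a hypothetical candidate set (the origin plus the standard basis vectors of $\mathbb{R}^D$) and checks realizability of every labeling by solving a small feasibility linear program with scipy. It does not replace the proof of either direction.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

D = 3                                              # hypothetical dimension for the check
points = np.vstack([np.zeros(D), np.eye(D)])       # candidate set: origin + standard basis (D+1 points)

def realizable(labels, pts):
    """Check via an LP whether some h_{w,b}(x) = 1[w^T x + b >= 0] produces `labels` on `pts`."""
    dim = pts.shape[1]
    A_ub, b_ub = [], []
    for x, y in zip(pts, labels):
        if y == 1:                                  # need w.x + b >= 0, i.e. -(w.x + b) <= 0
            A_ub.append(np.append(-x, -1.0)); b_ub.append(0.0)
        else:                                       # need w.x + b < 0; use <= -1 (valid by rescaling)
            A_ub.append(np.append(x, 1.0)); b_ub.append(-1.0)
    res = linprog(c=np.zeros(dim + 1), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (dim + 1))
    return res.success

shattered = all(realizable(labels, points) for labels in product([0, 1], repeat=len(points)))
print(f"{len(points)} candidate points shattered in R^{D}:", shattered)
```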

    3. (Interval classifiers) Let $\mathcal{X} = \mathbb{R}$. Consider the class of interval classifiers, given by
      $$ \mathcal{H} = \{h_{a,b}(x) = \mathbf{1}(a \leq x \leq b);\ a, b \in \mathbb{R},\ a \leq b\}. $$

      1. [4 pts] What is the VC dimension $d$ of this class?

      2. [8 pts] Show that Sauer's lemma is tight for this class. That is, for all $n$, show that $g(n, \mathcal{H}) = \sum_{i=0}^{d} \binom{n}{i}$.

    4. (Union of interval classifiers) Let $\mathcal{X} = \mathbb{R}$. Consider the class of unions of $K$ interval classifiers, given by
      $$ \mathcal{H} = \{h_{a,b}(x) = \mathbf{1}(\exists\, k \in \{1, \dots, K\} \text{ s.t. } a_k \leq x \leq b_k);\ a, b \in \mathbb{R}^K,\ a_k \leq b_k\ \forall k\}. $$

      1. [4 pts] What is the VC dimension $d$ of this class?

      2. [8 pts] Show that Sauer's lemma is tight for this class. That is, for all $n$, show that $g(n, \mathcal{H}) = \sum_{i=0}^{d} \binom{n}{i}$. (A brute-force counting sketch follows after the hint below.)

      Hint: The following identity from combinatorics, which we used in the proof of Sauer's lemma, may be helpful: for $m > k$,
      $$ \binom{m}{k} = \binom{m-1}{k} + \binom{m-1}{k-1}. $$
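For small values of $n$ and $K$, you can compare a brute-force count of the distinct labelings $g(n, \mathcal{H})$ with the partial sums $\sum_{i \leq j} \binom{n}{i}$ to check your conjectured VC dimension for parts 3 and 4 (setting $K = 1$ recovers the single-interval class). The sketch below assumes that, for points on a line, it suffices to sweep interval endpoints over the sample points, the midpoints between them, and two values outside their range; it is a numerical aid only, not a substitute for the proof.

```python
from itertools import product
from math import comb
import numpy as np

# Hypothetical small instance: n sample points on the real line, unions of K intervals.
n, K = 6, 2
xs = np.arange(n, dtype=float)                    # distinct points 0, 1, ..., n-1

# Candidate endpoints: the points themselves, midpoints between them, and values outside their range.
candidates = np.concatenate([xs, xs[:-1] + 0.5, [-1.0, n + 1.0]])

def label(x, intervals):
    # Union-of-intervals classifier: 1 if x lies in any [a_k, b_k].
    return int(any(a <= x <= b for a, b in intervals))

labelings = set()
for endpoints in product(candidates, repeat=2 * K):
    intervals = [(min(endpoints[2*i], endpoints[2*i+1]), max(endpoints[2*i], endpoints[2*i+1]))
                 for i in range(K)]
    labelings.add(tuple(label(x, intervals) for x in xs))

partial_sums = [sum(comb(n, i) for i in range(j + 1)) for j in range(n + 1)]
print("brute-force g(n, H):", len(labelings))
print("partial sums sum_{i<=j} C(n, i) for j = 0..n:", partial_sums)
```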

    5. [6 pts] (Tightness of Sauer's lemma) Prove the following statement about the tightness of Sauer's lemma when $\mathcal{X} = \mathbb{R}$: For all $d > 0$, there exists a hypothesis class $\mathcal{H} \subseteq \{h : \mathbb{R} \to \{0, 1\}\}$ with VC dimension $d_{\mathcal{H}} = d$ such that, for all dataset sizes $n > 0$, we have $g(n, \mathcal{H}) = \sum_{i=0}^{d} \binom{n}{i}$.

      Keep in mind that the hypothesis class $\mathcal{H}$ should depend on $d$ but not on $n$.

      Hint: One approach will be to use the results from part 4, which will allow you to prove the result for even $d$. You should consider a different hypothesis class to show this for odd $d$.

  3. Relationship between divergences

    Let $P, Q$ be probability distributions with densities $p, q$ respectively. Recall the following divergences we discussed in class.

    KL divergence: $\mathrm{KL}(P, Q) = \int \log\!\frac{p(x)}{q(x)}\, p(x)\, dx$.

    Total variation distance: $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$.

    $L_1$ distance: $\|P - Q\|_1 = \int |p(x) - q(x)|\, dx$.

    Hellinger distance: $H^2(P, Q) = \int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\, dx$.

    Finally, let $P \wedge Q = \int \min(p(x), q(x))\, dx$ denote the affinity between the two distributions. When we have $n$ i.i.d. observations, let $P^n, Q^n$ denote the product distributions.

    (correction: Previously, the definition of the Hellinger distance said $H$ and not $H^2$. Thanks to Yixuan for pointing this out. KK)

    Prove the following statements:

    1. [3 pts] $\mathrm{KL}(P^n, Q^n) = n\, \mathrm{KL}(P, Q)$.

    2. [3 pts] $H^2(P^n, Q^n) = 2 - 2\left(1 - \tfrac{1}{2} H^2(P, Q)\right)^n$.

    3. [3 pts] $\mathrm{TV}(P, Q) = \tfrac{1}{2} \|P - Q\|_1$.

      Hint: Can you relate both sides of the equation to the set $A = \{x;\ p(x) > q(x)\}$?

    4. [3 pts] $\mathrm{TV}(P, Q) = 1 - P \wedge Q$.

    5. [3 pts] $H^2(P, Q) \leq \|P - Q\|_1$.

      Hint: What can you say about $(a - b)^2$ and $|a^2 - b^2|$ when $a, b > 0$?
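Before writing the proofs, it can be useful to check statements 1-5 numerically on a small discrete example, where each integral becomes a finite sum. The sketch below uses two hypothetical distributions on a three-letter alphabet and builds the product distributions $P^n, Q^n$ explicitly; matching printed values are only a sanity check, not a proof.

```python
import itertools
import numpy as np

# Two hypothetical distributions on a 3-element alphabet (strictly positive, so KL is finite).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

def kl(p, q):         return float(np.sum(p * np.log(p / q)))
def l1(p, q):         return float(np.sum(np.abs(p - q)))
def hellinger2(p, q): return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
def affinity(p, q):   return float(np.sum(np.minimum(p, q)))

def tv(p, q):
    # sup over all events A of |P(A) - Q(A)|, by enumerating subsets (small alphabets only).
    m = len(p)
    return max(abs(p[list(A)].sum() - q[list(A)].sum())
               for r in range(m + 1) for A in itertools.combinations(range(m), r))

def product_dist(p, n):
    # Product distribution P^n as a flat vector over all n-tuples of symbols.
    return np.array([np.prod([p[i] for i in idx])
                     for idx in itertools.product(range(len(p)), repeat=n)])

n = 3
pn, qn = product_dist(p, n), product_dist(q, n)

print("1:", kl(pn, qn), "vs", n * kl(p, q))
print("2:", hellinger2(pn, qn), "vs", 2 - 2 * (1 - 0.5 * hellinger2(p, q)) ** n)
print("3:", tv(p, q), "vs", 0.5 * l1(p, q))
print("4:", tv(p, q), "vs", 1 - affinity(p, q))
print("5:", hellinger2(p, q), "<=", l1(p, q))
```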
