- Union Of Intervals. In this question, we will study the hypothesis class of a finite union of disjoint intervals, and the properties of the ERM algorithm for this class.
To review, let the sample space be $X = [0,1]$ and assume we study a binary classification problem, i.e. $Y = \{0,1\}$. We will try to learn using a hypothesis class that consists of $k$ intervals. More explicitly, let $I = \{[l_1,u_1],\dots,[l_k,u_k]\}$ be $k$ disjoint intervals, such that $0 \le l_1 \le u_1 \le l_2 \le u_2 \le \dots \le u_k \le 1$. For each such choice of $k$ disjoint intervals, define the corresponding hypothesis as
$$ h_I(x) = \begin{cases} 1 & \text{if } x \in [l_1,u_1] \cup \dots \cup [l_k,u_k] \\ 0 & \text{otherwise.} \end{cases} $$
Finally, define $H_k$ as the hypothesis class that consists of all hypotheses that correspond to $k$ disjoint intervals:
$$ H_k = \{ h_I \mid I = \{[l_1,u_1],\dots,[l_k,u_k]\},\ 0 \le l_1 \le u_1 \le l_2 \le u_2 \le \dots \le u_k \le 1 \} $$
We note that $H_k \subseteq H_{k+1}$, since we can always take one of the intervals to be of length 0 by setting its endpoints to be equal. We are given a sample of size $m$: $\langle x_1,y_1\rangle,\dots,\langle x_m,y_m\rangle$. Assume that the points are sorted, so that $0 \le x_1 < x_2 < \dots < x_m \le 1$.
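For concreteness, a hypothesis in $H_k$ can be evaluated directly from its list of intervals. Below is a minimal sketch; the function name eval_hypothesis is ours and is not part of the assignment files.

```python
def eval_hypothesis(intervals, x):
    """Evaluate h_I(x): return 1 if x lies inside any of the intervals, else 0.
    `intervals` is a list of (l, u) pairs with 0 <= l <= u <= 1."""
    return 1 if any(l <= x <= u for (l, u) in intervals) else 0

# Example: a hypothesis with two intervals
# eval_hypothesis([(0.1, 0.3), (0.6, 0.9)], 0.25) -> 1
# eval_hypothesis([(0.1, 0.3), (0.6, 0.9)], 0.50) -> 0
```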
Submission Guidelines:
- Download the files skeleton.py and intervals.py from Moodle. You should implement only the missing code in skeleton.py, as specified in the following questions. In every method description, you will find specific details on its input and return values.
- Your code should be written in Python 3.
- Make sure to comment out or remove any code that halts execution, such as matplotlib pop-up windows.
- Your submission should include exactly two files: skeleton.py and intervals.py.
Explanation of intervals.py:
The file intervals.py includes a function that implements an ERM algorithm for $H_k$. Given a sorted list xs = [x_1, ..., x_m], the respective labels ys = [y_1, ..., y_m], and k, the function find_best_interval returns a list of up to k intervals together with their error count on the given sample. These intervals achieve the smallest empirical error count possible over all choices of k intervals or fewer.
Note that in sections (d)-(f) you will need to use this function for large values of m. Execution in these cases could take time (more than 10 minutes per experiment), so plan ahead.
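As a reference, here is a minimal usage sketch. The return convention (a pair of the interval list and the error count) is our reading of the description above, so double-check it against the docstring in intervals.py.

```python
from intervals import find_best_interval

xs = [0.05, 0.15, 0.35, 0.55, 0.75, 0.95]  # sorted sample points
ys = [1, 1, 0, 1, 0, 1]                    # corresponding labels

# Assumed return convention: (list of (l, u) intervals, number of sample errors)
intervals, error_count = find_best_interval(xs, ys, 2)
print(intervals, error_count)
```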
(a) (7 points) Assume that the true distribution $P[x,y] = P[y|x] \cdot P[x]$ is the following: $x$ is distributed uniformly on the interval $[0,1]$, and
$$ P[y=1 \mid x] = \begin{cases} 0.8 & \text{if } x \in [0,0.2] \cup [0.4,0.6] \cup [0.8,1] \\ 0.1 & \text{if } x \in [0.2,0.4] \cup [0.6,0.8] \end{cases} $$
and $P[y=0 \mid x] = 1 - P[y=1 \mid x]$.
Write a function that draws m pairs of (x, y) according to the distribution P (a sampling sketch is given after this list). Use it to draw a sample of size m = 100 and create a plot:
- Plot the points and their labels (keep the y-axis in the range [-0.1, 1.1] for clarity of presentation).
- Mark the lines x = 0.2,0.4,0.6,0.8 clearly on the plot.
- Run the find_best_interval function on your sample with k = 3, and plot the intervals clearly.
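A minimal sketch of the sampling step, assuming the distribution defined above; the function name sample_from_D is ours, so adapt it to the method names required in skeleton.py.

```python
import numpy as np

def sample_from_D(m):
    """Draw m pairs (x, y) from the distribution P of part (a).
    Returns xs sorted in increasing order and the matching labels ys."""
    xs = np.sort(np.random.uniform(0, 1, m))
    # P[y=1|x] is 0.8 on [0,0.2] U [0.4,0.6] U [0.8,1], and 0.1 elsewhere
    p1 = np.where((xs <= 0.2) | ((xs >= 0.4) & (xs <= 0.6)) | (xs >= 0.8), 0.8, 0.1)
    ys = (np.random.uniform(0, 1, m) < p1).astype(int)
    return xs, ys

xs, ys = sample_from_D(100)
```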
(b) (7 points) Note that here, we know the true distribution P, so for every given hypothesis $h \in H_k$ we can calculate error(h) precisely. What is the hypothesis with the smallest error?
(c) (7 points) Write a function that, given a list of intervals, calculates the true error of the corresponding hypothesis (a sketch appears below). Then, for k = 3 and m = 10, 15, 20, ..., 100, perform the following experiment T = 100 times: (i) draw a sample of size m and run the ERM algorithm on it; (ii) calculate the empirical error of the returned hypothesis; (iii) calculate the true error of the returned hypothesis. Plot the average empirical and true errors, averaged across the T runs, as a function of m, and discuss the results. Do the empirical and true errors decrease or increase in m? Why?
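A minimal sketch of the true-error computation, assuming the hypothesis is given as a list of (l, u) pairs; the helper names are ours.

```python
def intersection_length(a, b, c, d):
    """Length of the intersection of [a, b] and [c, d]."""
    return max(0.0, min(b, d) - max(a, c))

def true_error(intervals):
    """True error of h_I under the distribution of part (a):
    h_I predicts 1 inside `intervals` and 0 outside."""
    high = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]  # P[y=1|x] = 0.8
    low = [(0.2, 0.4), (0.6, 0.8)]               # P[y=1|x] = 0.1
    # Total length of the positive region of h_I inside each zone
    pos_high = sum(intersection_length(l, u, a, b) for (l, u) in intervals for (a, b) in high)
    pos_low = sum(intersection_length(l, u, a, b) for (l, u) in intervals for (a, b) in low)
    neg_high = 0.6 - pos_high  # the "high" zones have total length 0.6
    neg_low = 0.4 - pos_low    # the "low" zones have total length 0.4
    # Predicting 1 errs with probability P[y=0|x]; predicting 0 errs with P[y=1|x]
    return 0.2 * pos_high + 0.9 * pos_low + 0.8 * neg_high + 0.1 * neg_low
```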
(d) (7 points) Draw a data set of m = 1500 samples. Find the best ERM hypothesis for k = 1, 2, ..., 20, and plot the empirical and true errors as a function of k. How does the error behave? Define k* to be the k with the smallest empirical error for ERM. Does this mean that the hypothesis with k* intervals is a good choice?
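A possible outline for this experiment, reusing sample_from_D, true_error and the assumed return convention of find_best_interval from the sketches above:

```python
xs, ys = sample_from_D(1500)
empirical_errors, true_errors = [], []
for k in range(1, 21):
    intervals, error_count = find_best_interval(list(xs), list(ys), k)
    empirical_errors.append(error_count / len(xs))  # empirical error rate
    true_errors.append(true_error(intervals))
# k* = the k minimizing the empirical error (k ranges over 1..20)
k_star = 1 + min(range(20), key=lambda i: empirical_errors[i])
```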
(e) (7 points) Now we will use the principle of structural risk minimization (SRM) to search for a k that gives a good test error. Let δ = 0.1:
- Use your results from question 3 in the theoretical part and the VC confidence bound to construct a penalty function (an illustrative sketch follows this list).
- Draw a data set of m = 1500 samples and run the experiment from (d) again, but now plot two additional curves as a function of k: (1) the penalty of the best ERM hypothesis and (2) the sum of the penalty and the empirical error.
- What is the best value of k in each case? Is it better than the one you chose in (d)?
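The exact penalty should come from your answer to question 3 in the theoretical part. As an illustration only, here is a placeholder sketch that uses VCdim(H_k) = 2k and one common form of a VC confidence bound; the bound itself and its constants are assumptions, not the required answer. It reuses empirical_errors from the sketch in (d).

```python
import numpy as np

def penalty(k, m, delta=0.1):
    """Placeholder VC-type penalty for H_k (VCdim(H_k) = 2k).
    Replace with the bound derived in the theoretical part."""
    d = 2 * k
    return np.sqrt((d * (np.log(2.0 * m / d) + 1) + np.log(4.0 / delta)) / m)

# SRM selection: minimize empirical error + penalty over k = 1..20
penalties = [penalty(k, 1500) for k in range(1, 21)]
srm_scores = [e + p for e, p in zip(empirical_errors, penalties)]
k_srm = 1 + min(range(20), key=lambda i: srm_scores[i])
```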
(f) Here we will use holdout validation to search for a k that gives a good test error. Draw a data set of m = 1500 samples and use 20% of it as a holdout validation set. Choose the best hypothesis based on 3 experiments. Discuss how close this gets you to finding the hypothesis with the optimal true error.
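A minimal sketch of the holdout split and selection, reusing sample_from_D, eval_hypothesis and find_best_interval from the sketches above; the random 80/20 split and the helper names are our own choices.

```python
import numpy as np

xs, ys = sample_from_D(1500)
idx = np.random.permutation(len(xs))                 # random 80/20 split
train_idx, hold_idx = np.sort(idx[:1200]), np.sort(idx[1200:])
train_x, train_y = xs[train_idx], ys[train_idx]      # sorted, as find_best_interval expects
hold_x, hold_y = xs[hold_idx], ys[hold_idx]

def holdout_error(intervals, hold_x, hold_y):
    """Fraction of holdout points misclassified by h_I."""
    preds = [eval_hypothesis(intervals, x) for x in hold_x]
    return sum(p != y for p, y in zip(preds, hold_y)) / len(hold_y)

best_k, best_err = None, float("inf")
for k in range(1, 21):
    intervals, _ = find_best_interval(list(train_x), list(train_y), k)
    err = holdout_error(intervals, hold_x, hold_y)
    if err < best_err:
        best_k, best_err = k, err
```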