CS 189 Introduction to Machine Learning Fall 2017

This homework is due Friday, December 1 at 10 pm.

1 Getting Started


You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way! Deliverables:

  1. Submit a PDF of your write-up to the assignment on Gradescope, HW[n] Write-Up.
  2. Submit all code needed to reproduce your results, HW[n] Code.
  3. Submit your test set evaluation results, HW[n] Test Set.

After you've submitted your homework, be sure to watch out for the self-grade form.

  (a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
  (b) Please copy the following statement and sign next to it:

    I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write-up.


2 Function approximation via neural networks

In this problem, you will prove a special case of a classical result in the approximation theory of neural networks. In particular, you will show that given a natural class of functions, a two-layer neural network (having just one hidden layer) with certain activation functions approximates every function in the class exponentially better than polynomial regression using the same number of coefficients. Clearly then, neural networks incur a smaller bias than polynomials for the same number of parameters. The related question of the variance of neural network models is also interesting, but will not be addressed by this problem.

The class of functions that we will be interested in approximating is given by the set

$$\mathcal{F}_{\cos} = \left\{ f : \mathbb{R}^d \to \mathbb{R} \ \text{ s.t. } \ f(x) = \frac{1}{2\|w\|_1}\cos(\langle w, x\rangle) \ \text{ for some } w \neq 0 \right\}.$$

Here, $0$ denotes the zero vector in $\mathbb{R}^d$.

Define a thresholding function $\phi : \mathbb{R} \to \mathbb{R}$ as any bounded function for which $\phi(z) \to 1$ as $z \to \infty$ and $\phi(z) \to 0$ as $z \to -\infty$. We will consider a neural network function $h : \mathbb{R}^d \to \mathbb{R}$ of the form

$$h(x) = \sum_{k=1}^{p} c_k \, \phi(\langle w_k, x\rangle + b_k),$$

where $(w_k, b_k, c_k) \in \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}$ for each $k = 1, 2, \ldots, p$. Denote the class of such functions having $p$ parameters and $\sum_{k=1}^{p} |c_k| \le 1$ by $\mathcal{H}_p$.
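
As a concrete illustration of the class $\mathcal{H}_p$ (an editorial sketch, not part of the assignment: the parameter values are arbitrary, and the logistic sigmoid is just one admissible choice of $\phi$):

```python
import numpy as np

def sigmoid(z):
    # A thresholding function: tends to 1 as z -> +inf and to 0 as z -> -inf.
    return 1.0 / (1.0 + np.exp(-z))

def h(x, W, b, c):
    # h(x) = sum_k c_k * phi(<w_k, x> + b_k), with W of shape (p, d),
    # b and c of shape (p,), and x of shape (d,).
    return np.dot(c, sigmoid(W @ x + b))

# Arbitrary example with p = 3 hidden units in d = 2 dimensions;
# sum_k |c_k| = 1 <= 1, so this h belongs to H_p.
np.random.seed(0)
W = np.random.randn(3, 2)
b = np.random.randn(3)
c = np.array([0.5, -0.3, 0.2])
print(h(np.array([0.25, 0.75]), W, b, c))
```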

Given any function $f \in \mathcal{F}_{\cos}$, we will show that it can be approximated effectively by a linear combination of a small number of threshold functions. In particular, we will measure the quality of approximation by the average error over the set $[0,1]^d$ as

$$\mathcal{E}(f, p) = \inf_{h \in \mathcal{H}_p} \int_{x \in [0,1]^d} (f(x) - h(x))^2 \, dx. \tag{1}$$

In this problem, you will show that for all functions $f \in \mathcal{F}_{\cos}$, we have

$$\mathcal{E}(f, p) \le \frac{1}{p},$$

thus showing that we can approximate any function in $\mathcal{F}_{\cos}$ to within error $\epsilon$ using a linear combination of at most $1/\epsilon$ threshold functions. Since each threshold function involves $d + 2$ parameters, we will have shown that an $\epsilon$-approximation is achievable using $O\!\left(\frac{d}{\epsilon}\right)$ scalar parameters.

In contrast, for polynomial approximators, we will show that at least $\left(\frac{1}{\epsilon}\right)^{d}$ parameters are necessary to obtain a similar $\epsilon$-approximation.

Parts (c)-(h) will walk you through a proof that linear combinations of thresholding functions (examples of which you will see in part (a)) are good approximators of functions in $\mathcal{F}_{\cos}$. You will see examples of functions in $\mathcal{F}_{\cos}$ in part (b). We will release a note on Piazza showing the approximation lower bound for polynomial functions mentioned above, so you have an idea why polynomials perform worse.


Disclaimer: We will deliberately avoid some mathematical subtleties in the following proof that are essential to handle in order for it to qualify as a complete argument. The goal here is for you to understand the concrete higher order message.

  (a) Verify that the following commonly used functions are thresholding functions:

    (a) $\phi(t) = \dfrac{1}{1 + e^{-t}}$

    (b) $\phi(t) = \mathrm{ReLU}(t) - \mathrm{ReLU}(t - 1)$
  (b) Fix $d = 2$, and show 3D plots of functions in the class $\mathcal{F}_{\cos}$ for five sufficiently different choices of $w$; you will be plotting $f(x)$ for $x \in [0,1]^2$. The goal is for you to convince yourself that these are not pathologically constructed functions; they look like many underlying regression functions we might want to approximate in practice. What is true but will not be shown in this problem is that many practical functions of interest, including Gaussian PDFs and functions with sufficiently many derivatives, can be expressed as a linear combination of functions in $\mathcal{F}_{\cos}$.
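
A minimal plotting sketch for this part (assuming numpy and matplotlib are available; the five choices of $w$ below are arbitrary examples, not prescribed by the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

def f_cos(X1, X2, w):
    # f(x) = cos(<w, x>) / (2 ||w||_1), evaluated on a grid over [0, 1]^2.
    return np.cos(w[0] * X1 + w[1] * X2) / (2.0 * np.sum(np.abs(w)))

ws = [np.array([1.0, 1.0]), np.array([5.0, 1.0]), np.array([3.0, -4.0]),
      np.array([10.0, 10.0]), np.array([-2.0, 7.0])]

grid = np.linspace(0.0, 1.0, 100)
X1, X2 = np.meshgrid(grid, grid)

fig = plt.figure(figsize=(15, 3))
for i, w in enumerate(ws):
    ax = fig.add_subplot(1, 5, i + 1, projection='3d')
    ax.plot_surface(X1, X2, f_cos(X1, X2, w), cmap='viridis')
    ax.set_title("w = {}".format(w))
plt.show()
```
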
  (c) Define the closure $\mathrm{cl}(T)$ of a set $T$ as the union of $T$ with its limit points, e.g. the closure of the set $\{x \in \mathbb{R}^d : \|x\|_2 < 1\}$ is given by the set $\{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$.

    Define the step function $S(t)$ to be the indicator function that $t \ge 0$, i.e., $S(t) = 1$ for all $t \ge 0$, and $0$ otherwise. Show that for each $(w, b) \in \mathbb{R}^d \times \mathbb{R}$, we have

    $$S(\langle w, x\rangle + b) \in \mathrm{cl}\big(\{p : p(x) = \phi(\langle w, x\rangle + b) \text{ or } p(x) = -\phi(\langle w, x\rangle + b), \text{ for some } w, b\}\big).$$

    Showing pointwise convergence is sufficient.

    Hint: It should be sufficient to show that threshold functions with appropriately scaled arguments are step functions in the limit. Show this, and argue why it is sufficient.
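
One way to read the hint is sketched below; this glosses over the boundary case $\langle w, x\rangle + b = 0$, which is among the subtleties the disclaimer above sets aside.

```latex
% For lambda > 0, p_lambda(x) = phi(lambda(<w, x> + b)) is of the allowed form,
% with parameters w' = lambda * w and b' = lambda * b. Pointwise in x,
\phi\big(\lambda(\langle w, x\rangle + b)\big)
  \;\xrightarrow{\;\lambda \to \infty\;}\;
  \begin{cases}
    1, & \langle w, x\rangle + b > 0,\\
    0, & \langle w, x\rangle + b < 0,
  \end{cases}
% which agrees with S(<w, x> + b) away from the hyperplane <w, x> + b = 0; the step
% function is therefore a pointwise limit of functions in the set, hence in its closure.
```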

  (d) We will now try to approximate the function $\cos(\langle w, x\rangle)$ with a linear combination of step functions of the form $S(\langle w, x\rangle + b)$. Notice that it is sufficient to simply approximate the scalar function $\cos(y)$ by step functions of $y$, since the arguments of the two functions are identical. Also notice that since we only care about approximating the function within the set $\|x\|_\infty \le 1$, we have by Hölder's inequality that $|y| = |\langle w, x\rangle| \le \|w\|_1 \|x\|_\infty \le \|w\|_1$. Executing the change of variables $y' = y / \|w\|_1$, it now suffices to approximate the function $c(y') = \cos(\|w\|_1 y')$ within the range $y' \in [-1, 1]$.

    Show that for any fixed $\epsilon > 0$, we can design a linear combination of step functions that is $\epsilon$-close to $c(y')$ for every $y' \in [-1, 1]$.

    Hint: Recall how a continuous function can be successively approximated by finer and finer piecewise constant functions. It may be helpful to define a finite collection of points $P$ on the set $[-1, 1]$ such that $c$ changes its value by at most $\epsilon$ between successive points.
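
A small numerical sketch of the staircase construction the hint describes (the value of $\|w\|_1$, the accuracy $\epsilon$, and the uniform grid are assumptions made only for this illustration):

```python
import numpy as np

w_l1 = 4.0                      # an arbitrary value of ||w||_1
eps = 0.1                       # target accuracy
c = lambda y: np.cos(w_l1 * y)  # the function to approximate on [-1, 1]

# Grid points z_0 < ... < z_n on [-1, 1]; since |c'| <= ||w||_1, a spacing of
# eps / ||w||_1 guarantees |c(z_i) - c(z_{i-1})| <= eps between successive points.
z = np.arange(-1.0, 1.0 + eps / w_l1, eps / w_l1)

def staircase(y):
    # Linear combination of step functions:
    #   c(z_0) S(y - z_0) + sum_{i>=1} (c(z_i) - c(z_{i-1})) S(y - z_i),
    # which equals c(z_j) on each interval [z_j, z_{j+1}).
    S = (y[:, None] >= z[None, :]).astype(float)
    coeffs = np.concatenate(([c(z[0])], np.diff(c(z))))
    return S @ coeffs

ys = np.linspace(-1.0, 1.0, 2001)
print("max |c - staircase| =", np.max(np.abs(c(ys) - staircase(ys))))  # stays below eps
```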

  (e) Let us consider the absolute sum of coefficients of the linear combination you computed above. You should have computed it to be

    $$\sum_{i} \big|c(z_i) - c(z_{i-1})\big|,$$

    where the points $\{z_i\}_i$ are the points defined in $P$. Show that this sum is bounded by $2\|w\|_1$. Conclude that for every $w \neq 0$, we have approximated $\cos(\langle w, x\rangle)$ by a linear combination of step functions having sum of absolute coefficients at most $2\|w\|_1$.

    Hint: Finite differences are derivatives in the limit. The sum can be upper bounded by an appropriate integral.
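
A sketch of the bound the hint points at, written for the coefficients $c(z_i) - c(z_{i-1})$ from the construction in part (d):

```latex
\sum_i \big|c(z_i) - c(z_{i-1})\big|
  \;\le\; \int_{-1}^{1} |c'(y)|\, dy
  \;=\; \int_{-1}^{1} \|w\|_1 \,\big|\sin(\|w\|_1 y)\big|\, dy
  \;\le\; 2\,\|w\|_1 .
```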

  (f) Denote the closed convex hull of a set $T$ by $\mathrm{conv}(T)$; this is simply the closure of the convex hull of $T$. Using the previous parts, show that

    $$\mathcal{F}_{\cos} \subseteq \mathrm{conv}\big(\{p : p(x) = \phi(\langle w, x\rangle + b) \text{ or } p(x) = -\phi(\langle w, x\rangle + b), \text{ for some } w, b\}\big).$$

    We have thus shown that the function class of interest can be represented by a convex combination of threshold functions and negative threshold functions.

  (g) How many threshold functions do you need to approximate any function $f \in \mathcal{F}_{\cos}$ to within some error $\epsilon$? This will be shown in the next two parts via an existence argument.

    Notice that, roughly speaking, we have by definition of the convex hull that $f = \sum_{i=1}^{m} c_i g_i$ for some $m$ functions $g_i(x) = \pm\phi(\langle w_i, x\rangle + b_i)$, and $c_i \ge 0$ with $\sum_i c_i = 1$. Let $G$ be a random variable such that $G = g_i$ with probability $c_i$. With $p$ independent samples $G_1, G_2, \ldots, G_p$, define

    $$f_p = \frac{1}{p} \sum_{i=1}^{p} G_i.$$

    Let $\|g - h\|^2 := \int_{x \in [0,1]^d} (g(x) - h(x))^2 \, dx$ for any two functions $g$ and $h$. Show that

    $$\mathbb{E}\big[\|f_p - f\|^2\big] \le \frac{1}{p}, \tag{2}$$

    where the expectation is taken over the random samples $G_1, \ldots, G_p$. You may assume that expectations commute with integrals, i.e., you can switch the order in which you take the two.
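
A sketch of one route to the bound, under the simplifying assumption (consistent with building the $g_i$ out of step functions as in the earlier parts) that $|g_i(x)| \le 1$ on $[0,1]^d$:

```latex
% For each fixed x, the G_i(x) are i.i.d. with mean E[G(x)] = sum_i c_i g_i(x) = f(x), so
\mathbb{E}\big[(f_p(x) - f(x))^2\big]
  = \operatorname{Var}\big(f_p(x)\big)
  = \frac{1}{p}\operatorname{Var}\big(G(x)\big)
  \le \frac{1}{p}\,\mathbb{E}\big[G(x)^2\big]
  \le \frac{1}{p}.
% Integrating over x in [0,1]^d (a set of volume 1) and exchanging the expectation
% with the integral then gives E[ ||f_p - f||^2 ] <= 1/p, which is equation (2).
```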

  (h) Use the above part to argue that there must exist a convex combination of $p$ threshold functions such that equation (2) is true for a deterministically chosen $f_p$. Conclude that for all functions $f \in \mathcal{F}_{\cos}$, we have

    $$\mathcal{E}(f, p) \le \frac{1}{p},$$

    where $\mathcal{E}(f, p)$ was defined in equation (1).

3 CNNs on Fruits and Veggies

In this problem, we will use the dataset of fruits and vegetables that was collected in HW5 and HW6. The goal is to accurately classify the produce in the image. In prior homework, we explored how to select features and then use linear classification to learn a function. We will now


explore using Convolutional Neural Networks to optimize feature selection jointly with learning a classification policy.

Denote the input state $x \in \mathbb{R}^{90 \times 90 \times 3}$, which is a down-sampled RGB image with the fruit centered in it. Each data point will have a corresponding class label, which corresponds to its matching produce. Given 25 classes, we can denote the label as $y \in \{0, \ldots, 24\}$.

The goal of this problem is twofold. First, you will learn how to implement a Convolutional Neural Network (CNN) using TensorFlow. Then we will explore some of the mysteries of why neural networks work as well as they do, in the context of a bias-variance trade-off.

Note: all Python packages needed for the project will already be imported. DO NOT import new Python libraries. Also, this project will be computationally expensive on your computer's CPU. Please message us on Piazza if you do not have a strong computer and we can arrange for you to use EC2.

  (a) To begin the problem, we need to implement a CNN in TensorFlow. In order to reduce the burden of implementation, we are going to use a TensorFlow wrapper known as Slim. The starter code contains a file named cnn.py, in which the network architecture and the loss function are currently blank. Using the slim library, you will have to write a convolutional neural network that has the following architecture:

    (a) Layer 1: A convolutional layer with 5 filters of size 15 by 15
    (b) Non-Linear Response: Rectified Linear Units
    (c) A max pooling operation with filter size of 3 by 3
    (d) Layer 2: A fully connected layer with output size 512
    (e) Non-Linear Response: Rectified Linear Units
    (f) Layer 3: A fully connected layer with output size 25 (i.e. the class labels)
    (g) Loss Layer: Softmax Cross Entropy Loss

    In the file example_cnn.py, we show how to implement a network in TensorFlow Slim. Please use this as a reference. Once the network is implemented, run the script test_cnn_part_a.py on the dataset and report the resulting confusion matrix. The goal is to ensure that your network compiles, but we should not expect the results to be good because it is randomly initialized.
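
For orientation, here is a minimal TF-Slim sketch of an architecture matching the list above. The function and argument names are illustrative guesses rather than the starter code's interface, and it assumes one-hot labels; follow example_cnn.py for the layout the graders expect.

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim  # TensorFlow 1.x

def build_network(images, onehot_labels, num_classes=25):
    """images: [batch, 90, 90, 3] float tensor; onehot_labels: [batch, 25]."""
    # Layer 1: 5 filters of size 15x15; slim.conv2d applies a ReLU by default.
    net = slim.conv2d(images, 5, [15, 15], scope='conv1')
    # Max pooling with a 3x3 window.
    net = slim.max_pool2d(net, [3, 3], scope='pool1')
    # Layer 2: fully connected layer with 512 outputs, again with a ReLU.
    net = slim.flatten(net)
    net = slim.fully_connected(net, 512, scope='fc1')
    # Layer 3: fully connected layer producing the 25 class logits (no non-linearity).
    logits = slim.fully_connected(net, num_classes, activation_fn=None, scope='fc2')
    # Loss layer: softmax cross-entropy against the labels.
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    return logits, loss
```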

  (b) The next step to train the network is to complete the pipeline which loads the dataset and offers it as mini-batches to the network. Fill in the missing code in data_manager.py and report your code.
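
A generic sketch of the kind of mini-batching data_manager.py asks for (the array names, batch size, and interface here are placeholders, not the starter code's):

```python
import numpy as np

def minibatches(X, y, batch_size=64, shuffle=True):
    """Yield (X_batch, y_batch) pairs that together cover the dataset once."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.shuffle(idx)                    # new random order each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Usage: for X_b, y_b in minibatches(train_images, train_labels): feed X_b, y_b to the network.
```
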
  (c) We will now complete the iterative optimization loop. Fill in the missing code in trainer.py to iteratively apply SGD for a fixed number of iterations. In our system, we will be using an extra momentum term to help speed up the SGD optimization. Run the file train_cnn.py and report the resulting chart.
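
A sketch of the optimization loop using TensorFlow 1.x's built-in momentum optimizer; the learning rate, momentum value, iteration count, and the data object's next_batch() interface are assumptions for illustration only.

```python
import tensorflow as tf

def train(loss, images_ph, labels_ph, data, num_iters=1000, learning_rate=1e-3, momentum=0.9):
    """Apply SGD with a momentum term for a fixed number of iterations.

    loss: scalar loss tensor; images_ph, labels_ph: the placeholders it was built from;
    data: any object whose next_batch() returns an (images, labels) mini-batch.
    """
    train_op = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss)
    losses = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(num_iters):
            X_b, y_b = data.next_batch()
            _, l = sess.run([train_op, loss], feed_dict={images_ph: X_b, labels_ph: y_b})
            losses.append(l)   # plotting these against iteration number gives the training chart
    return losses
```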



  (d) To better understand how the network was able to achieve the best performance on our fruits and veggies dataset, it is important to understand that it is learning features to reduce the dimensionality of the data. We can see what features were learned by examining the response maps after our convolutional layer.

    The response map is the output image after the convolution has been applied. This image can be interpreted as showing what features are interesting for classification. Fill in the missing code in viz_features.py and report the images specified.
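
A sketch of how the response maps might be pulled out and displayed, assuming access to the first convolutional layer's output tensor; the argument names are hypothetical, and viz_features.py defines the real hooks.

```python
import matplotlib.pyplot as plt

def show_response_maps(sess, conv_out, images_ph, image):
    """conv_out: the conv-layer activation tensor; image: one [90, 90, 3] input."""
    # Run a single image through the network and grab its 5 response maps.
    maps = sess.run(conv_out, feed_dict={images_ph: image[None]})[0]   # shape [H, W, 5]
    fig, axes = plt.subplots(1, maps.shape[-1], figsize=(15, 3))
    for k, ax in enumerate(axes):
        ax.imshow(maps[:, :, k], cmap='gray')   # bright regions = strong filter response
        ax.set_title("filter {}".format(k))
        ax.axis('off')
    plt.show()
```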

  (e) Given that our network has achieved high generalization with such low training error, it suggests that a high variance estimator is appropriate for the task. To better understand why the network is able to work, we can compare it to another high variance estimator such as nearest neighbors. Fill in the missing code in nn_classifier.py and report the performance as the number of neighbors is swept across when train_nn.py is run.
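
A plain-numpy sketch of the nearest-neighbor comparison, written without extra imports in keeping with the no-new-libraries rule; flattening the images into vectors and the particular values of k swept are assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote over the k training points closest in Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # X_train: flattened images, shape (N, d)
        nearest = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def sweep_k(X_train, y_train, X_val, y_val, ks=(1, 2, 5, 10, 20, 50)):
    # Record validation accuracy as the number of neighbors is swept.
    return {k: np.mean(knn_predict(X_train, y_train, X_val, k) == y_val) for k in ks}
```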

4 Your Own Question

Write your own question, and provide a thorough solution.

Writing your own problems is a very important way to really learn material. The famous Bloom's Taxonomy that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. make yourself flashcards). But we don't want the same to be true about the highest level.

As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.

Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don't have to achieve this every week. But unless you try every week, it probably won't happen ever.

