Machine Learning
Lecture 6: Optimization
Prof. Dr. Günnemann
Data Analytics and Machine Learning, Technical University of Munich


Winter term 2020/2021

Motivation
Many machine learning tasks are optimization problems. Examples we've already seen:
Linear Regression: w* = argmin_w 1/2 (Xw − y)^T (Xw − y)
Logistic Regression: w* = argmin_w − ln p(y | w, X)
Other examples:
Support Vector Machines: find the hyperplane that separates the classes with a maximum margin
k-means: find clusters and centroids such that the squared distances are minimized
Matrix Factorization: find matrices that minimize the reconstruction error
Neural networks: find weights such that the loss is minimized
And many more

General Task
Let θ denote the variables/parameters of our problem that we want to learn
e.g. θ = w in Logistic Regression
Let X denote the domain of θ, i.e. the set of valid instantiations
→ constraints on the parameters!
e.g. X = set of (positive) real numbers
Let f(θ) denote the objective function, e.g. f is the negative log-likelihood
Goal: Find the solution θ* minimizing the function f: θ* = argmin_{θ ∈ X} f(θ)
i.e. find a global minimum of the function f!
Similarly, for some problems we are interested in finding the maximum

Introductory Example
Goal: Find the minimum of the function f(θ) = 0.6θ⁴ − 5θ³ + 13θ² − 12θ + 5
Unconstrained optimization + differentiable function
Necessary condition for minima:
Gradient = 0. Sufficient?
General challenge: multiple local minima are possible

Convexity: Sets
X is a convex set iff
for all x, y ∈ X it follows that λ·x + (1 − λ)·y ∈ X for λ ∈ [0, 1]

Convexity: Functions
f(x) is a convex function on the convex set X iff
for all x, y ∈ X: λ·f(x) + (1 − λ)·f(y) ≥ f(λ·x + (1 − λ)·y) for λ ∈ [0, 1]
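Not part of the slides: a small numerical sanity check of this definition, assuming NumPy. It samples random points and weights λ for the (known convex) function f(x) = ||x||² and verifies the inequality; this illustrates the definition, it does not prove convexity.

```python
import numpy as np

# Check the convexity definition for f(x) = ||x||^2, convex on all of R^d:
# lambda*f(x) + (1-lambda)*f(y) >= f(lambda*x + (1-lambda)*y)
rng = np.random.default_rng(0)
f = lambda x: np.dot(x, x)

for _ in range(10_000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform(0.0, 1.0)
    chord = lam * f(x) + (1 - lam) * f(y)   # value on the chord between f(x) and f(y)
    func = f(lam * x + (1 - lam) * y)       # function value at the interpolated point
    assert chord >= func - 1e-12
print("Convexity inequality holds at all sampled points.")
```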

Convexity and minimization problems
The region above a convex function (its epigraph) is convex:
f(λ·x + (1 − λ)·y) ≤ λ·f(x) + (1 − λ)·f(y), hence the segment between (x, f(x)) and (y, f(y)) stays inside the epigraph for x, y ∈ X
Convex functions have no local minima which are not global minima
Proof by contradiction: linear interpolation towards the better point breaks the local minimum
Each local minimum is a global minimum
A zero gradient implies a (local) minimum for convex functions:
if f is a convex differentiable function and ∇f(θ*) = 0, then θ* is a global minimum
Minimization becomes relatively easy

First order convexity conditions (I)
Convexity imposes a bound on the rate of rise of the function: f((1 − t)·x + t·y) ≤ (1 − t)·f(x) + t·f(y)
Rearranging gives
f(y) − f(x) ≥ [f((1 − t)·x + t·y) − f(x)] / t
The difference between f(y) and f(x) is bounded from below by the function values between x and y

First order convexity conditions (II)
f(y) − f(x) ≥ [f((1 − t)·x + t·y) − f(x)] / t
Let t → 0 and apply the definition of the derivative:
f(y) − f(x) ≥ (y − x)^T ∇f(x)
Theorem:
Suppose f : X → R is a differentiable function and X is convex. Then f is convex iff for all x, y ∈ X
f(y) ≥ f(x) + (y − x)^T ∇f(x)
Proof: see Boyd & Vandenberghe, p. 70

Verifying convexity (I)
Convexity makes optimization easier
How to verify whether a function is convex?
For example: is e^(x1 + 2·x2) + x1 − log(x2) convex on [1, ∞) × [1, ∞)?
1. Prove whether the definition of convexity holds (see the definition of convex functions above)
2. Exploit special results
First order convexity (see the theorem above)
Example: A twice differentiable function of one variable is convex on an interval if and only if its second derivative is non-negative on this interval
More generally: a twice differentiable function of several variables is convex (on a convex set) if and only if its Hessian matrix is positive semidefinite (on that set)
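Not on the slides: a small numerical illustration of the Hessian criterion for the example function, read as e^(x1 + 2·x2) + x1 − log(x2). The analytic Hessian is worked out by hand and checked on a finite grid inside [1, ∞) × [1, ∞), so this is a sanity check rather than a proof.

```python
import numpy as np

# Hessian of f(x1, x2) = exp(x1 + 2*x2) + x1 - log(x2):
#   H11 = e,  H12 = 2e,  H22 = 4e + 1/x2^2,  with e = exp(x1 + 2*x2)
def hessian(x1, x2):
    e = np.exp(x1 + 2 * x2)
    return np.array([[e,     2 * e],
                     [2 * e, 4 * e + 1.0 / x2**2]])

grid = np.linspace(1.0, 5.0, 50)            # finite stand-in for [1, inf)
for x1 in grid:
    for x2 in grid:
        eigvals = np.linalg.eigvalsh(hessian(x1, x2))
        assert eigvals.min() >= -1e-9        # positive semidefinite at this point
print("Hessian is PSD on all sampled grid points, consistent with convexity.")
```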

Verifying convexity (II)
3. Show that the function can be obtained from simple convex functions by operations that preserve convexity
a) Start with simple convex functions, e.g.
f(x) = const and f(x) = x^T b (there are also simple concave functions)
b) Apply construction rules (next slide)

Convexity preserving operations
Let f1 : R^d → R and f2 : R^d → R be convex functions, and g : R^d → R be a concave function. Then:
h(x) = f1(x) + f2(x) is convex
h(x) = max{f1(x), f2(x)} is convex
h(x) = c·f1(x) is convex if c ≥ 0
h(x) = c·g(x) is convex if c ≤ 0
h(x) = f1(Ax + b) is convex (A a matrix, b a vector)
h(x) = m(f1(x)) is convex if m : R → R is convex and nondecreasing
Example: e^(x1 + 2·x2) + x1 − log(x2) is convex on, e.g., [1, ∞) × [1, ∞): e^(x1 + 2·x2) is a convex nondecreasing function composed with an affine map, x1 is affine, −log(x2) is convex for x2 > 0, and the sum of convex functions is convex

Verifying convexity of sets
1. Prove the definition
often easier for sets than for functions
2. Apply the intersection rule
Let A and B be convex sets; then A ∩ B is a convex set

An easy problem
Convex objective function f
Objective function differentiable on its whole domain
i.e. we are able to compute the gradient ∇f(θ) at every point
We can solve ∇f(θ) = 0 for θ analytically
i.e. the solution θ* where the gradient is 0 is available in closed form
Unconstrained minimization
i.e. the solution computed above for θ is valid
We are done!
Example: Ordinary Least Squares Regression
f(x) = x² + e^(x−3) − 2x + 7
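Not on the slides: a minimal NumPy sketch of the OLS example, solving ∇f(w) = X^T(Xw − y) = 0 analytically via the normal equations (the synthetic data is illustrative).

```python
import numpy as np

# OLS: f(w) = 1/2 ||Xw - y||^2 is convex and differentiable;
# grad f(w) = X^T (Xw - y) = 0  <=>  X^T X w = X^T y (normal equations).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # analytic solution
print(w_star)                                # close to w_true
```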

Unfortunately, many problems are harder
No analytical solution for ∇f(θ) = 0
e.g. Logistic Regression
Solution: try numerical approaches, e.g. gradient descent
Constraints on θ
e.g. ∇f(θ) = 0 only holds for points outside the domain
Solution: constrained optimization
f not differentiable on the whole domain
Potential solution: subgradients; or is it a discrete optimization problem?
f not convex
Potential solution: convex relaxations; convex in some of the variables?

One-dimensional problems
Key Idea
For differentiable f, search for θ with ∇f(θ) = 0
Interval bisection (the derivative is monotonic)
Can be extended to nondifferentiable problems
exploit convexity in the upper bound and keep 5 points
Require: a, b, precision ε
Set A = a, B = b
repeat
if f′((A + B)/2) > 0 then
B = (A + B)/2 // the solution is on the left
else
A = (A + B)/2 // the solution is on the right
end if
until (B − A)·min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
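Not on the slides: a minimal Python sketch of the interval bisection above, assuming the derivative f′ is available and changes sign on [a, b]; the example function is illustrative.

```python
def bisect_minimum(df, a, b, eps=1e-8):
    """Interval bisection for a convex differentiable f given its (monotonic)
    derivative df; assumes df(a) < 0 < df(b)."""
    A, B = a, b
    while (B - A) * min(abs(df(A)), abs(df(B))) > eps:
        mid = (A + B) / 2
        if df(mid) > 0:
            B = mid   # minimum lies to the left of the midpoint
        else:
            A = mid   # minimum lies to the right of the midpoint
    return (A + B) / 2

# Example: f(x) = (x - 3)^2 + 1 with df(x) = 2*(x - 3); minimum at x = 3.
print(bisect_minimum(lambda x: 2 * (x - 3), 0.0, 10.0))
```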

Gradient Descent
Key Idea
The gradient points in the direction of steepest ascent
Locally, the gradient yields a good (linear) approximation of the objective function
GD with Line Search
Get a descent direction, then do an unconstrained line search
This turns a multidimensional problem into a one-dimensional problem that we already know how to solve
given a starting point θ ∈ dom(f), repeat
1. Δθ := −∇f(θ)
2. Line search: t* = argmin_{t > 0} f(θ + t·Δθ)
3. Update: θ := θ + t*·Δθ
until the stopping criterion is satisfied.
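Not on the slides: a minimal NumPy sketch of this loop. The exact arg-min in the line search is replaced here by a simple backtracking search, and the quadratic test problem, tolerances and names are illustrative.

```python
import numpy as np

def gradient_descent_line_search(f, grad, theta0, tol=1e-6, max_iter=1000):
    """GD as on the slide: descent direction = -grad, then a 1-d search over the
    step size t (exact arg-min replaced by backtracking)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        d = -grad(theta)                       # 1. descent direction
        if np.linalg.norm(d) < tol:            # stopping criterion
            break
        t = 1.0                                # 2. approximate line search
        while f(theta + t * d) > f(theta) - 0.5 * t * np.dot(d, d):
            t *= 0.5
        theta = theta + t * d                  # 3. update
    return theta

# Example: minimize the convex quadratic f(theta) = 1/2 theta^T A theta - b^T theta.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda th: 0.5 * th @ A @ th - b @ th
grad = lambda th: A @ th - b
print(gradient_descent_line_search(f, grad, np.zeros(2)))   # approx. A^{-1} b
```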

Gradient Descent convergence
Let p* be the optimal value, θ* be the minimizer (the point where the minimum is attained), and θ^(0) be the starting point
For strongly convex f (replace ≥ with > in the definition of convexity), the residual error ε for the k-th iteration satisfies:
ε = f(θ^(k)) − p* ≤ c^k·(f(θ^(0)) − p*), with c < 1
f(θ^(k)) converges to p* as k → ∞
We must have f(θ^(k)) − p* ≤ ε after at most log((f(θ^(0)) − p*)/ε) / log(1/c) iterations
Linear convergence for strongly convex objectives: k ∝ log(1/ε) // k = number of iterations
i.e. linear when plotting the error on a log scale (old statistics terminology)

Distributed/Parallel implementation
Often problems are of the form f(θ) = Σ_i L_i(θ) + g(θ), where i iterates over, e.g., each data instance
Example OLS regression (with regularization): L_i(w) = (x_i^T w − y_i)², g(w) = λ·||w||²
The gradient can simply be decomposed based on the sum rule
Easy to parallelize/distribute

Basic steps:
given a starting point θ ∈ dom(f), repeat
1. Δθ := −∇f(θ)
2. Line search: t* = argmin_{t > 0} f(θ + t·Δθ)
3. Update: θ := θ + t*·Δθ
until the stopping criterion is satisfied.
Distribute the data over several machines
Compute partial gradients (on each machine, in parallel); aggregate the partial gradients into the final one
BUT: the line search is expensive
for each tested step size: scan through all data points
easy to compute in parallel
but the function might be evaluated multiple times: expensive!
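Not on the slides: a minimal single-process sketch of the sum-rule decomposition for the regularized OLS objective above; the four "machines" are simulated by splitting the index set, and λ and the data are illustrative.

```python
import numpy as np

# f(w) = sum_i (x_i^T w - y_i)^2 + lam * ||w||^2
# By the sum rule, grad f is the sum of the chunk-wise partial gradients
# plus the gradient of the regularizer, so each "machine" handles one chunk.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lam = rng.normal(size=5), 0.1

def partial_grad(X_chunk, y_chunk, w):
    # gradient of sum_i (x_i^T w - y_i)^2 over one data chunk
    return 2 * X_chunk.T @ (X_chunk @ w - y_chunk)

chunks = np.array_split(np.arange(len(y)), 4)                     # pretend: 4 machines
grad_parallel = sum(partial_grad(X[idx], y[idx], w) for idx in chunks) + 2 * lam * w
grad_full = 2 * X.T @ (X @ w - y) + 2 * lam * w                   # single-machine reference
print(np.allclose(grad_parallel, grad_full))                      # True
```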

Scalability analysis
+ Linear time in the number of instances
+ Linear memory consumption in the problem size (not the data)
+ Logarithmic time in the accuracy
+ Perfect scalability
− Multiple passes through the dataset for each iteration

A faster algorithm
Avoid the line search; simply pick the update
θ_{t+1} ← θ_t − η·∇f(θ_t)
η is often called the learning rate
Only a single pass through the data per iteration
Logarithmic iteration bound (as before), if the learning rate is chosen correctly
How to pick the learning rate?
too small: slow convergence
too high: the algorithm might oscillate, no convergence
Interactive tutorial on optimization:
http://www.benfrederickson.com/numerical-optimization/
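Not on the slides: a minimal sketch of this constant-step-size update on an ill-conditioned quadratic (learning rate and test function are illustrative); it also hints at the oscillation issue discussed on the next slides.

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_iters=100):
    """Plain gradient descent with a fixed learning rate lr (no line search)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad(theta)
    return theta

# f(theta) = theta_1^2 + 10 * theta_2^2, minimum at the origin.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])
print(gradient_descent(grad, [3.0, 2.0], lr=0.05))   # converges towards [0, 0]
# With lr = 0.15 the second coordinate oscillates and diverges (|1 - 20*lr| > 1).
```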

The value of η
A too small value for η has two drawbacks:
We find the minimum more slowly
We may get stuck in local minima or at saddle/flat points

The value of η
A too large value for η has one drawback:
You may never find a minimum; oscillations usually occur
One step is enough to overshoot

Learning rate adaptation
Simple solution: let the learning rate be a decreasing function η_t of the iteration number t
so-called learning rate schedule
the first iterations cause large changes in the parameters; later ones do the fine-tuning
convergence is easily guaranteed if lim_{t→∞} η_t = 0
example: η_{t+1} ← γ·η_t for 0 < γ < 1

Learning rate adaptation
Other solutions: incorporate the history of previous gradients
Momentum: m_t ← η·∇f(θ_t) + γ·m_{t−1} // often γ = 0.5
θ_{t+1} ← θ_t − m_t
As long as the gradients point in the same direction, the search accelerates
AdaGrad: a different learning rate per parameter
the learning rate depends inversely on the accumulated strength of all previously computed gradients
large parameter updates (large gradients) lead to small learning rates

Adaptive moment estimation (Adam)
m_t = β1·m_{t−1} + (1 − β1)·∇f(θ_t)
estimate of the first moment (mean) of the gradient
exponentially decaying average of past gradients m_t (similar to momentum)
v_t = β2·v_{t−1} + (1 − β2)·(∇f(θ_t))²
estimate of the second moment (uncentered variance) of the gradient
exponentially decaying average of past squared gradients v_t
To avoid a bias towards zero (due to the initialization with 0s), use the bias-corrected versions instead:
m̂_t = m_t / (1 − β1^t), v̂_t = v_t / (1 − β2^t)
Finally, the Adam update rule for the parameters θ:
θ_{t+1} = θ_t − η / (√v̂_t + ε) · m̂_t
Default values: β1 = 0.9, β2 = 0.999, ε = 10^−8
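Not on the slides: a minimal NumPy sketch of the Adam update above. The defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 are the ones from the slide; the learning rate, test function and iteration count are illustrative.

```python
import numpy as np

def adam(grad, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=2000):
    """Adam: decaying averages of gradients (m) and squared gradients (v),
    bias correction, then a per-parameter step."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_iters + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g            # first moment estimate
        v = beta2 * v + (1 - beta2) * g**2         # second (uncentered) moment estimate
        m_hat = m / (1 - beta1**t)                 # bias-corrected estimates
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize f(theta) = theta_1^2 + 10 * theta_2^2.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])
print(adam(grad, [3.0, 2.0]))    # approaches [0, 0]
```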
Visualizing gradient descent variants
AdaGrad and its variants often have faster convergence
and might help to escape saddle points
See http://sebastianruder.com/optimizing-gradient-descent/

Discussion
Gradient descent and similar techniques are called first-order optimization techniques
they only exploit information from the gradients (i.e. the first-order derivative)
Higher-order techniques use higher-order derivatives
e.g. second order = Hessian matrix
Example: Newton Method

Newton method
Convex objective function f
Nonnegative second derivative: ∇²f(θ) ⪰ 0 // Hessian matrix
∇²f(θ) ⪰ 0 means that the Hessian is positive semidefinite
Taylor expansion of f at the point θ_t:
f(θ_t + δ) = f(θ_t) + δ^T·∇f(θ_t) + 1/2·δ^T·∇²f(θ_t)·δ + O(‖δ‖³)
Minimizing the approximation leads to
θ_{t+1} ← θ_t − [∇²f(θ_t)]^{−1}·∇f(θ_t)
Repeat until convergence

Parallel Newton method
+ Good rate of convergence
+ Few passes through the data needed
+ Parallel aggregation of gradient and Hessian
+ Gradient requires O(d) data
− Hessian requires O(d²) data
− Update step is O(d³) and nontrivial to parallelize
Use it only for low-dimensional problems!

Large scale optimization
Higher-order techniques have nice properties (e.g. convergence), but they are prohibitively expensive for high-dimensional problems
For large-scale data / high-dimensional problems use first-order techniques, i.e. variants of gradient descent
But for real-world large-scale data even first-order methods are too slow
Solution: Stochastic optimization!

Motivation: Stochastic Gradient Descent
Goal: minimize f(θ) = Σ_{i=1}^n L_i(θ) + potential constraints
For very large data: even a single pass through the data is very costly
Lots of time is required to even compute the very first gradient
Is it possible to update the parameters more frequently/faster?

Stochastic Gradient Descent
Consider the task as empirical risk minimization:
(1/n)·Σ_{i=1}^n L_i(θ) = E_{i ∈ {1,…,n}}[L_i(θ)]
The (exact) expectation can be approximated by a smaller sample:
E_{i ∈ {1,…,n}}[L_i(θ)] ≈ (1/|S|)·Σ_{j ∈ S} L_j(θ) // with S ⊆ {1,…,n}
or equivalently: Σ_{i=1}^n L_i(θ) ≈ (n/|S|)·Σ_{j ∈ S} L_j(θ)

Stochastic Gradient Descent
Intuition: instead of using the exact gradient, compute only a noisy (but still unbiased) estimate based on a smaller sample
Stochastic gradient descent:
1. randomly pick a (small) subset S of the points → the so-called mini-batch
2. compute the gradient based on the mini-batch
3. update: θ_{t+1} ← θ_t − η·(n/|S|)·Σ_{j ∈ S} ∇L_j(θ_t)
4. pick a new subset and repeat from 2
The original SGD uses mini-batches of size 1
larger mini-batches lead to more stable gradients (i.e. smaller variance in the estimated gradient)
In many cases, the data is sampled so that we don't see any data point twice. Then, each full iteration over the complete data set is called one epoch.
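Not on the slides: a minimal NumPy sketch of this mini-batch SGD loop for the OLS loss L_i(w) = (x_i^T w − y_i)² used earlier; batch size, learning rate and the synthetic data are illustrative.

```python
import numpy as np

# Mini-batch SGD for f(w) = sum_i (x_i^T w - y_i)^2: per iteration, estimate the
# full gradient from a random subset S, rescaled by n/|S| as on the slide.
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
batch_size, lr = 32, 1e-5
for t in range(2000):
    S = rng.choice(n, size=batch_size, replace=False)                 # 1. pick mini-batch
    grad_est = (n / batch_size) * 2 * X[S].T @ (X[S] @ w - y[S])      # 2. noisy gradient
    w = w - lr * grad_est                                             # 3. update
print(np.round(w - w_true, 2))                                        # close to zero
```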
Example: Perceptron
Simple linear binary classifier: ŷ(x) = 1 if w^T·x + b > 0, and −1 otherwise
Learning task:
Given (x1, y1), …, (xn, yn) with yi ∈ {−1, 1}, find min_{w,b} Σ_i L(yi, w^T·xi + b)
L is the loss function, with margin parameter ε > 0, e.g.
L(u, v) = max(0, ε − u·v) = ε − u·v if u·v < ε (incorrect prediction), and 0 else (correct prediction)
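Not on the slides: a minimal NumPy sketch of this learning task on toy separable data, using the SGD update rule that is derived on the next slide (η, ε and the data are illustrative).

```python
import numpy as np

# Perceptron trained with SGD on the margin loss above:
# if y_i * (w^T x_i + b) <= eps, update w <- w + eta*y_i*x_i and b <- b + eta*y_i.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+3.0, size=(50, 2)),    # positive class
               rng.normal(loc=-3.0, size=(50, 2))])   # negative class
y = np.array([1] * 50 + [-1] * 50)

w, b, eta, eps = np.zeros(2), 0.0, 0.1, 0.1
for epoch in range(100):
    mistakes = 0
    for i in rng.permutation(len(y)):                 # one pass in random order
        if y[i] * (X[i] @ w + b) <= eps:              # margin violated -> nonzero gradient
            w += eta * y[i] * X[i]
            b += eta * y[i]
            mistakes += 1
    if mistakes == 0:                                  # all points classified correctly
        break
print(w, b, "passes needed:", epoch + 1)
```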
Example: Perceptron
Let's solve this problem via SGD:
initialize w = 0 and b = 0
repeat
if yi·(w^T·xi + b) < ε then
w ← w + η·yi·xi and b ← b + η·yi
end if
until all points are classified correctly
Note: nothing happens if a point is classified correctly, since the gradient is zero
Does this remind you of the original learning rule for the perceptron?

Convergence in expectation
Subject to relatively mild assumptions, stochastic gradient descent converges
almost surely to a global minimum when the objective function is convex
almost surely to a local minimum for non-convex functions
The expectation of the residual error decreases with speed E[ε] ∝ t^(−1) // i.e. t ∝ E[ε]^(−1)
Note: standard GD has speed t ∝ log(1/ε)
faster convergence speed, but each iteration takes longer

Optimizing Logistic Regression
Recall: we wanted to solve w* = argmin_w E(w) with
E(w) = −ln p(y | w, X) = −Σ_i [ yi·ln σ(w^T·xi) + (1 − yi)·ln(1 − σ(w^T·xi)) ]
A closed-form solution does not exist
Solution: compute the gradient ∇E(w) and find w* using gradient descent
Is E(w) convex? Can you use SGD? How can you choose the learning rate?
What changes if we add regularization, i.e. E_reg(w) = E(w) + λ·||w||² ?

Large-Scale Learning: Distributed Learning
So far, we (mainly) assumed a single machine
SGD achieves a speed-up by only operating on a subset of the data
Might still be too slow when operating with really large data and large models
In practice: we often have multiple machines available → distributed learning
Distribute the computation across multiple machines
Core challenge: distribute the work so that communication doesn't kill the gains

Distributed Learning: Data vs. Model Parallelism
Use multiple model replicas to process different examples at the same time
all collaborate to update the model state (parameters) in sh…
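Not on the slides: as a sketch of the gradient asked for on the "Optimizing Logistic Regression" slide above, ∇E(w) works out to X^T(σ(Xw) − y). Below is a minimal NumPy version with a finite-difference check on random data (all names and data are illustrative).

```python
import numpy as np

# Gradient of the logistic regression NLL E(w) = -sum_i [y_i ln s_i + (1-y_i) ln(1-s_i)],
# with s_i = sigma(w^T x_i):  grad E(w) = X^T (sigma(Xw) - y).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    s = sigmoid(X @ w)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def grad_nll(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

# Finite-difference check of the gradient on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

num = np.array([(nll(w + 1e-6 * e, X, y) - nll(w - 1e-6 * e, X, y)) / 2e-6
                for e in np.eye(3)])
print(np.allclose(num, grad_nll(w, X, y), atol=1e-4))   # True
```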
