
Outline

- Model Selection
- Occam's Razor
- Quantifying Model Complexity
- Popper's Prediction Strength
- Cross-Validation
- Dataset Bias and the Clever Hans Effect
- Bias-Variance Analysis

Learning a Model of the Data
Assuming some observations from some unknown true data distribution p(x, y), we would like to find a model such that some distance D(p, p̂) is minimized. For regression tasks, we can consider the simpler objective:

min_f ∫ ( E_p[y|x] − f(x) )² p̂(x) dx

where p̂ is some guess of the true p, e.g. the empirical distribution.

Figure: 1. observations, 2. fit a model, 3. make predictions.
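To make the three steps concrete, here is a minimal sketch assuming a quadratic ground truth, Gaussian noise, and a least-squares fit on the empirical sample (all constants and the feature map are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. observations drawn from an (unknown to the learner) distribution p(x, y)
x = rng.uniform(-1, 1, 50)
y = 1.5 * x**2 - 0.5 * x + 0.1 * rng.normal(size=50)

# 2. fit a model f by minimizing the squared error under the empirical
#    distribution (the empirical sample plays the role of the guess p-hat)
features = np.vander(x, 3)                       # columns [x^2, x, 1]
w, *_ = np.linalg.lstsq(features, y, rcond=None)

# 3. make predictions at new inputs
x_new = np.linspace(-1, 1, 5)
print(np.vander(x_new, 3) @ w)
```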

Model Selection
Questions:
1. Among models that correctly predict the data, which one should be retained?
2. Should we always choose a model that perfectly fits the data?

Occam's Razor

William of Ockham (1287–1347):

"Entia non sunt multiplicanda praeter necessitatem."

English translation: "Entities must not be multiplied beyond necessity."

Interpreting Occam's Razor

What advantages do we expect from a model based on few assumptions?

- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because simplicity is desirable in itself.
- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because it is likely to have lower generalization error.

Further reading: Domingos (1998), "Occam's Two Razors: The Sharp and the Blunt".

Occam's Razor for Model Selection

Figure: training error vs. number of assumptions; models with more assumptions than needed to fit the data should be discarded according to Occam's razor.

How to Quantify "Few Assumptions"?

Many possibilities:

- Number of free parameters of the model
- Minimum description length (MDL)
- ...
- Size of the function class (structural risk minimization)
- VC dimension (next week)

Number of Parameters of the Model

- Constant classifier: g(x) = C
  O(1) parameters (1 in total)
- Nearest mean classifier: g(x) = x⊤(m₁ − m₂) + C
  O(d) parameters (2d + 1 in total)
- Fisher vector classifier (on top of k PCA components): g(x) = PCA(x)⊤ S_W⁻¹ (m₁ − m₂) + C
  O(kd) parameters (kd + k² + 2k + 1 in total)
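A minimal sketch of the nearest mean classifier from this list; the choice of C that places the decision boundary at the midpoint between the two class means is an assumption of the sketch, not specified on the slide:

```python
import numpy as np

def fit_nearest_mean(X1, X2):
    """Fit g(x) = x.(m1 - m2) + C from two classes of d-dimensional points."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C = -(m1 @ m1 - m2 @ m2) / 2       # boundary through the midpoint (assumption)
    return m1, m2, C                   # 2d + 1 stored numbers in total

def predict(params, x):
    m1, m2, C = params
    return np.sign(x @ (m1 - m2) + C)  # +1: class 1, -1: class 2

# toy usage with d = 2
rng = np.random.default_rng(0)
X1 = rng.normal([+2.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([-2.0, 0.0], 1.0, size=(50, 2))
params = fit_nearest_mean(X1, X2)
print(predict(params, np.array([[1.5, 0.3], [-1.8, 0.1]])))
```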

Number of Parameters of the Model
Counter-example
The two-parameter model g(x) = a·sin(ωx) can fit almost any finite dataset in ℝ by setting very large values for ω.

However, it is also clear that the model will not generalize well.

By only counting the number of parameters in the model, we have not specified the range of values the parameter ω is allowed to take in practice.
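A quick numerical check of this counter-example; a minimal sketch assuming a = 1, labels in {−1, +1}, and a coarse grid search over ω:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=6)      # six arbitrary inputs
y = rng.choice([-1.0, 1.0], size=6)    # six arbitrary binary labels

# Search for a single frequency omega such that sign(sin(omega * x)) = y
# at every point, i.e. the labels are reproduced exactly (with a = 1).
for omega in np.arange(1.0, 1e6, 0.5):
    if np.all(np.sign(np.sin(omega * x)) == y):
        print(f"omega = {omega} reproduces all {len(x)} labels exactly")
        break
```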

Structural Risk Minimization (Vapnik and Chervonenkis, 1974)
Idea: Structure your space of solutions into a nesting of increasingly larger regions. If two solutions fit the data, prefer the solution that belongs to the smaller region.
Example:
Assuming ℝ^d is the whole solution space for θ, build a sequence of real numbers C₁ < C₂ < ⋯ < C_N, and create a nested collection (S₁, S₂, …, S_N) of regions, where S_n = { θ ∈ ℝ^d : ‖θ‖² ≤ C_n }.

Connection to Regularization

Example (cont.): We optimize for multiple C_n the objective

min_θ Σ_{i=1}^N ℓ(y_i, f(x_i))   s.t.  ‖θ‖² ≤ C_n        (the sum is E_empirical)

and discard solutions with index n larger than necessary to fit the data. This objective can be equivalently formulated as

min_θ Σ_{i=1}^N ℓ(y_i, f(x_i)) + λ_n ‖θ‖²                (E_empirical + E_reg)

with appropriate (λ_n)_n. This objective is known in various contexts as L2 regularization, ridge regression, large margin, weight decay, etc.

From Occam's Razor to Popper

Occam's Razor: "Entities must not be multiplied beyond necessity."

Falsifiability / prediction strength (S. Hawking, after K. Popper): "[a good model] must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."

In other words, the model with lowest generalization error is preferable.

Figure: test error vs. number of assumptions, marking the best model according to Popper and the models that should be discarded according to Ockham's razor.

The Holdout Selection Procedure

Idea: Predict out-of-sample error by splitting the data randomly in two parts (one for training, and one for estimating the error of the model).

Figure: the dataset is split into training data and holdout data; each candidate model is trained on the training data, its error is measured on the holdout data, and the model with the smallest holdout error (argmin) is retained.

Cross-Validation (k-Fold Procedure)

To improve error estimates without consuming too much data, the process of error estimation can be improved by computing an average over different splits:

Figure: the dataset is partitioned into k parts; the holdout selection procedure is repeated with each part serving once as holdout data, and the resulting error estimates are averaged.

The Cross-Validation Procedure

Advantages: The model can now be selected directly based on simulated future observations.

Limitations:
- For a small number of folds k, the training data is reduced significantly, which may lead to less accurate models.
- For large k, the procedure becomes computationally costly.
- This technique assumes that the available data is representative of the future observations (not always true!).

The Problem of Dataset Bias

Figure: training data and test data, with a Fisher discriminant and a nearest mean classifier.

This effect can cause the cross-validation procedure to not work well, even when we have enough training data.

The Problem of Dataset Bias

Observation: The classifier has exploited a spurious correlation between images of the class "horse" and the presence of a copyright tag in the bottom left corner of the horse images. Cross-validation doesn't help here, because the spurious correlation would also be present in the validation set.

Further reading: Lapuschkin et al. (2019), "Unmasking Clever Hans predictors and assessing what machines really learn".

Part II. Bias-Variance Analysis of ML Models

Machine Learning Models

Machine learning models are learned from the data to approximate some truth θ.

A learning machine can be abstracted as a function that maps a dataset D to an estimator θ̂ of the truth θ.

ML Models and Prediction Error

A good learning machine is one that produces an estimator θ̂ close to the truth θ. Closeness to the truth can be measured by some error function, e.g. the square error:

Error(θ̂) = (θ − θ̂)².
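This error can be measured empirically by repeating the experiment many times; a minimal simulation sketch, assuming Gaussian data with known mean θ and the sample mean as the estimator θ̂ (the constants are illustrative), which also anticipates the bias, variance, and MSE defined next:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N = 3.0, 2.0, 25      # true mean, noise level, sample size

# Repeatedly draw a dataset D, compute theta_hat = mean(D), and collect
# the estimates to measure how far they fall from the truth on average.
estimates = np.array([rng.normal(theta, sigma, N).mean() for _ in range(10_000)])

bias = estimates.mean() - theta
var = estimates.var()
mse = ((estimates - theta) ** 2).mean()
print(f"bias^2 + var = {bias**2 + var:.4f}   MSE = {mse:.4f}   sigma^2/N = {sigma**2 / N}")
```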
Bias, Variance, and MSE of an Estimator

Parametric estimation: θ is a value in ℝ^h, and θ̂ is a function of the data D = {X₁, …, X_N}, where the X_i are random variables producing the data points.

Statistics of the estimator:

Bias(θ̂) = E[θ̂ − θ]           (measures the expected deviation of the estimator's mean from the truth)
Var(θ̂) = E[(θ̂ − E[θ̂])²]      (measures the scatter around the estimator's mean)
MSE(θ̂) = E[(θ̂ − θ)²]         (measures the prediction error)

Note: for θ ∈ ℝ^h, we use the notation (·)² = ‖·‖².

Visualizing Bias and Variance

Figure: parameter estimators scattered around the true parameter, illustrating the combinations of low/high bias and low/high variance.

Bias-Variance Decomposition

Bias(θ̂) = E[θ̂ − θ],   Var(θ̂) = E[(θ̂ − E[θ̂])²],   MSE(θ̂) = E[(θ̂ − θ)²]

We can show that MSE(θ̂) = Bias(θ̂)² + Var(θ̂).

Visualizing Bias and Variance

Figure: prediction error vs. model complexity, decomposed into a bias term and a variance term; simple models have low variance and high bias, complex models have low bias and high variance.

Example: Parameters of a Gaussian

The natural estimator of the mean, μ̂ = (1/N) Σ_{i=1}^N X_i, decomposes to

Bias(μ̂) = 0   and   Var(μ̂) = σ²/N.

The James-Stein Estimator

Estimator of Functions

Bias-Variance Analysis of the Function Estimator (locally)

Summary

- Occam's Razor: Given two models with the same training error, the simpler one should be preferred.
- Popper's View: How to make sure that a model predicts well? By testing it on out-of-sample data. Out-of-sample data can be simulated by applying a k-fold cross-validation procedure.
- Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. The best models often result from some tradeoff between the two terms.
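To close, a minimal k-fold cross-validation sketch of the selection procedure summarized above, assuming numpy polynomial fits of different degrees as the candidate models and a synthetic dataset (both illustrative):

```python
import numpy as np

def k_fold_cv_error(x, y, fit, predict, k=5, seed=0):
    """Average held-out squared error over k folds of a random partition."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        holdout = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])                      # train on k-1 parts
        errors.append(np.mean((predict(model, x[holdout]) - y[holdout]) ** 2))
    return float(np.mean(errors))                            # average over splits

# Candidate models: polynomials of increasing complexity.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.2 * rng.normal(size=60)
for degree in (1, 3, 9):
    fit = lambda xs, ys, d=degree: np.polyfit(xs, ys, d)
    predict = lambda coeffs, xs: np.polyval(coeffs, xs)
    print(degree, k_fold_cv_error(x, y, fit, predict))
```

The degree with the lowest averaged holdout error is the one that would be retained.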
