Outline
- Model Selection
- Occam's Razor
- Quantifying Model Complexity
- Popper's Prediction Strength
- Cross-Validation
- Dataset Bias and the 'Clever Hans' Effect
- Bias-Variance Analysis

Learning a Model of the Data
Assuming some observations from some unknown true data distribution p(x,y), we would like to find a model θ such that some distance D(p, p_θ) is minimized. For regression tasks, we can consider the simpler objective:

    min_θ  ∫ ( E_p̂[y|x] − f_θ(x) )² · p̂(x) · dx

where p̂ is some guess of the true p, e.g. the empirical distribution.

[Figure: 1. observations → 2. fit a model → 3. make predictions]
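A minimal sketch of this pipeline (observations, fit, predict), assuming a polynomial model family fitted by least squares; the data-generating process and all numbers here are illustrative, not part of the lecture:

```python
import numpy as np

# Approximate min_theta E_phat[(y - f_theta(x))^2] with the empirical
# distribution: ordinary least squares over polynomial features.
rng = np.random.default_rng(0)

# 1. observations from some unknown p(x, y) (illustrative choice)
x = rng.uniform(-1.0, 1.0, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

# 2. fit a model: f_theta is a cubic polynomial, theta its coefficients
degree = 3
X = np.vander(x, degree + 1)                    # polynomial feature map
theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # empirical risk minimizer

# 3. make predictions
def f(x_new):
    return np.vander(np.atleast_1d(x_new), degree + 1) @ theta

train_mse = float(np.mean((f(x) - y) ** 2))     # empirical squared error
```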

Model Selection
Questions:
1. Among models that correctly predict the data, which one should be retained?
2. Should we always choose a model that perfectly fits the data?

Occam’s Razor
William of Ockham (1287–1347)
“Entia non sunt multiplicanda praeter necessitatem”
English translation
“Entities must not be multiplied beyond necessity.”

Interpreting Occam’s Razor
What advantages do we expect from a model based on few assumptions?
- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because simplicity is desirable in itself.
- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because it is likely to have lower generalization error.
Further reading:
Domingos (1998), Occam's Two Razors: The Sharp and the Blunt.

Occam’s Razor for Model Selection
[Figure: training error as a function of the number of assumptions; models with more assumptions than needed to fit the data should be discarded according to Ockham's razor.]

How to Quantify “Few Assumptions”?
Many possibilities:

- Number of free parameters of the model
- Minimum description length (MDL)
- Size of function class (structural risk minimization)
- VC-dimension (next week)

Number of Parameters of the Model
Constant classifier:

    g(x) = C                                  O(1) parameters  (1)

Nearest mean classifier:

    g(x) = x^T (m1 − m2) + C                  O(d) parameters  (2d for the means, 1 for the bias)

Fisher vector classifier (on top of k PCA components):

    g(x) = PCA(x)^T S_W^{-1} (m1 − m2) + C    O(k·d) parameters  (k·d for the projection, k² for S_W, 2k for the means, 1 for the bias)
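As an illustration of the parameter counts above, here is a minimal sketch of the nearest mean classifier; the data, dimension, and class locations are made up for the example:

```python
import numpy as np

# Sketch of the nearest mean classifier g(x) = x^T(m1 - m2) + C.
# Its free parameters are the two class means (2d numbers) and the
# bias C (1 number), i.e. O(d) parameters overall.
rng = np.random.default_rng(0)
d = 5
X1 = rng.normal(loc=+1.0, size=(20, d))   # class 1 samples (illustrative)
X2 = rng.normal(loc=-1.0, size=(20, d))   # class 2 samples (illustrative)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
w = m1 - m2
C = -0.5 * (m1 @ m1 - m2 @ m2)            # places the boundary midway between the means

def g(x):
    # positive -> class 1, negative -> class 2
    return x @ w + C

n_parameters = 2 * d + 1                  # matches the O(d) count on the slide
```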

Number of Parameters of the Model
Counter-example
- The two-parameter model g(x) = a·sin(ωx) can fit almost any finite dataset in R by setting very large values for ω.
- However, it is also clear that such a model will not generalize well.

By only counting the number of parameters of the model, we have not specified the range of values the parameter ω is allowed to take in practice.
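A small numerical check of this counter-example (the points, labels, and grid are chosen for illustration): a crude grid search over ω finds a frequency at which sign(sin(ωx)) with a = 1 reproduces an alternating labeling exactly, even though nothing constrains the model's behavior between the points:

```python
import numpy as np

# Four 1-D points with alternating labels; no low-frequency sine fits
# them, but a larger omega reproduces the labels perfectly (a = 1).
x = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([1.0, -1.0, 1.0, -1.0])

best_err, best_omega = 1.0, None
for omega in np.linspace(1.0, 1e5, 200_000):   # crude grid search over omega
    err = float(np.mean(np.sign(np.sin(omega * x)) != y))
    if err < best_err:
        best_err, best_omega = err, omega
    if best_err == 0.0:
        break
# With more points or irregular labelings, ever larger omega is needed,
# and the zero-training-error fit generalizes poorly between the points.
```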

Structural Risk Minimization
Vapnik and Chervonenkis (1974)
Idea: Structure your space of solutions into a nesting of increasingly larger regions. If two solutions fit the data, prefer the solution that belongs to the smaller region.
Example:
Assuming R^d is the whole solution space for θ, build a sequence of real numbers C1 < C2 < · · · < CN, and create a nested collection (S1, S2, …, SN) of regions, where

    Sn = { θ ∈ R^d : ‖θ‖² ≤ Cn }

Connection to Regularization

Example (cont.): We optimize for multiple Cn the objective

    min_θ  Σ_{i=1}^N ℓ(yi, f_θ(xi))    s.t.  ‖θ‖² < Cn

and discard solutions with index n larger than necessary to fit the data. This objective can be equivalently formulated as:

    min_θ  Σ_{i=1}^N ℓ(yi, f_θ(xi)) + λn ‖θ‖²

with appropriate (λn)n, where the first term is the empirical error E_empirical and the second the regularization term E_reg. This objective is known in various contexts as L2 regularization, ridge regression, large margin, weight decay, etc.

From Occam's Razor to Popper

Occam's Razor:

"Entities must not be multiplied beyond necessity."

Falsifiability/prediction strength (S. Hawking, after K. Popper):

"[a good model] must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."

In other words, the model with lowest generalization error is preferable.
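The L2-regularized (ridge) objective described in the regularization example above can be sketched for the squared loss, where it has a closed-form solution; the data here is synthetic and purely illustrative. Increasing λ shrinks ‖θ‖, i.e. it implicitly selects a smaller region Sn:

```python
import numpy as np

# Ridge regression: min_theta ||y - X theta||^2 + lam ||theta||^2
# has the closed form theta = (X^T X + lam I)^(-1) X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # illustrative ground truth
y = X @ true_theta + 0.1 * rng.normal(size=30)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam -> smaller ||theta||, mirroring the nested regions S_n.
norms = [float(np.linalg.norm(ridge(X, y, lam))) for lam in (0.0, 1.0, 100.0)]
```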
From Occam's Razor to Popper

[Figure: test error as a function of the number of assumptions; the best model according to Popper sits at the minimum of the test error curve, while models with too many assumptions should be discarded according to Ockham's razor.]

The Holdout Selection Procedure

Idea: Predict out-of-sample error by splitting the data randomly in two parts (one for training, and one for estimating the error of the model).

[Figure: candidate models are trained on the training data, their errors are measured on the holdout data, and the model with minimum holdout error is selected.]

Cross-Validation (k-Fold Procedure)

To improve error estimates without consuming too much data, the process of error estimation can be improved by computing an average over different splits:

[Figure: the dataset is partitioned into k folds; holdout selection is repeated with each fold serving once as the holdout data, and the resulting error estimates are averaged.]

The Cross-Validation Procedure

Advantages:

- The model can now be selected directly based on simulated future observations.

Limitations:

- For a small number of folds k, the training data is reduced significantly, which may lead to less accurate models. For large k, the procedure becomes computationally costly.
- This technique assumes that the available data is representative of the future observations (not always true!).

The Problem of Dataset Bias

[Figure: training and test data drawn from differently biased distributions; the Fisher discriminant and the nearest mean classifier disagree on the shifted test data.]

This effect can lead the cross-validation procedure to not work well, even when we have enough training data.

The Problem of Dataset Bias

Observation: The classifier has exploited a spurious correlation between images of the class horse and the presence of a copyright tag in the bottom left corner of the horse images. Cross-validation doesn't help here because the spurious correlation would also be present in the validation set.

Further reading: Lapuschkin et al. (2019), Unmasking Clever Hans predictors and assessing what machines really learn.

Part II.
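The holdout and k-fold procedures above can be sketched as follows, here selecting among polynomial degrees; the dataset and the candidate model family are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=60)    # illustrative data

k = 5
folds = np.array_split(rng.permutation(len(x)), k)   # fixed k-partition

def cv_error(degree):
    # Average holdout error of a degree-`degree` polynomial over the k splits.
    errs = []
    for i in range(k):
        holdout = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(np.vander(x[train], degree + 1),
                                    y[train], rcond=None)
        pred = np.vander(x[holdout], degree + 1) @ theta
        errs.append(np.mean((pred - y[holdout]) ** 2))
    return float(np.mean(errs))

# Select the model with the lowest simulated out-of-sample error.
best_degree = min(range(1, 10), key=cv_error)
```

Note that the same fixed partition is reused for every candidate model, so the compared error estimates share the same splits.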
Bias-Variance Analysis of ML Models

Machine Learning Models

Machine learning models are learned from the data to approximate some truth θ. A learning machine can be abstracted as a function that maps a dataset D to an estimator θ̂ of the truth θ.

ML Models and Prediction Error

A good learning machine is one that produces an estimator θ̂ close to the truth θ. Closeness to the truth can be measured by some error function, e.g. the square error:

    Error(θ̂) = (θ − θ̂)²

Bias, Variance, and MSE of an Estimator

Parametric estimation:

- θ is a value in R^h.
- θ̂ is a function of the data D = {X1, …, XN}, where the Xi are random variables producing the data points.

Statistics of the estimator:

    Bias(θ̂) = E[θ̂ − θ]           (measures expected deviation from the truth)
    Var(θ̂)  = E[(θ̂ − E[θ̂])²]    (measures scatter around the estimator's mean)
    MSE(θ̂)  = E[(θ̂ − θ)²]        (measures prediction error)

Note: for θ ∈ R^h, we use the notation θ² = θ^T θ.

Visualizing Bias and Variance

[Figure: true parameter vs. parameter estimates in four panels, combining low/high bias with low/high variance.]

Bias-Variance Decomposition

We can show that

    MSE(θ̂) = Bias(θ̂)² + Var(θ̂).

Visualizing Bias and Variance

[Figure: prediction error as a function of model complexity, decomposed into a decreasing bias term and an increasing variance term; low complexity gives low variance/high bias, high complexity gives low bias/high variance.]

Example: Parameters of a Gaussian

The 'natural' estimator of the mean, μ̂ = (1/N) Σ_{i=1}^N Xi, decomposes to Bias(μ̂) = 0 and Var(μ̂) = σ²/N.

The James-Stein Estimator

Estimator of Functions

Bias-Variance Analysis of the Function Estimator (locally)

Summary

- Occam's Razor: Given two models with the same training error, the simpler one should be preferred.
- Popper's View: How to make sure that a model predicts well? By testing it on out-of-sample data.
  Out-of-sample data can be simulated by applying a k-fold cross-validation procedure.
- Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. The best models often result from some tradeoff between the two terms.
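The Gaussian-mean example above (Bias(μ̂) = 0, Var(μ̂) = σ²/N, MSE = Bias² + Var) can be checked numerically; the parameter values and number of trials below are chosen for illustration:

```python
import numpy as np

# Simulate mu_hat = (1/N) sum X_i for X_i ~ N(mu, sigma^2) over many
# independent datasets, then compare against Bias = 0, Var = sigma^2/N.
rng = np.random.default_rng(0)
mu, sigma, N, trials = 2.0, 3.0, 25, 100_000

estimates = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

bias = float(estimates.mean() - mu)          # should be close to 0
var = float(estimates.var())                 # should be close to sigma^2/N = 0.36
mse = float(np.mean((estimates - mu) ** 2))  # equals bias^2 + var (sample identity)
```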
