
APS1070
Linear Regression Project 4

Please fill out the following:
Your name:
Your student number:

How to use this Notebook

1. Run the cells to get a bird's-eye view of what's going on.
2. Interact with the code to better understand how it works: add cells, plot values, change parameters, etc.
3. (Optional) Replicate the Linear Regression code in a blank notebook, without viewing this code. This is called retrieval practice.
4. Complete the exercises.

Linear Regression Introduction

Linear Regression

In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. Let X be the independent variable and Y be the dependent variable. We will define a linear relationship between these two variables as follows:

$y = mX + c$

This is the equation for a line that you studied in high school. m is the slope of the line and c is the y intercept. Today we will use this equation to train our model with a given dataset and predict the value of Y for any given value of X. Our challenge today is to determine the value of m and c, such that the line corresponding to those values is the best fitting line or gives the minimum error.

Are we always fitting a line to the data? What if we have more than 1 feature (independent variable)?

The vectorized form of the above equation is written as $y = Xw$, where $y$ and $w$ are vectors while $X$ is a matrix.
Where is the $c$ term? It is absorbed into the $X$ matrix as a column of ones (see the sketch below).
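For example, a minimal sketch (added for illustration, not part of the original notebook) of how the intercept is absorbed:

import numpy as np

X_raw = np.array([[1.0], [2.0], [3.0]])                # one feature, three samples
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # prepend a column of ones
w = np.array([4.0, 2.0])                               # w[0] plays the role of c, w[1] of m
y = X @ w                                              # identical to m*X_raw + c, flattened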

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

$y = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

where
$y$ is the predicted value ($h_\theta(x)$),
$\theta_0$ is the bias term,
$\theta_1, \dots, \theta_n$ are the model parameters, and
$x_1, x_2, \dots, x_n$ are the feature values.

LR Procedure

Our loss function for Linear Regression is the sum of squares, which gives the cost function:

$E = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}_i)^2$

Here $y_i$ is the actual value and $\bar{y}_i$ is the predicted value. Let's substitute the value of $\bar{y}_i$ from $y = mX + c$:

$E = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (mx_i + c)\big)^2$

The partial derivative of the above equation with respect to $m$ is denoted $D_m$:

$D_m = \frac{-2}{n}\sum_{i=1}^{n} x_i\,(y_i - \bar{y}_i)$

while the partial derivative with respect to $c$ is denoted $D_c$:

$D_c = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \bar{y}_i)$

Analytical Approach for Linear Regression
Can you derive the Analytical Solution for Linear Regression?

We arrive at the analytical solution when we set the partial derivatives with respect to the parameters to zero. Why?
This is because at the point where the cost function is at its minimum with respect to the parameters (m and c), the derivatives of the cost function with respect to those parameters are zero.

$D_m=0$ and $D_c=0$

The solution for the general case comes out to be:

$w = (X^TX)^{-1}X^Ty$

With $L_2$ regularization, the cost becomes:

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$

and the corresponding solution is:

$w = (X^TX + \lambda I)^{-1}X^Ty$
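A minimal sketch of this closed-form solution (illustrative, not the notebook's own code; in practice one solves the linear system rather than forming an explicit inverse):

import numpy as np

def ridge_closed_form(X, y, lam):
    # w = (X^T X + lam*I)^(-1) X^T y, computed via np.linalg.solve for stability
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

Strictly speaking the bias column is often left unpenalized; the simple form above penalizes every weight, matching the formula shown.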

Gradient Descent Solution

Gradient descent uses the gradients derived above to determine the direction in which we nudge the values of our parameters m and c:

$m = m - L \cdot D_m$
$c = c - L \cdot D_c$

Here the term $L$ is the learning rate.
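A toy sketch of these updates for the single-feature model y = mx + c (names and defaults are illustrative, not the notebook's code):

import numpy as np

def gd_line_fit(x, y, L=0.01, epochs=1000):
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        yPred = m * x + c
        D_m = (-2.0 / n) * np.sum(x * (y - yPred))   # dE/dm
        D_c = (-2.0 / n) * np.sum(y - yPred)         # dE/dc
        m = m - L * D_m
        c = c - L * D_c
    return m, c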

Any other solution methods?

Some other solution methods are QR decomposition and singular value decomposition (SVD); a sketch of both follows after the questions below.
Do these methods give exact solutions?
Do these methods scale well with large datasets?
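As a rough illustration (an assumed example, not part of the original notebook), both can be reached through NumPy; np.linalg.lstsq uses an SVD internally:

import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # bias column plus one feature
y = np.array([4.0, 6.0, 8.0])                        # exactly y = 4 + 2x

# SVD-based least squares
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# QR decomposition: X = QR, then solve the triangular system R w = Q^T y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

Both give exact least-squares solutions; their cost grows with the size of X, which is one reason iterative methods like gradient descent become attractive for very large datasets.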

Let's Start

Single Feature, 5 data points.

Analytical Approach in Code
Let's get our hands dirty.
In[0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
In[0]:
n = 5 # number of points
m = 2 # slope of line
c = 4 # y intercept of line
r = 2 # range of data on x-axis

np.random.seed(2)
x = r * np.random.rand(n)
x.sort()
yPerf = m*x + c # perfect world: no noise
y = m*x + c + np.random.randn(n) # imperfect world: we get noisy data
In[0]:
x
Out[0]:
array([0.05185246, 0.8407356 , 0.87064479, 0.8719898 , 1.09932496])

Spoiler Alert: We are in an Imperfect World.
In[0]:
def rmse(yPred, y):
    return np.sqrt(mean_squared_error(yPred, y))
In[0]:
plt.plot(x, y, 'o', label='Data')
plt.plot(x, yPerf, 'o-', label='Underlying Distribution')

plt.legend()
plt.show()

In[0]:
x = np.vstack((np.ones(np.shape(x)), x)).T
y = y.reshape(-1, 1)
In[0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

# prediction
yPred = np.dot(x, w)
In[0]:
w[0], w[1]
Out[0]:
(array([2.67741212]), array([3.35314632]))

How similar are these to the values we set initially?
Would they be the same as those set initially if there were no noise? (A quick check follows below.)
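A quick check (added for illustration, not an original cell): fitting the noiseless targets recovers the true parameters exactly, up to floating-point error.

wNoiseless = np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, yPerf.reshape(-1, 1)))
print(wNoiseless)  # approximately [[4.], [2.]] -- the c and m we set above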
In[0]:
plt.plot(x[:, 1], y, 'o', label='Data')
plt.plot(x[:, 1], yPerf, 'o-', label='Underlying Distribution')
plt.plot(x[:, 1], yPred, 'o-', label='Predicted')

plt.legend()
plt.show()

In[0]:
print("RMSE: ", rmse(yPred, y))

RMSE:  0.8321853483360432

Complex Dependent Variable
It's complicated.
In[0]:
n = 20  # number of points
m = 3   # slope of line
c = 10  # y intercept of line
r = 2   # range of data on x-axis

np.random.seed(10)
X = xD = r * np.random.randn(n)  # points also stored in xD (xData); will be useful later
X.sort()
yPerf = X - 2 * (X ** 2) + 0.5 * (X ** 3)
np.random.seed(10)
y = X - 2 * (X ** 2) + 0.5 * (X ** 3) + np.random.normal(0, 5, n)  # imperfect world: we get noisy data
In[0]:
plt.plot(X, y, 'o', label='Data')
plt.plot(X, yPerf, 'o-', label='Underlying Distribution')

plt.legend()
plt.show()

In[0]:
X = np.vstack((np.ones(np.shape(X)), X)).T
y = y.reshape(-1, 1)
In[0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

# prediction
yPredLinear = yPred = np.dot(X, w)

w
Out[0]:
array([[-5.45917353],
[ 3.82202654]])
In[0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')

plt.legend()
plt.show()

In[0]:
print("RMSE: ", rmse(yPred, y))

RMSE:  6.839531223681816

What to do next?
Can we add more features?
In[0]:
X = np.vstack((X.T, xD**2, xD**3)).T
In[0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

# prediction
yPred4Feature = yPred = np.dot(X, w)
In[0]:
w
Out[0]:
array([[ 0.04177438],
[ 2.58652514],
[-1.87142105],
[ 0.19655185]])
In[0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')

plt.legend()
plt.show()

In[0]:
print("RMSE: ", rmse(yPred, y))

RMSE:  3.929493007090521

But how do we know when to stop? We do not know in advance how many features to keep adding to X.
In[0]:
X = np.vstack((X.T, xD**4, xD**5, xD**6, xD**7)).T
In[0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

# prediction
yPred8Feature = yPred = np.dot(X, w)
In[0]:
w
Out[0]:
array([[-2.64611130e-01],
[ 8.28612458e+00],
[-2.07703327e+00],
[-6.35661778e+00],
[-1.30720676e-02],
[ 1.64293567e+00],
[-2.08380043e-03],
[-1.09229711e-01]])
In[0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')

plt.legend()
plt.show()

In[0]:
print("RMSE: ", rmse(yPred, y))

RMSE:  3.3097843019929725

Even though the RMSE here (with many more features) is lower than in the previous case, we can see this predicted line is quite far from the underlying distribution. This is called overfitting.

When does a model overfit? When we have many features or comparatively little data for the model.
What exactly is overfitting? The model pays too much attention to the noise in the data provided, in a sense rote-memorizing everything without generalizing.
Since we don't know when to stop adding features, what can be done?
The solution is to work with a model or feature set that can slightly overfit your data, and then use techniques to prevent that overfitting from happening. The alternative gives us underfitting, which we cannot fix unless we modify the feature set or model.
What options do we have to prevent overfitting? There are many; the most widely used are:
Using a validation set (see the sketch after this list)
Regularization: add a penalty on the weights
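A minimal hold-out sketch (an assumed workflow for illustration only; sklearn's Ridge stands in for the hand-rolled solution used above):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
best_lam, best_err = None, np.inf
for lam in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)       # fit on the training split only
    err = rmse(model.predict(X_va), y_va)          # score on the held-out split
    if err < best_err:
        best_lam, best_err = lam, err
print(best_lam, best_err)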
In[0]:
lam = 10  # lambda: the regularization parameter
f = 8     # number of features
In[0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X) + lam * np.identity(f)), np.dot(X.T, y))

# prediction
yPred8FeatRegu = yPred = np.dot(X, w)
w
Out[0]:
array([[ 0.21326186],
[ 0.40727946],
[-0.46745243],
[-0.31891582],
[-0.51080301],
[ 0.38864889],
[ 0.03613141],
[-0.03367748]])
In[0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')

plt.legend()
plt.show()

Can we know from the plot whether the value of lambda is optimal?
Somewhat, but not exactly.
To find a good value of lambda you need to split the dataset into training and test sets, then cycle over multiple values of lambda; the optimal one is the value that gives the lowest test error.
What does low test error represent?

All models together:
In[0]:
plt.figure(figsize=(12, 8))
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPredLinear, 'o-', label='2 Feature')
plt.plot(xD, yPred4Feature, 'o-', label='4 Feature')
plt.plot(xD, yPred8Feature, 'o-', label='8 Feature')
plt.plot(xD, yPred8FeatRegu, 'o-', label='8 Feature Regularized')

plt.legend()
plt.show()

Further Complex Multivariate Data

We will take the Boston Housing dataset, which contains information about different houses in Boston. This data was originally part of the UCI Machine Learning Repository and has since been removed from it. We can also access this data from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features.
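Note: load_boston was removed from scikit-learn in version 1.2. If your installed version no longer provides it, one commonly used workaround (adapted from scikit-learn's own deprecation notice; treat the exact snippet as an assumption) is to read the raw StatLib file directly:

import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # 13 feature columns
target = raw_df.values[1::2, 2]                                      # MEDV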
In[0]:
from sklearn.datasets import load_boston
dataset = load_boston()
In[0]:
print(dataset.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
In[0]:
print(dataset.DESCR)

.. _boston_dataset:

Boston house prices dataset

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

Belsley, Kuh & Welsch, Regression diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, 1980. 244-261.
Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In[0]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
Out[0]:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

We can see that the target value, named MEDV, is missing from the dataframe. It is present in dataset.target. Let's add it to the df.
In[0]:
df['MEDV'] = dataset.target

Are there any null values in the dataset?
In[0]:
df.isnull().sum()
Out[0]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Even though there are none, it's good practice to check.

Exploratory Data Analysis

Here we use the distplot method from seaborn, another popular plotting library for Python.
In[0]:
sns.set(rc={'figure.figsize': (12, 9)})
sns.distplot(df['MEDV'], bins=30)
plt.show()

We can see that it is mostly normally distributed, with a few outliers.

Linear Regression Assignment

We'll work with the above dataset for the assignment below. Create a blank notebook for the assignment.

Question 1

1. Normalize the data using StandardScaler from sklearn. [0.5]
2. Split the multivariate data into training/validation data (20% validation data) and plot train error and validation error as a function of the lambda regularizer. [1]
3. What is the best lambda, where the best fit occurs? Use this best lambda from here on out. [Note: You may be surprised by the best lambda here; this won't be the case for all datasets.] [1]
In[0]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
In[0]:
#load the data
dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
In[3]:
df.head()  # 506 rows x 13 columns
Out[3]:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
In[0]:
df['MEDV'] = dataset.target
In[0]:
X = df.iloc[:, :-1]
scale = StandardScaler().fit_transform(X)
y = np.array(dataset.target)
bias = np.ones((X.shape[0], 1))
X_std = np.concatenate((bias, scale), axis=1)  # 506 x 14
In[0]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X_std, y, test_size=0.2, random_state=1)

def rmse(yPred, y):
    return np.sqrt(mean_squared_error(yPred, y))
In[7]:
from sklearn.model_selection import train_test_split

train_rmse = []
val_rmse = []
s = np.arange(0, 100, 0.1)  # candidate lambda values; e.g. [0, 0.01, 0.1, 1, 5, 10, 20, 100]

for lam in s:
    f = x_train.shape[1]

    # regularized analytical solution on the training split
    w = np.dot(np.linalg.inv(np.dot(x_train.T, x_train) + lam * np.identity(f)),
               np.dot(x_train.T, y_train))

    y_train_pred = np.dot(x_train, w)
    y_val_pred = np.dot(x_val, w)
    train_rmse.append(rmse(y_train_pred, y_train))
    val_rmse.append(rmse(y_val_pred, y_val))

plt.figure(figsize=(12, 8))
plt.plot(s, train_rmse, label='train_rmses')
plt.plot(s, val_rmse, label='test_rmses')
plt.legend()
plt.show()

Thus, the best lambda is 0.

Question 2

1. Derive the gradient of the $L_2$ regularized objective for Linear Regression, $\frac{\partial}{\partial \theta_j}J(\theta)$. Show your work (either by using in-line equations like we do here, or by taking a picture of your work and embedding it in the notebook). [1]

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^n\theta_j^2\right]$, where $h_\theta(x^{(i)}) = \theta^T x^{(i)}$

$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{2m}\left[2\sum_{i=1}^m\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + 2\lambda\theta_j\right] = \frac{1}{m}\sum_{i=1}^m\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j$

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial \theta_j}J(\theta) = \left(1 - \frac{\alpha\lambda}{m}\right)\theta_j - \frac{\alpha}{m}\sum_{i=1}^m\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$

Question 3

1. Implement batch gradient descent for linear regression using a fixed learning rate $\eta$ of 1e-2. [1.5]
2. Run the gradient descent for a fixed number of epochs (based on convergence) and verify that the results closely approximate the matrix solution with the same lambda. [1]
In[25]:
def gradient_descent(alpha, lam, x, y, numIterations):
    m = x.shape[0]  # number of samples
    theta = np.zeros(14)
    b = 0
    x_transpose = x.transpose()
    costplot = []
    for iter in range(0, numIterations):
        hypothesis = np.dot(x, theta) + b
        loss = hypothesis - y
        cost = (1 / (2 * m)) * loss.T @ loss
        costplot.append(cost)
        gradient = np.dot(x_transpose, loss) / m + (lam / m) * theta  # regularized gradient (lam = 0 here)
        theta = theta - alpha * gradient  # update
        if iter % 1000 == 0:
            print("Epoch:", int(iter / 1000 + 1))
            print("W:", theta)
            print("cost:", cost)
            print()

    return theta, costplot

alpha = 0.01

theta_, costplot = gradient_descent(alpha, 0, X_std, y, 10000)
plt.figure(figsize=(12, 8))
plt.plot(range(10000), costplot, 'o-', label='cost')
plt.legend()
plt.show()

Epoch: 1
W: [ 0.22532806 -0.03567747  0.03311776 -0.04444472  0.01610293 -0.03926228
  0.06388975 -0.03463463  0.02296348 -0.03506386 -0.04304914 -0.0466555
  0.03063842 -0.06777654]
cost: 296.07345849802374

Epoch: 2
W: [22.53184328 -0.78118429  0.81225774 -0.27296022  0.742528   -1.57768576
  2.88911968 -0.10337755 -2.74309864  1.46066488 -0.88375756 -1.95231035
  0.87219238 -3.64295466]
cost: 11.082223583143334

Epoch: 3
W: [22.53280628 -0.87243548  0.98512752 -0.09431146  0.71500795 -1.93870211
  2.73741056 -0.02849154 -3.05537045  2.0495178  -1.41256803 -2.02454326
  0.85005571 -3.71537434]
cost: 10.97700639802383

Epoch: 4
W: [ 2.25328063e+01 -9.02897407e-01  1.03782470e+00  1.47606690e-02
  6.99636589e-01 -2.02039368e+00  2.70038833e+00 -2.10039626e-03
 -3.10194293e+00  2.34192952e+00 -1.71869388e+00 -2.04599478e+00
  8.47868812e-01 -3.73080258e+00]
cost: 10.955463258630404

Epoch: 5
W: [ 2.25328063e+01 -9.15536228e-01  1.05954170e+00  7.40231518e-02
  6.91270023e-01 -2.04242336e+00  2.68691330e+00  8.77783084e-03
 -3.10695830e+00  2.49331793e+00 -1.88588052e+00 -2.05381422e+00
  8.48227952e-01 -3.73705349e+00]
cost: 10.94966597972041

Epoch: 6
W: [ 2.25328063e+01 -9.21601950e-01  1.07008302e+00  1.05492362e-01
  6.86797232e-01 -2.05007088e+00  2.68075049e+00  1.39480346e-02
 -3.10628176e+00  2.57287969e+00 -1.97541927e+00 -2.05719928e+00
  8.48665728e-01 -3.74017178e+00]
cost: 10.94804692596907

Epoch: 7
W: [ 2.25328063e+01 -9.24703373e-01  1.07551394e+00  1.22150174e-01
  6.84420574e-01 -2.05336878e+00  2.67764967e+00  1.65713257e-02
 -3.10535030e+00  2.61491388e+00 -2.02303919e+00 -2.05883888e+00
  8.48939998e-01 -3.74180080e+00]
cost: 10.947592776318182

Epoch: 8
W: [ 2.25328063e+01 -9.26326519e-01  1.07836602e+00  1.30968866e-01
  6.83160242e-01 -2.05497568e+00  2.67603552e+00  1.79381510e-02
 -3.10475733e+00  2.63716174e+00 -2.04830200e+00 -2.05967731e+00
  8.49092780e-01 -3.74266038e+00]
cost: 10.947465320555178

Epoch: 9
W: [ 2.25328063e+01 -9.27182880e-01  1.07987281e+00  1.35639165e-01
  6.82492331e-01 -2.05580098e+00  2.67518548e+00  1.86577213e-02
 -3.10442576e+00  2.64894438e+00 -2.06169223e+00 -2.06011567e+00
  8.49175068e-01 -3.74311513e+00]
cost: 10.947429548304674

Epoch: 10
W: [ 2.25328063e+01 -9.27635927e-01  1.08067036e+00  1.38113015e-01
  6.82138448e-01 -2.05623339e+00  2.67473606e+00  1.90380404e-02
 -3.10424703e+00  2.65518590e+00 -2.06878732e+00 -2.06034680e+00
  8.49218908e-01 -3.74335592e+00]
cost: 10.947419508247867

Question 4

1. Plot training error performance vs. number of epochs as a function of learning rate $\eta$ for the range of values {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1}, with each $\eta$ being a different line. [1]
For very low $\eta$, gradient descent should be slow to converge. For very high $\eta$, the error might be very unstable (i.e., not decreasing on every epoch).
In[40]:
def gradient_descentQ4(alpha, lam, x, y, numIterations):
    m = x.shape[0]  # number of samples
    theta = np.zeros(14)
    b = 0
    x_transpose = x.transpose()
    costplot = []
    for iter in range(0, numIterations):
        hypothesis = np.dot(x, theta) + b
        loss = hypothesis - y
        cost = (1 / (2 * m)) * loss.T @ loss
        costplot.append(cost)
        gradient = np.dot(x_transpose, loss) / m + (lam / m) * theta  # regularized gradient (lam = 0 here)
        theta = theta - alpha * gradient  # update

    return theta, costplot

theta1, error1 = gradient_descentQ4(1e-6, 0, x_train, y_train, 10000)
theta2, error2 = gradient_descentQ4(1e-5, 0, x_train, y_train, 10000)
theta3, error3 = gradient_descentQ4(1e-4, 0, x_train, y_train, 10000)
theta4, error4 = gradient_descentQ4(1e-3, 0, x_train, y_train, 10000)
theta5, error5 = gradient_descentQ4(1e-2, 0, x_train, y_train, 10000)
theta6, error6 = gradient_descentQ4(1e-1, 0, x_train, y_train, 10000)

plt.figure(figsize=(12, 8))
plt.plot(range(10000), error1, '-')
plt.plot(range(10000), error2, '-')
plt.plot(range(10000), error3, '-')
plt.plot(range(10000), error4, '-')
plt.plot(range(10000), error5, '-')
plt.plot(range(10000), error6, '-')
plt.legend(['1e-6', '1e-5', '1e-4', '1e-3', '1e-2', '1e-1'], loc='upper right')
plt.show()

Question 5

Implement a mini-batch version with a batch-size constant B. When B=1, it is stochastic gradient descent. When B = #train data points, it is full-batch. Anything in between is mini-batch. (A sketch of one possible implementation follows after this list.)
1. Choose B=1, B=16, B=128, B=256, and B=#data points, and plot training error as a function of # of gradient updates, and also separately as a function of wall-clock time, for each value of B. [2]
2. All lines should be on the same plot. Which B leads to the fastest convergence in terms of # gradient updates and in terms of wall-clock time? [1]
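A sketch of one possible mini-batch implementation (illustrative only; names, defaults, and the regularization term are assumptions, not the required solution):

import time
import numpy as np

def minibatch_gd(X, y, lam=0.0, eta=1e-2, B=16, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    errors, times = [], []          # train RMSE and wall-clock time recorded per gradient update
    start = time.time()
    for _ in range(epochs):
        idx = rng.permutation(n)    # reshuffle the data each epoch
        for s in range(0, n, B):
            Xb, yb = X[idx[s:s + B]], y[idx[s:s + B]]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb) + (lam / n) * theta
            theta = theta - eta * grad
            errors.append(np.sqrt(np.mean((X @ theta - y) ** 2)))
            times.append(time.time() - start)
    return theta, errors, times

B=1 gives stochastic gradient descent and B=X.shape[0] gives full-batch; plotting errors against range(len(errors)) and against times gives the two requested views.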

Question 6

1. Switch to absolute error for the loss but still use $L_2$ regularization. Derive the gradient for this case, $\frac{\partial}{\partial \theta_j}J(\theta)$. (A hedged starting point is sketched after this list.) [1.5]
2. Tune $\eta$ and B to obtain good performance (fast convergence to minimum error, but not noisy). Is it easier to train with squared error or absolute error, and why? Show some plots that help you demonstrate this point and explain. [1]
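A hedged starting point for the derivation in part 1 (one common scaling of the objective; the constant factors depend on how you define J): with $J(\theta) = \frac{1}{m}\sum_{i=1}^m\left|h_\theta(x^{(i)}) - y^{(i)}\right| + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$, using $\mathrm{sign}(\cdot)$ as the (sub)gradient of the absolute value gives

$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^m \mathrm{sign}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)} + \frac{\lambda}{m}\theta_j$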

Question 7

1. Does the model perform better if polynomial features are added? (Use sklearn.preprocessing.PolynomialFeatures to test it out.)
2. Does the optimal value of lambda (the regularization constant) change with this new set of features added? (A sketch of one way to set this up follows below.)
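One way to set this up (a sketch under the assumption that scale, y, and train_test_split are already defined as in the Question 1 cells above):

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(scale)                                        # standardized features plus degree-2 terms
X_poly = np.concatenate((np.ones((X_poly.shape[0], 1)), X_poly), axis=1)  # re-attach the bias column
x_train_p, x_val_p, y_train_p, y_val_p = train_test_split(X_poly, y, test_size=0.2, random_state=1)
# ...then repeat the Question 1 lambda sweep on x_train_p / x_val_p and compare the curves.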

References

Linear Regression:
1. https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
Gradient Descent (images and equations):
1. https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
2. https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2
Author: Parth Jaggi
