COMP20008 Elements of Data Processing


Supervised learning Introduction
School of Computing and Information Systems
@University of Melbourne 2022


Regression vs Classification

Classification Example 1
Predicting disease from microarray data: from each patient's gene-expression profile, predict whether the patient will develop cancer within one year (class labels such as "develop cancer <1 year").

Classification Example 2
Animal classification (https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf)

Classification Example 3
Banking: classifying borrowers. Attributes: Home Owner, Marital Status, Annual Income; class label: Defaulted Borrower.

Classification Example 4
Detecting tax cheats. Ten training records (Tid 1-10) with two categorical attributes (including Marital Status), one continuous attribute (Taxable Income), and the class label Cheat = No, No, No, No, Yes, No, No, Yes, No, Yes.

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one class label.
- Find a predictive model for the class label as a function of the values of all attributes, i.e. y = f(x1, x2, ..., xn), where y is a discrete-valued target variable, x1, ..., xn are the attributes (predictors), and f is the predictive model (a tree, a rule, a mathematical formula).
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model: the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification framework
(framework diagram)

Regression: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one target variable.
- Find a predictive model for the target variable as a function of the values of all attributes, i.e. y = f(x1, x2, ..., xn), where y is a continuous target variable, x1, ..., xn are the attributes (predictors), and f is the predictive model (a tree, a rule, a mathematical formula).
- Goal: previously unseen records should be assigned a value as accurately as possible.
- A test set is used to determine the accuracy of the model, as for classification.

Regression example 1
Predicting ice-cream consumption from temperature: y = f(x)?

Regression example 2
Predicting the activity level of a target gene for person m+1.

Regression: Linear regression

Learning Objectives
- How to use linear regression analysis to predict the value of a dependent variable based on independent variables
- Make inferences about the slope and correlation coefficient
- Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated

Introduction to Regression Analysis
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain.
Independent variable: the variable used to explain the dependent variable.

Simple Linear Regression Model
- Only one independent variable, X
- The relationship between X and Y is described by a linear function
- Changes in Y are assumed to be caused by changes in X

Types of Relationships (1)
Linear relationships vs. non-linear relationships.

Types of Relationships (2)
Strong relationships, weak relationships, no relationship.
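The classification workflow defined above (build a model f on a training set, then measure accuracy on a held-out test set) can be sketched in a few lines. The tiny 1-nearest-neighbour "model" and the toy records below are illustrative assumptions, not from the lecture:

```python
def predict_1nn(train, x):
    """Predict x's class as the label of the nearest training record."""
    nearest = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest[1]

# Toy records: (attribute value, class label).
records = [(1.0, "A"), (1.2, "A"), (0.8, "A"),
           (3.0, "B"), (3.2, "B"), (2.9, "B")]

# Divide the full data set into training and test sets.
train, test = records[:4], records[4:]

# Accuracy on the test set validates the model.
accuracy = sum(predict_1nn(train, x) == label for x, label in test) / len(test)
print(accuracy)    # → 1.0
```

Any model f (a tree, a rule, a formula) could replace the nearest-neighbour lookup; the train/test protocol stays the same.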
Simple Linear Regression Model
The simple linear regression equation provides an estimate of the population regression line:

    Yi = β0 + β1 Xi + εi

where Yi is the dependent variable, Xi the independent variable, β0 the intercept, β1 the slope coefficient, and εi the error term; β0 + β1 Xi is the linear component and εi the error component.

Simple Linear Regression Model (2)
For a given Xi, the observed value Yi = β0 + β1 Xi + εi differs from the predicted value of Y on the line by the error εi.

Least Squares Method
The estimates b0 and b1 are obtained by finding the values of b0 and b1 that minimise the sum of the squared differences between the observed Yi and the predicted Ŷi:

    min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1 Xi))²

Interpretation of Slope and Intercept
- b0 is the estimated average value of Y when the value of X is zero (intercept)
- b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)

Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected.
- Dependent variable (Y) = house price in $1000s
- Independent variable (X) = square feet

Sample Data for House Price Model
(table of house price in $1000s against square feet for the 10 houses)

Graphical Presentation
House price model: scatter plot of house price ($1000s) against square feet (1000-3000 sq. ft.).

Calculation Output
Regression Statistics:
    Multiple R           0.76211
    R Square             0.58082
    Adjusted R Square    0.52842
    Standard Error       41.33032
    Observations         10

ANOVA: Regression df = 1, SS = 18934.9348; Residual df = 8, SS = 13665.5652 (MS = 1708.1957); Total df = 9, SS = 32600.5000.

Coefficients: Intercept = 98.24833 (95% CI: −35.57720 to 232.07386); Square Feet = 0.10977 (95% CI: 0.03374 to 0.18580).

The regression equation is:
    house price = 98.24833 + 0.10977 (square feet)

Graphical Presentation
House price model: scatter plot and regression line.
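The least-squares method above has a closed form: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch, using made-up data points that lie exactly on the line y = 1 + 2x so the expected estimates are known:

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope: change in Y per one-unit change in X
    b0 = y_bar - b1 * x_bar     # intercept: estimated Y at X = 0
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]       # exactly y = 1 + 2x
b0, b1 = least_squares(xs, ys)
print(b0, b1)                   # → 1.0 2.0
```

These are the same b0 and b1 reported in the house-price output (98.24833 and 0.10977), just computed on different data.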
The fitted line (intercept 98.248, slope 0.10977) is drawn through the scatter of observed houses, roughly 1500-3000 square feet:
    house price = 98.24833 + 0.10977 (square feet)

Interpretation of the Intercept, b0
house price = 98.24833 + 0.10977 (square feet)
b0 is the estimated average value of Y when the value of X is zero. Here, no houses had 0 square feet, so b0 = 98.24833 ($1000s) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

Interpretation of the Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X. Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77, on average, for each additional square foot of size.

Predictions using Regression
Predict the price for a house with 2000 square feet:
    house price = 98.25 + 0.1098 (sq. ft.)
                = 98.25 + 0.1098 (2000)
                = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.
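The prediction step above is just an evaluation of the fitted equation, with coefficients rounded as on the slide:

```python
def predict_price(square_feet):
    """House price in $1000s from the fitted model (rounded coefficients)."""
    return 98.25 + 0.1098 * square_feet

price = predict_price(2000)     # 98.25 + 0.1098 * 2000
print(round(price, 2))          # → 317.85, i.e. $317,850
```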
Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of the data (here roughly 1000-3000 square feet, the relevant range for interpolation). Do not try to extrapolate beyond the range of observed X values.

Multiple Regression
- Multiple regression is an extension of simple linear regression.
- It is used when we want to predict the value of a variable based on the values of two or more other variables.
- The variable we want to predict is called the dependent variable; the variables we use to predict it are called the independent variables.

Multiple Regression Example
A researcher may be interested in the relationship between the weight of a car, the power of the engine, and petrol consumption.
- Independent variable 1: weight
- Independent variable 2: horsepower
- Dependent variable: miles per gallon

Multiple Regression Fitting
- Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph.
- Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of the data on a three-dimensional graph: Y = b0 + b1 X1 + b2 X2.
- More independent variables extend this into higher dimensions.
- The plane (or higher-dimensional shape) is placed so that it minimises the distance (sum of squared errors) to every data point.

Multiple Regression Assumptions
Multiple regression assumes that the independent variables are not highly correlated with each other. Use scatter plots to check.
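Fitting the plane Y = b0 + b1·X1 + b2·X2 described above amounts to solving the least-squares normal equations (XᵀX)b = Xᵀy. A minimal sketch under assumed data (points lying exactly on the plane y = 1 + 2·x1 + 3·x2, so the expected coefficients are known), with a small Gaussian elimination standing in for a linear-algebra library:

```python
def solve(a, b):
    """Solve the n x n system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

data = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]   # (x1, x2) points
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in data]             # exact plane
X = [[1.0, x1, x2] for x1, x2 in data]                    # design matrix

# Normal equations: (X^T X) b = X^T y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(X[k][i] * ys[k] for k in range(len(X))) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)
print(round(b0, 6), round(b1, 6), round(b2, 6))           # → 1.0 2.0 3.0
```

In practice a library routine (e.g. a least-squares solver) replaces the hand-rolled elimination; the point is that "fitting a plane" is the same sum-of-squared-errors minimisation as the 2-D case, one dimension up.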
Regression: Linear regression (cont.)

Measures of Variation (1)
Total variation is made up of two parts: SST = SSR + SSE, where

    SST = Σ (Yi − Ȳ)²    (total sum of squares)
    SSR = Σ (Ŷi − Ȳ)²    (regression sum of squares)
    SSE = Σ (Yi − Ŷi)²   (error sum of squares)

Ȳ  = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value

Measures of Variation (2)
- SST (total sum of squares) measures the variation of the Yi values around their mean Ȳ.
- SSR (regression sum of squares) is the explained variation, attributable to the relationship between X and Y.
- SSE (error sum of squares) is the variation attributable to factors other than the relationship between X and Y.

Coefficient of Determination, r²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r²:

    r² = SSR / SST = regression sum of squares / total sum of squares,  0 ≤ r² ≤ 1

Examples of approximate r² values
- r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
- 0 < r² < 1: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X.
- r² = 0: no linear relationship between X and Y; the value of Y does not depend on X.

Calculation Output
Regression Statistics:
    r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
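The decomposition above can be checked numerically: for a least-squares fit with an intercept, SST = SSR + SSE holds exactly and the residuals sum to zero. A sketch on made-up data:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]

# Closed-form least-squares fit.
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]               # predicted values

sst = sum((y - y_bar) ** 2 for y in ys)          # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)     # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
r2 = ssr / sst

residual_sum = sum(y - yh for y, yh in zip(ys, y_hat))   # ≈ 0 for LS fit
print(round(sst, 6), round(ssr + sse, 6), round(r2, 4))
```

Here SST = 13.0 = SSR + SSE = 12.8 + 0.2, giving r² ≈ 0.985: about 98.5% of the variation in Y is explained by X, the same reading as the 58.08% figure in the house-price output.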
    Multiple R           0.76211
    R Square             0.58082
    Adjusted R Square    0.52842
    Standard Error       41.33032
    Observations         10

58.08% of the variation in house prices is explained by variation in square feet.

                 Coefficients   Standard Error   t Stat
    Intercept    98.24833       58.03348         1.69296
    Square Feet  0.10977        0.03297          3.32938

Assumptions of Regression
- Linearity: the underlying relationship between X and Y is linear.
- Independence of residuals: the residual values (fitting errors) are independent, and for a least-squares fit they sum to zero.

Assumptions of Regression (2)
The residual for observation i, ei = Yi − Ŷi, is the difference between the observed and predicted value. Check the assumptions of regression by examining the residuals:
- Examine the linearity assumption
- Evaluate the independence assumption
- Examine for constant variance at all levels of X
Graphical analysis of residuals: plot residuals vs. X.

Residual Analysis for Linearity
(residual plots: linear vs. not linear)

Residual Analysis for Independence
(residual plots: independent vs. not independent)

Residual Analysis for Equal Variance
(residual plots: constant vs. non-constant variance)

House Price Residual Output
Plotting the residuals of the house price model against square feet (1000-3000 sq. ft.) shows no obvious pattern; the model does not appear to violate any regression assumptions.

Avoiding the Pitfalls of Regression
- Start with a scatter plot of X vs. Y to observe possible relationships.
- Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations.
- Avoid making predictions or forecasts outside the relevant range.
- For multiple regression, remember the importance of the independence assumption on the independent variables.

Classification: Decision Trees

Decision Tree example: Prediction of tax cheats
Using the tax-cheat training data (Tid 1-10, class labels Cheat = No, No, No, No, Yes, No, No, Yes, No, Yes), a decision tree splits on attributes such as marital status (MarSt: Single, Divorced vs. Married) and taxable income. There can be more than one tree that fits the same data!

Apply DT Model to Test Data
To classify a test record, start from the root of the tree and follow the branch matching the record's attribute value at each internal node until a leaf is reached; the leaf gives the predicted class. For the test record in the example, the tree assigns Cheat = No.

Decision Trees
A decision tree is a flow-chart-like tree structure; an internal node denotes …
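The apply-to-test-data walk above can be sketched as nested rules. The slide's exact tree is not fully recoverable from this dump, so this follows the classic tax-cheat tree from the Tan, Steinbach & Kumar textbook the slides link to; the Refund attribute and the 80K income threshold are assumptions from that textbook, not confirmed by the slide text:

```python
def classify_cheat(refund, marital_status, taxable_income_k):
    """Walk the tree from the root; the leaf reached gives the class."""
    if refund == "Yes":
        return "No"                    # leaf: refund claimed -> not a cheat
    if marital_status == "Married":
        return "No"                    # leaf: married -> not a cheat
    # Single or Divorced, no refund: split on taxable income ($1000s).
    return "Yes" if taxable_income_k >= 80 else "No"

# A married, no-refund test record is assigned Cheat = No, matching the
# slide's "Assign Cheat to No".
print(classify_cheat("No", "Married", 80))    # → No
print(classify_cheat("No", "Single", 90))     # → Yes
```

Each if-branch corresponds to one internal node of the tree; the return statements are the leaves.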
