
SAS Viya for Learners Visual Statistics
Decision Trees

Objectives
Describe how decision trees partition data in SAS Visual Statistics.
Describe how predictions are formulated for a decision tree.
Explain variable selection methods for decision trees.
Identify the tree variable selection methods that are available in SAS Visual Statistics.
Copyright SAS Institute Inc. All rights reserved.
As seen in the previous section, regressions, as parametric models, assume a specific association structure between inputs and target. By contrast, decision trees, as predictive algorithms, do not assume any association structure; they simply seek to isolate concentrations of cases with like-valued target measurements.
Decision trees are similar to other modeling methods that are described in this course. Cases are scored using prediction rules. A split-search algorithm facilitates predictor variable selection.
Useful predictions depend, in part, on a well-formulated model. Good formulation primarily consists of preventing the inclusion of redundant and irrelevant predictors (input variables) in the model. Predictor variable selection is complicated by large data: there are usually many predictors (columns) to consider and many pieces of information (rows) about those columns to process. This adds to the requirements of the input search method for any given model. The method must eliminate redundancies and irrelevancies, and it must also be extremely efficient. (The input search methods available in the decision tree algorithm in Visual Statistics are described in this section.)
The simple prediction problem described below illustrates each of these model essentials.
Copyright 2021, the School of ISTM, UNSW Sydney, Australia. ALL RIGHTS RESERVED.

Model Essentials: Decision Trees
Predict cases: prediction rules.
Select useful predictors: split search.
Consider a data set with two predictors and a binary target. The predictors, x1 and x2, locate each case in the unit square. The target outcome is represented by a color. Yellow is primary and blue is secondary. The analysis goal is to predict the outcome based on the location in the unit square.
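A toy data set of this shape is easy to simulate. The sketch below is plain Python using only the standard library; the rule that assigns the color is an invented stand-in, because the course does not specify the data-generating process.

```python
import random

random.seed(7)

# 100 cases located in the unit square by predictors x1 and x2.
# The color-assignment rule below is illustrative only, not from the course.
cases = []
for _ in range(100):
    x1, x2 = random.random(), random.random()
    color = "yellow" if (x1 < 0.52) == (x2 < 0.51) else "blue"
    cases.append((x1, x2, color))

print(cases[:3])
```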
To predict cases, decision trees use rules that involve the values of the predictor variables.
Decision Tree Prediction Rules
[Figure: prediction rules displayed as a decision tree and the corresponding partition of the unit square, with a root node, an interior node, and leaf nodes; the split points shown include 0.63, 0.52, and 0.51.]

The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent decision rules, and the lines order the rules. The first rule, at the base (top) of the tree, is named the root node. Subsequent rules are named interior nodes. Nodes with only one connection are leaf nodes.

To score a new case, examine the associated input variable values, and then apply the rules that are defined by the decision tree. The input variable values of a new case eventually lead to a single leaf in the tree. A tree leaf provides a decision (for example, classify as yellow) and an estimate (for example, the primary-target proportion, such as the 0.70 shown on the slide).
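To make the scoring mechanics concrete, here is a minimal Python sketch that walks a new case down the example tree. The figure is partially garbled in the source, so the rule layout (a root split on x1 at 0.52, then x2 splits at 0.51 and 0.63) is an assumed reading, and the leaf decisions and estimates, apart from the 0.70 shown on the slide, are invented for illustration.

```python
def score(x1, x2):
    """Walk one case down the example tree; return (decision, estimate).

    The estimate is the primary-target (yellow) proportion in the leaf.
    Split points follow an assumed reading of the figure; estimates other
    than the slide's 0.70 are illustrative placeholders.
    """
    if x1 < 0.52:              # root node
        if x2 < 0.51:          # interior node on the left branch
            return ("yellow", 0.70)   # the estimate shown on the slide
        return ("blue", 0.35)         # illustrative estimate
    if x2 < 0.63:              # interior node on the right branch
        return ("blue", 0.40)         # illustrative estimate
    return ("yellow", 0.65)           # illustrative estimate

print(score(0.30, 0.40))   # ('yellow', 0.7)
print(score(0.80, 0.90))   # ('yellow', 0.65)
```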
To select useful predictors, trees use a split-search algorithm. Decision trees confront the curse of dimensionality by ignoring irrelevant predictors.

Understanding the split-search algorithm for building trees enables you to better use the Tree tool and to interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar.

The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available training data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. The averages serve the same role as the unique interval input values in the discussion that follows.

For a selected input and a fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left; cases with input values greater than the split point are said to branch right. The groups, combined with the target outcomes, form a 2x2 contingency table: the columns specify branch direction (left or right), and the rows specify target value (0 or 1). An information gain statistic, based on the entropy of the root node and the entropy of the data in each partition of the split, can be used to quantify the separation of counts in the table's columns. Large values of the gain statistic suggest that the proportion of zeros and ones in the left branch differs from the proportion in the right branch, and a large difference in outcome proportions indicates a good split. An example of this calculation is given below.

The split-search diagnostic used in Visual Statistics depends on the method that is used to grow (train) the tree. Under Interactive training, a chi-square log-worth-based statistic is used to evaluate splits. The default split-search method under autonomous tree growth is based on an information gain statistic; an example of calculating the gain under the default method is given below. The Rapid Growth functionality combines k-means clustering with the gain statistic to grow the tree. A gain ratio statistic is used for split evaluation when the Split Best option is used in combination with the Rapid Growth property.

[Figure: Decision Tree Split Search. Information gain is calculated for candidate partitions of input x1, each summarized by a 2x2 left/right contingency table. The partition with the maximum gain is selected: max gain(x1) = 0.0112, at the split point x1 = 0.52.]

The best split for a predictor is the split that yields the highest information gain. For the gain calculation example, assume that there are 100 total observations with a 50/50 split of yellow/blue in the training data, and that there are 52 observations to the left of the 0.52 split point and 48 observations to the right, as in the diagram above. The class proportions are 0.53/0.47 in the left branch and 0.42/0.58 in the right branch. Based on these numbers, the gain can be formulated as follows:

Entropy(Total) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1
Entropy(Left)  = -0.53 log2(0.53) - 0.47 log2(0.47) ≈ 0.997
Entropy(Right) = -0.42 log2(0.42) - 0.58 log2(0.58) ≈ 0.98
Gain = 1 - (52/100)(0.997) - (48/100)(0.98) ≈ 0.0112
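The arithmetic above can be checked with a few lines of Python. This is a plain re-computation of the worked example, not SAS code. Note that the text's 0.0112 reflects the rounded intermediate entropies (0.997 and 0.98); carrying full precision gives a gain of about 0.0102.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a class proportion 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Worked example: 100 cases, 50/50 yellow/blue overall; 52 cases branch
# left and 48 branch right, with class proportions 0.53/0.47 on the left
# and 0.42/0.58 on the right.
h_total = entropy(0.50)   # = 1.0 exactly
h_left = entropy(0.53)    # ~0.9975 (0.997 in the text)
h_right = entropy(0.42)   # ~0.9815 (0.98 in the text)

gain = h_total - 0.52 * h_left - 0.48 * h_right
print(f"{gain:.4f}")      # 0.0102; the text's 0.0112 uses the rounded entropies
```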
The partitioning process is repeated for every input in the training data. Again, the optimal split for an input is the one that maximizes the gain function.

[Figure: the split search is repeated for input x2, which partitions the cases into bottom and top groups. The best splits are then compared across inputs: max gain(x2) = 0.0273 versus max gain(x1) = 0.0112.]

After you determine the best split for every input, the tree algorithm compares each best split's corresponding gain. The split with the highest gain is regarded as best, and the best split rule is used to partition the data.

[Figure: a partition rule is created from the best partition across all inputs, and the process is repeated in each subset. Within one subset, max gain(x1) = 0.0203 and max gain(x2) = 0.019, so the next split is on x1.]

The split search is repeated within each new leaf, and gain statistics are compared as before. A second partition rule is created, and the process repeats until no further splits are made. The resulting partition of the predictor variable space is known as the maximal tree. Under the default settings, development of the maximal tree is based exclusively on statistical measures of gain on the data.
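The whole greedy procedure (scan every candidate split on every input, keep the best, and recurse into each branch until nothing improves) fits in a short sketch. The Python below illustrates the general idea only; it is not the Visual Statistics implementation, which bins inputs and offers the diagnostics described above, and its stopping rule is deliberately simplified.

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of 0/1 target labels."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(rows, labels):
    """Exhaustive split search: try every unique value of every input
    as a split point and return (gain, input_index, split_point)."""
    base = entropy(labels)
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        for point in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] < point]
            right = [y for r, y in zip(rows, labels) if r[j] >= point]
            if not left or not right:
                continue
            n = len(labels)
            gain = (base - len(left) / n * entropy(left)
                         - len(right) / n * entropy(right))
            if gain > best[0]:
                best = (gain, j, point)
    return best

def grow(rows, labels, min_leaf=5):
    """Recursively apply the split search to form a maximal tree.
    The stopping rule (no positive gain, or too few cases to split)
    is simplified relative to the real tool."""
    gain, j, point = best_split(rows, labels)
    if j is None or len(labels) < 2 * min_leaf:
        return {"estimate": sum(labels) / len(labels)}  # leaf: target proportion
    keep = [r[j] < point for r in rows]
    left = grow([r for r, k in zip(rows, keep) if k],
                [y for y, k in zip(labels, keep) if k], min_leaf)
    right = grow([r for r, k in zip(rows, keep) if not k],
                 [y for y, k in zip(labels, keep) if not k], min_leaf)
    return {"input": j, "point": point, "left": left, "right": right}

# Tiny usage example on four handcrafted cases (x1, x2) with 0/1 targets:
tree = grow([(0.1, 0.2), (0.2, 0.8), (0.7, 0.3), (0.9, 0.9)],
            [1, 0, 0, 1], min_leaf=1)
print(tree)
```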
4.1 Decision Trees in SAS Visual Statistics

Objectives
Describe decision tree variable roles in SAS Visual Statistics.
Identify the decision tree properties in SAS Visual Statistics.
Cultivate a decision tree.
Assess decision tree performance.

Decision Trees in SAS Visual Statistics
There is only one response variable. It can be either a category or a measure. (Both classification trees and regression trees can be created.)
There can be multiple predictors. Both category and measure predictors are accommodated. (No interaction terms are allowed.)
Using Interactive mode, you can manually train and prune a decision tree.
You can derive a leaf ID. This ID can be used in other models that are featured in SAS Visual Statistics.

The decision tree in SAS Visual Statistics uses a modified version of the C4.5 algorithm.

Note: To enter Interactive mode, right-click a tree and select Enter Interactive Mode.

Note: One difference between trees and the other modeling algorithms presented in this course is that decision trees are available in SAS Visual Analytics without adding SAS Visual Statistics. However, SAS Visual Statistics does augment the decision tree functionality, and some decision tree default settings are modified with the Visual Statistics addition.

Categorical-valued and interval-valued response (target) variables are accommodated in the SAS Visual Statistics decision tree model. Although multilevel categorical target variables are allowed, one level is chosen as the event level, and the other levels are combined into the non-event category.

For binary target variables, changing the event level does not affect the hierarchical structure of the decision tree. It does change the assessment plots (confusion matrix, lift, ROC, and misclassification) that are generated for each event level. To do model comparisons (for example, between a logistic regression and a decision tree), you need to make sure that your models target the same outcome.

Note: For a measure response variable, choose whether to bin the response variable in the Options pane. This determines whether a classification tree or a regression tree is created: bin the response variable to create a classification tree, or keep it unmodified to create a regression tree.

Decision Tree Roles
Response: only one measure or category variable.
Predictors: any number of measure and category variables.
Partition ID: only one partition variable.

Decision Tree Options (Model): Event level, Autotune, Missing assignment, Minimum value, Growth strategy, Maximum branches, Maximum levels, Leaf size, Bin response variable, Response bins, Predictor bins, Bin method, Rapid growth, Prune with validation data, Reuse predictors, Number of bins, Prediction cutoff, Statistic percentile, Tolerance.

Decision Tree Options (Display): Plot layout (General), Statistic to show (Decision Tree / Icicle Plot), Legend visibility, Plot type, Plot to show, Confusion matrix legend visibility.

Event level: enables you to specify the event level. Select Choose to open a dialog box in which you can select the event level that you want to model; select the appropriate radio button and click OK in the Select Event Level window. By default, the event levels are sorted alphabetically. Make sure that you are modeling for the event of interest.

Autotune: enables you to specify the hyperparameters that control the autotuning algorithm. The hyperparameters determine how long the algorithm can run, how many times the algorithm can run, and how many model evaluations are allowed. The autotuning algorithm selects the Maximum levels and Predictor bins values that produce the best model.
Note: For more detailed information about autotuning, see Autotuning in SAS Visual Analytics 8.5: Working with SAS Visual Data Mining and Machine Learning (https://go.documentation.sas.com/?docsetId=vaobjdmml&docsetTarget=n1ot6nwcbwp4jmn1r7vks7d8g3ri.htm&docsetVersion=8.5&locale=en#n1usfp24hnj2vpn1aj279bcqot5e) in the online documentation.

Missing assignment: enables you to specify how observations with missing values are included in the model.
None: observations with missing values are excluded from the model.
Use in search: if the number of observations with missing values is greater than or equal to Minimum value, then missing values are considered a unique measurement level and are included in the model.
As machine smallest: missing interval values are set to the smallest possible machine value, and missing category values are treated as a unique measurement level.
Popular: observations with missing values are assigned to the sub-node with the most observations.
Similar: observations with missing values are assigned to the node that is considered most similar by a chi-square test for category responses or an F test for measure responses.
Note: The default method for handling missing values varies across model types in SAS Visual Statistics. In contrast to other models, the default for decision trees is Use in search.

Minimum value: specifies the minimum number of observations that are allowed to have missing values before missing values are treated as a distinct category level. This option is used only when Missing assignment is set to Use in search.

Growth strategy: specifies the parameters that are used to create the decision tree.
Custom: enables you to select the values yourself.
Basic: specifies a simple tree with a maximum of two branches per split and a maximum of six levels.
Advanced: specifies a complex tree with a maximum of four branches per split and a maximum of six levels.
Modeling: specifies a tree with the default options of SAS Visual Statistics 7.1.

Maximum branches: specifies the maximum number of branches that are allowed when you split a node. The default is 2 and the maximum is 10.

Maximum levels: specifies the maximum depth of the decision tree. The default is 6 and the maximum is 20.

Leaf size: specifies the minimum number of observations that are allowed in a leaf node. The default is 5.

Bin response variable: specifies whether a measure response variable is binned. When the variable is binned, a classification tree is created. Otherwise, a regression tree is created.

Response bins: specifies the number of bins that are used to categorize a measure response variable.

Predictor bins: specifies the number of bins that are used to categorize a predictor that is a continuous variable. The default is 20. Smaller values lead to a simpler tree that might be less accurate. Larger values lead to a more complex model that takes longer to train and might be overfit.

Bin method

Rapid growth: enables you to use the information gain ratio and k-means fast search methods for decision tree growth. When this option is enabled, bin ordering is ignored. When the option is disabled, the information gain and greedy search methods are used, which generally produces a larger tree and requires more time to create. Also, when disabled, bin order …
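SAS Visual Statistics exposes these properties through its interface, but a few of them (Maximum levels, Leaf size, and the entropy-based split criterion) have rough analogues in open-source tooling. The sketch below uses scikit-learn purely as an illustration of how such hyperparameters constrain a tree; scikit-learn is not part of the course, and its algorithm differs from the modified C4.5 described above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Rough open-source analogues of the defaults described above:
#   Maximum levels = 6   -> max_depth=6
#   Leaf size = 5        -> min_samples_leaf=5
#   gain-based search    -> criterion="entropy"
# Maximum branches has no direct analogue: scikit-learn trees are binary.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=6,
                              min_samples_leaf=5).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```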
