Abstract
To see how regression will help us evaluate baseball team performance, we will explore, analyze and model a historical baseball data set containing approximately 2200 records, to determine a teams performance based on statistics of their performance. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and the data include the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
While correlation does not equal causation it is suggested that a focus on some of the variables such as a focus on either single hits or triple or more hits to the exclusion of doubles might be worth pursuing. Also the data suggests that a focus on home runs allowed may not be worth giving up a number of more normal hits.
..To add more here.
Introduction
Because baseball is so numbers-heavy, there are many different statistics to consider when searching for the best predictors of team success. There are offensive statistics (offense meaning when a team is batting) and defensive statistics (defense meaning when a team is in the field). These categories can be broken up into many more subcategories. However, for the purpose of the this project we will use the available data to build a multiple linear regression model on the training data to predict the number of wins for the team.
Data Used
VARIABLE NAME DEFINITION | THEORETICAL EFFECT |
INDEX Identification Variable (do not use) | None |
TARGET_WINS Number of wins | Outcome Variable |
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins |
TEAM_BATTING_2B Doubles by batters (2B) | Positive Impact on Wins |
TEAM_BATTING_3B Triples by batters (3B) | Positive Impact on Wins |
TEAM_BATTING_HR Homeruns by batters (4B) | Positive Impact on Wins |
TEAM_BATTING_BB Walks by batters | Positive Impact on Wins |
TEAM_BATTING_HBPBatters hit by pitch (get a free base) | Positive Impact on Wins |
TEAM_BATTING_SO Strikeouts by batters | Negative Impact on Wins |
TEAM_BASERUN_SB Stolen bases | Positive Impact on Wins |
the data was provided in csv file. The files contain approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
1
VARIABLE NAME DEFINITION | THEORETICAL EFFECT |
TEAM_BASERUN_CS Caught stealing | Negative Impact on Wins |
TEAM_FIELDING_E Errors | Negative Impact on Wins |
TEAM_FIELDING_DP Double Plays | Positive Impact on Wins |
TEAM_PITCHING_BBWalks allowed | Negative Impact on Wins |
TEAM_PITCHING_H Hits allowed | Negative Impact on Wins |
TEAM_PITCHING_HRHomeruns allowed | Negative Impact on Wins |
TEAM_PITCHING_SO Strikeouts by pitchers | Positive Impact on Wins |
Data exploration
Variable Selection
Outliers
Correlations among predictors
Data Preparation
Build Models
Select Model
Appendix
Reviews
There are no reviews yet.