COM EM 747 Spring 2021 Assignment by: Rebecca Auger Chris Wells
Homework Assignment 2
Due: By text or word document, on Blackboard, by 5:00pm, Friday, March 5.
This assignment builds your learning in R, dplyr and ggplot. To have a little fun, you will work with a dataset of NHL player statistics for the majority of the assignment.
Please note:
* Questions with (Q) require a brief written response.
* Questions with (C) require that you report the code used to answer the question.
Part 1: Warm-Up
Remember our old friend mtcars? The dplyr package includes an expanded number of datasets to play around with, including a dataset of Star Wars characters called starwars.
1. Take a look at the starwars dataset. (With dplyr loaded, you can just type starwars.) Does this dataset follow the guidelines of tidy data? Why or why not? (Q)
2. Using ggplot, make a bar chart showing how many characters are from each homeworld in the dataset. What is the code, and what does your chart look like? (You can take a screenshot, or use the Export drop down menu to save as an image or copy to clipboard.) (Hint: to achieve this, you only need to use ggplot() with geom_bar()) (C)
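As a reminder of how geom_bar() counts rows on its own, here is a minimal sketch using mtcars rather than starwars. The axis labels and the choice of the cyl column are only illustrative assumptions, not part of the assignment.

```r
library(ggplot2)

# mtcars ships with base R. cyl is numeric, so it is converted to a factor
# to get one discrete bar per cylinder count.
p <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +  # default stat = "count": bar height is the number of rows per group
  labs(x = "Cylinders", y = "Number of cars")

print(p)
```

Note that only an x aesthetic is supplied; geom_bar() computes the bar heights itself.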
3. The chart looks very cluttered, as it is now. If our goal here is to look at homeworlds that lots of Star Wars characters have come from, we might want to filter out planets that are the homeworld of only one character. Note: to do this, we will need several dplyr functions, which we will then pass to ggplot() to create the bar plot.
Below is some code that would accomplish this goal if it were in the correct order.
Reorder the code block to create a bar chart that only includes homeworlds that occur more than once in the dataset. (C)
filter(count >= 2) %>%
ggplot(aes(x = homeworld, y = count)) +
group_by(homeworld) %>%
starwars %>%
geom_bar(stat = "identity")
summarise(count = n()) %>%
Part 2: NHL Data
Load the NHL Player Statistics dataset from Blackboard.
Originally, this dataset is from naturalstattrick.com, a website with an abundance of NHL player data available for free.
Don't worry if you aren't familiar with hockey; this assignment isn't designed to test your knowledge of the game. That said, the first step in any data analysis process is understanding the context of your data, so if you have questions, always feel free to ask!
Here is some descriptive information about the dataset:
Each row catalogues one player's time with one team. Sometimes players are traded midway through a season, which means that part of their season's record may be for one team at the start of the season and a different team at the end of the season. By separating individual player seasons into multiple rows, the dataset allows users to look at both player and team statistics.
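To make this row structure concrete, here is a small sketch with invented data. The player names, teams, and numbers are hypothetical and do not come from the NHL file; only the column names Player, Team, GP, and Goals follow the dataset description.

```r
library(dplyr)

# Hypothetical rows in the same shape as the NHL data (all values are invented).
toy <- tibble(
  Player = c("Player A", "Player A", "Player B"),
  Team   = c("BOS", "NYR", "BOS"),
  GP     = c(40, 35, 80),
  Goals  = c(10, 5, 22)
)

# Count how many rows (team stints) each player has.
stints <- toy %>% count(Player, name = "n_teams")

# Players with more than one row changed teams during the season.
traded <- stints %>% filter(n_teams > 1)
print(traded)
```

In this toy example, "Player A" occupies two rows because of a mid-season trade, while "Player B" occupies one.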
The dataset contains statistics for one season, the 2018-2019 NHL season (the last season with a regular schedule).
The dataset's variable names, with brief descriptions, are as follows:
Player: player's full first and last name
Team: team the player was playing for when the stats were recorded and aggregated
Position: playing position (Center (C), Left-Wing (L), Right-Wing (R), Defenseman (D))
GP: games played
TOI: time on ice (in minutes)
Goals: goals scored by the player
Total.Assists: assists credited to the player for helping another player score (includes primary and secondary assists)
Total.Points: in hockey, points are the sum of goals and assists
Shots: shots on goal; attempts to score
PIM: penalties in minutes
Total.Penalties: number of penalties
Penalties.Drawn: number of penalties committed against the player (i.e., the opposing team was penalized)
Giveaways: puck was lost to the other team
Takeaways: puck was taken from the other team
Hits: number of physical checks laid by the player
Hits.Taken: number of physical checks taken by the player
Faceoffs.Won: number of times the player gained control of the puck before the opposing player in a faceoff (a faceoff is similar to a basketball tip-off, but takes place before every round of play)
Faceoffs.Lost: number of times player lost control of the puck to the opposing player in the faceoff
General.Position: forward (F), includes center, left-wing, right-wing; and defensemen (D)
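Two of these variables are derived from others: Total.Points is the sum of Goals and Total.Assists, and General.Position collapses Position into F or D. The sketch below reproduces both on invented values; the Points.Check and Position.Check column names are hypothetical and do not appear in the real data.

```r
library(dplyr)

# Invented rows using the column names from the list above.
toy <- tibble(
  Position      = c("C", "D", "L", "R"),
  Goals         = c(20, 5, 12, 8),
  Total.Assists = c(30, 15, 10, 9)
)

checked <- toy %>%
  mutate(
    # Points are goals plus assists.
    Points.Check   = Goals + Total.Assists,
    # Centers and wingers are forwards (F); defensemen stay D.
    Position.Check = if_else(Position == "D", "D", "F")
  )

print(checked)
```

Comparing columns computed this way against the ones supplied in the file is a quick way to confirm you have understood the definitions.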
1. A consequence of the data structure is that if a player changed teams in mid-season, he will appear in two separate rows, one for each team. What step would be necessary to make observations based on player performance over the full season as opposed to player performance on a specific team? (Q)
1b. What dplyr function would accomplish this step? Provide a code snippet that would result in a tibble where each row is a player's performance over the whole season. (Hint: the resulting tibble should have the same columns as the original data.) (Hint 2: the resulting tibble should have 906 rows.) (C)
2. Suppose you are trying to identify the best players in the league, and you have decided to use total points to assess performance.
2a. Keeping in mind that we want to look at each player's season-level performance (i.e., keep using your code from 1b), create a histogram of the distribution of total points for all players (i.e., bars should represent the number of players achieving a certain number of points). Report the code used to create the histogram. What do you notice about the shape of the distribution? (C)(Q)
2b. Choose a reasonable cut-off point that will help you identify who the very highest-scoring players are. What code would return the list/table of these players? (C)
2c. Create a bar chart with only these players: the players' names should be on the x-axis, and the height of the bars should indicate the total points they scored. (C)
2d. There are 82 games in a hockey season, and you might notice that many of the players with high total points have played almost all of them and have over 1000 minutes of time on ice. Maybe you are interested in the relationship between points and time on ice. Make a scatterplot comparing total points to total time on ice. Report your code. What basic relationship do you observe? What are 2 different causal explanations that could explain this relationship? (C)(Q)
2e. Interesting, but hockey is a team sport, which means different players have different roles. Now build on your scatterplot: using the same x and y, use color to indicate what position each player plays on the ice (use the variable General.Position, not Position). Report your code and briefly explain how this change helps you understand the relationships in the graph. (Hint: in the variable General.Position, D stands for Defense and F stands for Forward.) (C)(Q)
2f. Now for a quick subset. When exploring data, we often want to be able to zoom in on datapoints that look especially different or interesting. Here, you might be interested in those 3 defensemen who scored quite a few points. If you wanted to zoom in on these individuals and create a table showing those three players (in rows) with their team, total points, and time on ice, what dplyr commands would you use? (C)
3. Maybe you don't really care about particular NHL players and instead want to compare the performance of teams. Say we want to create a bar chart indicating the total goals scored by each team, another indicating the average goals scored by each player on each team, and a box plot so we can understand the distribution of goals on a team level.
3a. First, we want to think about what we want our graph to look like. If we want to compare teams with each other based on all of the goals they have scored, what will be measured on the x-axis and y-axis of the chart? (Q)
3b. To begin setting up your data, use group_by and summarise to create a tibble with teams as rows and a value for each row showing how many goals the team scored. (C)
3c. Now, pipe this expression into ggplot and create a bar chart with the correct x and y axes, keeping in mind the default statistic used in geom_bar, and how to change it. (Accuracy check: There should be 32 bars on the graph, one for each team.) (C)
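As a reminder of what changing the default statistic looks like, here is a minimal sketch on a hand-typed summary table. The counts data frame and its column names are illustrative assumptions, unrelated to the NHL data.

```r
library(ggplot2)

# Hand-typed summary table: one row per group, with the bar height already computed.
counts <- data.frame(
  cyl = c("4", "6", "8"),
  n   = c(11, 7, 14)
)

# stat = "identity" tells geom_bar() to use the y values as given instead of
# counting rows; geom_col() is shorthand for the same thing.
p <- ggplot(counts, aes(x = cyl, y = n)) +
  geom_bar(stat = "identity")

print(p)
```

Without stat = "identity", geom_bar() would reject the y aesthetic, because its default statistic computes the bar heights by counting rows.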
3d. Now let's take a look at average goals scored per player. You can use the code you produced for 3c, and make a minor alteration to display the mean number of goals scored by members of each team. (C)
4. Suppose you don't care about the rest of the NHL; the only thing that really matters to you is the Boston Bruins.
4a. Report the code used to make a tibble with only players from the Boston Bruins (BOS) included. (C)
4b. Using pipes (%>%), report the code necessary to go from the original dataset to a scatterplot of Boston Bruins players' total points versus time on ice. (C)
4c. Building on Healy's suggestions for datapoint labeling, choose three of the more notable datapoints from 4b and label them with the players' names. (C)
5. Total points are one way to think about player skill. Another might be the accuracy of player shots. Percentages are common measures of accuracy, usually calculated as the proportion of attempts that succeeded. In the case of hockey, shot percentage might be a useful measure. There is no measure of shot percentage in the current data, but you can create it:
5a. Use the variables Shots and Goals to create a new variable, shot percentage. What you want to calculate is the percentage of shots that scored goals. (Hint: mutate() could be useful). (C)
5b. Let's graph that new variable somehow. Pipe your result from 5a into a ggplot() statement. You choose what you want to graph with it: you could graph it on its own as a histogram or bar plot, or you could use another variable to create a scatterplot. What have you found? (C)(Q)
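One thing to watch for with any percentage of this kind is a row with zero attempts, since 0/0 evaluates to NaN in R. The sketch below shows one possible way to guard against that, using invented attempts and successes columns; these names and values are hypothetical, and the NHL variables are deliberately not used.

```r
library(dplyr)

# Invented data: the third row has no attempts at all.
toy <- tibble(
  attempts  = c(50, 8, 0),
  successes = c(10, 2, 0)
)

toy_pct <- toy %>%
  mutate(
    # Percentage of attempts that succeeded; NA where there were no attempts,
    # so that 0/0 does not produce NaN.
    success_pct = if_else(attempts > 0, 100 * successes / attempts, NA_real_)
  )

print(toy_pct)
```

Whether rows with no attempts should be set to NA, dropped, or treated as zero is a judgment call worth mentioning in your written response.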