5/5 - (1 vote)

1
HW7: Dates Manipulation on CitiBike data

1. Load the packages readr.
Use readcsv to create a tibble from the csv file at
coursedatacitybike.csv taken from sampled data from
Citibike usage datahttps:www.citibikenyc.comsystemdata and
assign it to variable trips. The file contains data corresponding to
every bike trip taken by users in August 2016.
Peek at the first rows of trips using head to get a feel of
what your data is like and assign the result to tibble headtrips.
headtrips should print to something like
A tibble: 6 x 15
tripduration starttime stoptime start station ??? start station ???
dbl chr chr dbl chr
1 5945 852016??? 85201???228 E 48 St3 Ave
2 1494 832016??? 83201???460 S 4 StWythe ???
3826 830201??? 83020??? 3301 Columbus Ave???
4278 815201??? 81520??? 3256 Pier 40Hudso???
5729 882016??? 88201???347 Greenwich St???
6502 822201??? 82220???382 University Pl ???
??? with 10 more variables: start station latitude dbl, start station
longitude dbl, end station id dbl, end station name chr,
end station latitude dbl, end station longitude dbl,
bikeid dbl, usertype chr, birth year dbl, gender dbl
Do not modify this line!

2. Load the dplyr package.
Modify the gender column to a factor column with levels ordered from 0
to 1 and the labels Unknown for 0, Male for 1, Female for 2.
Assign the resulting tibble to tripswithgenders.
Hint: you can use mutate and factor, and judiciously use the
levels and labels arguments of factor.
The gender column of tripswithgenders should print to:
A tibble: 100,000 x 1
gender
fct
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Male
8 Male
9 Male
10 Male
??? with 99,990 more rows
And the rest of the tibble should be the same as in question 1.
Do not modify this line!

3. Load the lubridate package.
Starting with trips, modify the starttime and stoptime columns to
contain dates in the format monthdayyearhourminutesecond.
Assign the resulting tibble to tripswithdates.
Hint: you can use mutate and mdyhms.
The starttime and stoptime columns of tripswithdates should print
to:
A tibble: 100,000 x 2
starttime stoptime
dttmdttm
1 20160805 14:15:11 20160805 15:54:16
2 20160803 22:56:34 20160803 23:21:28
3 20160830 07:41:07 20160830 07:54:54
4 20160815 20:39:47 20160815 20:44:26
5 20160808 17:40:31 20160808 17:52:40
6 20160822 07:26:03 20160822 07:34:26
7 20160831 16:32:42 20160831 16:40:08
8 20160810 19:40:36 20160810 19:59:15
9 20160824 11:29:54 20160824 11:32:12
10 20160830 12:16:21 20160830 12:21:09
??? with 99,990 more rows
And the rest of the tibble should be the same as in question 1.
Do not modify this line!

4. Create a tibble containing the starttime column from tripswithdates
as well as two additional columns startymd and starthour containing
the starting time day and hour of each trip.
To do so, you need to:
starting with tripswithdates, keep only the starttime column,
you can use select for this
then create the two columns from starttime, you can use mutate
along with floordate and hour for this
assign the result to tibble tripsstarttimes
A tibble: 100,000 x 3
starttime startymd starthour
dttmdttm int
1 20160805 14:15:11 20160805 00:00:00 14
2 20160803 22:56:34 20160803 00:00:00 22
3 20160830 07:41:07 20160830 00:00:007
4 20160815 20:39:47 20160815 00:00:00 20
5 20160808 17:40:31 20160808 00:00:00 17
6 20160822 07:26:03 20160822 00:00:007
7 20160831 16:32:42 20160831 00:00:00 16
8 20160810 19:40:36 20160810 00:00:00 19
9 20160824 11:29:54 20160824 00:00:00 11
10 20160830 12:16:21 20160830 00:00:00 12
??? with 99,990 more rows
Do not modify this line!

5. Compute 1 the total number of trips, 2 the number of days in which there
was at least one trip and 3 the average number of trips that started
for each hour of the day over the whole month. Assign the resulting
tibble to tripsperhour.
To do so, you need to:
starting with tripsstarttimes, group by the hour of the day, you
can use groupby for this
then compute the three statistics listed above you can use
summarize for this
Hint:The function n returns the number of rows, and
ndistinctcolumn returns the number of distinct values in column.
tripsperhour should print to:
A tibble: 24 x 4
starthour numtrips numdays meantrips
int intintdbl
10 986 3131.8
21 506 3116.3
32 322 3110.4
43 193 30 6.43
54 195 31 6.29
65 587 3118.9
762262 3173.0
874784 31 154.
987958 31 257.
1096528 31 211.
??? with 14 more rows
Do not modify this line!

2
HW7: Strings and Dates

In this exercise, you will manipulate a dataset with strings and factors.

Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.

In this exercise, you have to recreate the figures found at
the left of the instructions.
We suggest functions you can use to create the plots, but
you are free to use the method you are the most comfortable with.
Make sure that the figures look exactly like the ones you are supposed to create.

1. Load the tidyverse, rattle,ggplot2 and lubridate packages.
Use dataweather to read the dataset weather from the
rattle package.
Store it in a tibble weather using astibble.
Do not modify this line!

2. Create two tibbles weather9am and weather3pm.
weather9am should contain all the variables which names end with
9am and the column Date. 9am should be removed from the
names of its variables and the format of Date should be changed as follow :
Instead of YMD, add YMD h:m:s AECT where AECT is the time zone
code for Australian Eastern Time and h:m:s is 09:00:00 9 am.
For example, the Date column first value will be 20071101 09:00:00 AEDT.
weather3pm should contain all the variables which names end with
3pm and the column Date. 3pm should be removed from the
names of its variables and the format of Date should be changed as follow :
Instead of YMD, add YMD h:m:s AECT where AECT is the time zone
code for Australian Eastern Time and h:m:s is 15:00:00 3 pm.
For example, the Date column first value will be 20071101 15:00:00 AEDT.
To do so, you can use :
select and endswith to select the right columns
renameall and strremove to rename the columns correctly
mutate, ymdhms and paste to change the format of Date
Hint : If you useymdhms, set tz to AustraliaCanberra
weather9am should print to :
A tibble: 366 x 7
DateWindDir WindSpeed Humidity Pressure CloudTemp
dttmord dblintdbl int dbl
1 20071101 09:00:00 SW6 681020. 714.4
2 20071102 09:00:00 E 4 801012. 517.5
3 20071103 09:00:00 N 6 821010. 815.4
4 20071104 09:00:00 WNW30 621006. 213.5
5 20071105 09:00:00 SSE20 681018. 711.1
6 20071106 09:00:00 SE 20 701024. 710.9
7 20071107 09:00:00 SE 19 631025. 412.4
8 20071108 09:00:00 SE 11 651026. 612.1
9 20071109 09:00:00 E19 701026. 714.1
10 20071110 09:00:00 S 7 821024. 713.3
??? with 356 more rows

Do not modify this line!

3. Join the tibbles weather9am and weather3pm to create a tibble
weatherarrange that contains the data in both weather9am and
weather3pm. However, its Date should change to become the
corresponding day of the year, and be named Day of the year.
For example, 20071123 15:00:00 AEDT will become 23.
Then, order the tibble from the earliest Day of the year to the latest.
To do so, you can use :
fulljoin to join the datasets
mutate and yday to change the Date
rename to change the name of Date
arrange to reorder the tibble
weatherarrange should print to :
A tibble: 732 x 7
Day of the year WindDir WindSpeed Humidity Pressure CloudTemp
dbl ord dblintdbl int dbl
1 1 ESE 2 501016. 721.9
2 1 W11 201012. 631.8
3 2 ESE 2 431013. 023
4 2 WSW 9 141009. 133.6
5 3 ESE20 691017. 819.2
6 3 ESE24 551016. 722.3
7 4 ESE24 561016. 620.3
8 4 ESE26 451013723.9
9 5 SSE 7 691012. 718.6
10 5 E15 401007. 226.8
??? with 722 more rows

Do not modify this line!

4. Plot both the smoothed temperature Temp multiplied by 3,
and Humidity as function of the date.
To do that, you can use :
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot from weatherarrange, with
Day of the year as its xaxis
geomsmooth to plot the smoothed Humidity, with col set
to green, method set to loess, and formula to yx
These are the approximation methods for the smoothing process
geomsmooth to plot the smoothed 3Temp, with col set
to yellow, method set to loess, and formula to yx
labs to set :
title to Temperature and Humidity are negatively correlated in Canberra
x to Day of the year
y to Humidity in green and rescaled Temperature in yellow
themelight
Save the plot into temphumidityplot.
Do not modify this line!

5. Create a tibble weatherwind that contains the wind direction WindDir,
the temperature Temp, and the main wind direction WindMainDir, for each
date and time we have two recordings a day at 9am and 15pm.
It should be ordered from lowest temperature to highest, and not contain
any NA value. WindMainDir should correspond to the first letter of
WindDir and be converted to factor. For example, SW main direction is S.
To do that, please create two tibbles weatherwind9am and weatherwind3pm
that contain the columns WindDir and Temp from weather9am ans
weatherwind3pm. You can use select.
weatherwind9am should print to :
A tibble: 366 x 2
WindDirTemp
ord dbl
1 SW 14.4
2 E17.5
3 N15.4
4 WNW13.5
5 SSE11.1
6 SE 10.9
7 SE 12.4
8 SE 12.1
9 E14.1
10 S13.3
??? with 356 more rows
weatherwind3pm has the same columns, but with different values.
Then, join these two tibbles into a tibble weatherwind using fulljoin.
You can then :
use dropna to drop the NA values
use mutate, factor and strsub to create the column WindMainDir.
use arrange to reorder the dataset.
weatherwind should print to :
A tibble: 667 x 3
WindDirTemp WindMainDir
ord dbl fct
1 N 0.1 N
2 E 0.8 E
3 SSE 1 S
4 SE1.2 S
5 SE1.4 S
6 NNW 1.4 N
7 NW1.8 N
8 N 2.1 N
9 ESE 2.6 E
10 SE2.7 S
??? with 657 more rows
Do not modify this line!

6. Plot the facetted histogram of the temperature Temp for each main
wind direction WindMainDir. Add a vertical line for temperature
29 degrees Celsius to see what winds bring hot temperatures.
To do this, you can use :
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot from weatherarrange, with
Temp as its xaxis
geomhistogram to create the histograms of Temperature, with
color set as black, fill as orange and binwidth as 0.5.
geomvline to draw the vertical line with aesxintercept29
and color set to green
facetwrap to facet over WindMainDir
labs to set :
title to The winds going west and north bring the highest temperatures
x to Temperature Celsius
themelight
Save this ggplot object to windhistogram.
Do not modify this line!

3
HW7: Reproducing a paper figure

This exercise was inspired by exercise 6 in Chapter 2 of
Bit By Bit: Social Research in the Digital Age
https:www.bitbybitbook.comen1stedobservingbehaviorobservingactivities
by Matt Salganik.

In a widely discussed paper, Michel and colleagues
2011https:doi.org10.1126science.1199644 analyzed the content of
more than five million digitized books in an attempt to identify longterm
cultural trends. The data that they used has now been released as the Google
NGrams dataset, and so we can use the data to replicate and extend some of
their work.

In one of the many results in the paper, Michel and colleagues argued that we
are forgetting faster and faster. For a particular year, say ???1883,??? they
calculated the proportion of all terms published in each year between 1875
and 1975 that were ???1883???. They reasoned that this proportion is a measure of
the interest in events that happened in that year. In their figure 3a, they
plotted the usage trajectories for three years: 1883, 1910, and 1950. These
three years share a common pattern: little use before that year, then a
spike, then decay.

They noticed the rate of decay for each year mentioned seemed to increase
with time and they argued that this means that we are forgetting the past
faster and faster.

The full paper can be found
herehttps:aidenlab.orgpapersScience.Culturomics.pdf, and you are
going to replicate part of figure 3a.

To do so we will focus on the mention of terms that can represent years
strings like 1765, 1886, 1897, 1937. The raw data was fetched
for you from the Google Books NGram Viewer website
http:storage.googleapis.combooksngramsbooksdatasetsv2.html and
preprocessed into two files:
datamentionsyearlycounts.tsv contains the number of mentions of
different terms per year and the number of books retrieved where the term
appeared each year
one row per term per year
datayearlytotalcounts.csv contains the total number of mentions of all
terms per year as well as the number of pages and books retrived each year
one row per year

1. Load the readr package.
Read in datamentionsyearlycounts.tsv using readtsv and assign
the resulting tibble to termsmentions.
Set the parameters of readtsv in order to make sure of the following:
column names should be, in order: term, year, nmentions,
bookcount
column types should be, in order: character, integer, integer,
integer
Hint: you can use parameters colnames and coltypes to achieve this.
termsmentions should print to:
A tibble: 53,393 x 4
term year nmentions bookcount
chr intintint
1 1817 1524 311
2 1817 1575 171
3 1817 160731
4 1817 163721
5 1817 166211
6 1817 167551
7 1817 169381
8 1817 170511
9 1817 170811
10 1817 171311
??? with 53,383 more rows
Do not modify this line!

2. Read in datayearlytotalcounts.csv using readcsv and assign the
resulting tibble to totalmentions.
Set the parameters of readcsv in order to make sure of the following:
column names should be, in order: year, totalmentions,
totalpagecount, totalbookcount
column types should be, in order: integer, double, integer,
integer
Hint: you can use parameters colnames and coltypes to achieve this.
Note: the reason you should read in the totalmentions as a double
column is that it contains very large integers that dont fit within the
bounds of numbers represented by the integer type in R. Using a
doubleprecision number is our only recourse.
totalmentions should print to:
A tibble: 425 x 4
year totalmentions totalpagecount totalbookcount
intdblintint
11505320592311
21507495864771
31515 289011 21971
41520517832231
51524 287177 12751
61525 3559 691
71527 4375 391
81541 5272 591
91563 2138439311
101564707553871
??? with 415 more rows
Do not modify this line!

3. Load the dplyr package. In order to join the totalmentions
Left join the totalmentions on termsmentions by year and assign
the resulting tibble to mentions.
Hint: you can use leftjoin.
mentions should print to:
A tibble: 53,393 x 7
term year nmentions bookcount totalmentions totalpagecount totalbookcount
chr intintintdblintint
1 1817 1524 311 287177 12751
2 1817 1575 171 186706 10671
3 1817 160731 381763 16002
4 1817 163721 681719 23153
5 1817 166211 239762 14713
6 1817 1675511644156 8918 14
7 1817 1693811038415 7426 16
8 1817 170511490874928840 60
9 1817 170811648115137416 70
10 1817 171311472064725961 77
??? with 53,383 more rows
Do not modify this line!

4. Check that your join was successful by using antijoin to drop all
observations in termsmentions that have a match in mentions and
assign the resulting tibble to diagnosis. If the join went as expected
diagnosis should be an empty tibble and print to:
A tibble: 0 x 4
??? with 4 variables: term chr, year int, nmentions int,
bookcount int
Do not modify this line!

5. Do the following:
starting with mentions, add a column fractotal that computes the
frequency of mentions of each term per year divides the number of
mentions of each term per year by the total number of mentions of all
terms that year,
select only columns term, year, nmentions, totalmentions and
fractotal.
Assign the resulting tibble to relativementioncounts.
Hint: you can use mutate and select.
relativementioncounts should print to:
A tibble: 53,393 x 5
term year nmentions totalmentionsfractotal
chr intintdbl dbl
1 1817 1524 31 287177 0.000108
2 1817 1575 17 186706 0.0000911
3 1817 16073 381763 0.00000786
4 1817 16372 681719 0.00000293
5 1817 16621 239762 0.00000417
6 1817 167551644156 0.00000304
7 1817 169381038415 0.00000770
8 1817 170514908749 0.000000204
9 1817 170816481151 0.000000154
10 1817 171314720647 0.000000212
??? with 53,383 more rows
Do not modify this line!

6. Load the forcats package.
To prepare the tibble to build the figure with:
keep only the terms 1883, 1910 and 1950,
transform the terms from characters to a factor in which the levels
are in reversed alphabetical order 1950, 1910 and 1883
Assign the result to examplesmentioncounts.
Hint: you can use filter, mutate and fctrev to reverse the
order of levels of a factor.
Note: the order matters to us to reproduce the same colors as the original
figure without setting them explicitely when generating the plot.
examplesmentionscounts should print to:
A tibble: 825 x 5
term year nmentions totalmentions fractotal
fct intintdbldbl
1 1883 15151 289011 0.00000346
2 1883 1520151783 0.0000193
3 1883 1524 15 287177 0.0000522
4 1883 1574462235 0.0000643
5 1883 15753 186706 0.0000161
6 1883 15841 151925 0.00000658
7 1883 16072 381763 0.00000524
8 1883 16375 681719 0.00000733
9 1883 16431 177489 0.00000563
10 1883 164431018174 0.00000295
??? with 815 more rows
Do not modify this line!

7. Load the ggplot2 and scales packages.
Generate a plot to reproduce the large window of figure 3a and assign the
result to paperfigure.
To do so, you can, in the following order:
create a plot from examplesmentioncounts using ggplot with the
appropriate columns assigned in the aesthetics x, y and color
arguments,
add the lines using geomline,
add scaleycontinuouslabelpercent to set the yaxis ticks to
the percent format,
limit the coordinates to show only the mentions across the timeframe
18502012 using coordcartesian and its argument xlim,
use labs to:
set title to Are we forgetting the past faster?
set x to Year
set y to Frequency of mention of each term
set color to Term
finally add themelight for a clear plot
Do not modify this line!

4
HW7: Relational data

In this exercise, you will familiarize yourself with
relational data and its manipulation.

Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.

1. Load the tidyverse, readr, and lubridate packages.
Use the function readcsv to read the dataset
booking.csv, guest.csv, rate.csv, and room.csv
from path dataguesthouse.
Store the corresponding dataframe into a tibble
booking, guest, rate, and room.
Do not modify this line!

2. Create a tibble roomearning from table booking and rate
using leftjoin.
For each bookingdate and roomno, there should be a column
earning containing the earning of this room.
The column bookingdate should be class Date.
To do that, you can use:
leftjoin to join tables booking and rate.
Note that you should join by two columns.
mutate to create a new column earning that
is the multiplication of nights and amount.
mdy to transform the bookingdate to class Date.
arrange to order the tibble.
select to select the columns.
The first rows of its print should be :
A tibble: 347 x 5
bookingdate roomno guestid nights earning
date dbldbldbl dbl
1 20161103 101 10277 336
2 20161103 102 11792 112
3 20161103 103 11062 144
4 20161103 104 12383 168
5 20161103 105 15407 588
6 20161103 106 10213 168
7 20161103 107 16233 168
8 20161103 108 1136156
9 20161103 109 15854 288
10 20161103 201 16136 288
??? with 337 more rows
Do not modify this line!

3. Create a tibble roomearningfiltered from roomearning.
This tibble will only have data whose roomno is
smaller than 200 and bookingdate between 20161115
and 20191215including both.
To do that, you can use:
filter to filter the roomno that is smaller than
200.
filter to filter the date.
mutate and as.character to transform the roomno
from numeric to character.
The first rows of its print should be :
A tibble: 76 x 5
bookingdate roomno guestid nights earning
date chrdbldbl dbl
1 20161115 101 1344148
2 20161115 105 11275 280
3 20161115 103 10414 288
4 20161115 109 1624172
5 20161115 102 15982 144
6 20161115 106 12085 280
7 20161116 101 1185148
8 20161116 109 12493 216
9 20161117 101 11875 240
10 20161117 104 14772 144
??? with 66 more rows
Do not modify this line!

4. Plot line plot: the earnings of room by date
using roomearningfiltered.
The x axis should be date labelled as Booking Date
with ticks 1115, 1201, 1215.
The y axis should be earning labelled as Earning dollars
with ticks 100, 200, 300 default setting.
To do that, you can use:
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot the bookingdate and earning column
of the roomearningfiltered.
Use aes to set parameters mapping.
geomline to plot the line plot.
scalexdate to set the ticks of x axis to
1115, 1201, 1215.
labs to format the labels such that:
titleMost rooms have earning around 100 to 300
xBooking date
yEarning dollars
facetwrap to be faceted by roomno.
themelight to set light theme i.e. a light backgroung.
Save the plot into roomearningplot.
Do not modify this line!

5. Create a tibble guestspending that stores the total spending and
total number of nights staying for each guest by their full name
from tibble guest and roomearning using rightjoin.
You can use :
mutate and paste to concatenate firstname and
lastname in guest.
rightjoin to join the aforementioned tibble with
roomearning.
groupby to group by name.
summarize to compute the total spending spending
and total number of nights staying nights.
ungroup to ungroup the tibble.
topn to select 10 rows with top spendings.
arrange to order the tibble by descending order of
total spending.
fctreorder to set name to factor and order the
level by ascending order of total spending.
The first rows of its print should be :
A tibble: 10 x 3
namenights spending
fctdbldbl
1 Sir Edward Garnier11780
2 Robert Halfon 10768
3 Angela Rayner 14744
4 Karin Smyth 11696
5 Sir Alan Haselhurst 11680
Do not modify this line!

6. Plot the bar plot of spending by name using
dataset guestspending.
The bars should be horizontal and ordered by total spending
from highest on top of the figure to lowest.
The x axis should be Spending dollars without label
with ticks 0, 200, 400, 600, 800default setting.
The y axis should beName with ticks
Sir Edward Garnier, Robert Halfon, Angela Rayner,
, Craig Tracey from top to bottomdefault setting.
To do that, use:
ggplot to initialize a ggplot object.
Set the date and mapping parameter correctly.
geomcol to plot the box plot.
coordflip to flip the axises.
labs to format the labels such that:
titleTop 10 guests all spent more than 600
ySpending dollars
xName
themelight to set light theme.
Save the plot into guestspendingplot.
Do not modify this line!

5
HW7: Relational data

In this exercise, you will familiarize yourself with
relational data and its manipulation.

Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.

3. Create a tibble roomearningfiltered from roomearning.
This tibble will only have data whose roomno is
smaller than 200 and bookingdate between 20191115
and 20191215including both.
To do that, you can use:
filter to filter the roomno that is smaller than
200.
filter to filter the date.
mutate and as.character to transform the roomno
from numeric to character.
The first rows of its print should be :
A tibble: 76 x 5
bookingdate roomno guestid nights earning
date chrdbldbl dbl
1 20161115 101 1344148
2 20161115 105 11275 280
3 20161115 103 10414 288
4 20161115 109 1624172
5 20161115 102 15982 144
6 20161115 106 12085 280
7 20161116 101 1185148
8 20161116 109 12493 216
9 20161117 101 11875 240
10 20161117 104 14772 144
??? with 66 more rows
Do not modify this line!

4. Plot line plot: the earnings of room by date
using roomearningfiltered.
The x axis should be date labelled as Booking Date
with ticks 1115, 1201, 1215.
The y axis should be earning labelled as Earning dollars
with ticks 100, 200, 300 default setting.
To do that, you can use:
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot the bookingdate and earning column
of the roomearningfiltered.
Use aes to set parameters mapping.
geomline to plot the line plot.
scalexdate to set the ticks of x axis to
1115, 1201, 1215.
labs to format the labels such that:
titleMost rooms have earning around 100 to 300
xBooking Date
yEarning dollars
facetwrap to be faceted by roomno.
themelight to set light theme i.e. a light backgroung.
Save the plot into roomearningplot.
Do not modify this line!

6
HW7: relationalregexggplot

Throughout the exercise:
Do NOT use for, while or repeat loops.
Useto structure your operations.
Use themelight for the plots.
For graphs with titles, make the format as
themeplot.titleelementtexthjust0.5, plot.subtitleelementtexthjust0.5.

1. Load the packages tidyverse and lubridate.
Use readcsv to read the datasets from data folder:
movies.csv into a tibble movie.
ratings.csv into a tibble ratings.
To check your solution, movies prints to:
A tibble: 27,254 x 3
movieId titlegenres
dbl chrchr
1 1 Toy Story 1995 AdventureAnimationChildrenComedyFanta..
2 2 Jumanji 1995 AdventureChildrenFantasy
3 3 Grumpier Old Men 1995ComedyRomance
4 4 Waiting to Exhale 1995 ComedyDramaRomance
5 5 Father of the Bride Part II 1995 Comedy
6 6 Heat 1995ActionCrimeThriller
7 7 Sabrina 1995 ComedyRomance
8 8 Tom and Huck 1995AdventureChildren
9 9 Sudden Death 1995Action
1010 GoldenEye 1995 ActionAdventureThriller
??? with 27,244 more rows
ratings prints to:
A tibble: 50,000 x 4
userId movieId ratingtimestamp
dbl dbldbldbl
136660 2804834049053
29186726575971578971
3 10725925662944169497
4 12933841612.5 1137405482
5 12869323554984620012
69798440523.5 1112061639
79770039494 1274039326
810443 948392 1420951497
98146241053 1092108684
10 6031 5934834163850
??? with 49,990 more rows
Do not modify this line!

2. Turn timestamp in the ratings dataset into normal format
e.g. yearmonthday.
To do this, you can use:
as.POSIXct to turn timestamp into
normal format, and specify the argument origin19700101.
withtzto change the time zone to UTC ,by passing
tzoneUTC.
Store the returned dataset into ratings.
To check your result, ratings prints to:
A tibble: 50,000 x 5
userId movieId ratingtimestamp date
dbl dbldbldbl dttm
136660 2804834049053 19960606 08:17:33
29186726575971578971 20001015 03:02:51
3 10725925662944169497 19991202 21:18:17
4 12933841612.5 1137405482 20060116 09:58:02
5 12869323554984620012 20010315 01:33:32
69798440523.5 1112061639 20050329 02:00:39
79770039494 1274039326 20100516 19:48:46
810443 948392 1420951497 20150111 04:44:57
98146241053 1092108684 20040810 03:31:24
10 6031 5934834163850 19960607 16:10:50
??? with 49,990 more rows
Do not modify this line!

3. Create a tibble averageratingindifferentyears of dimension 202.
The procedure should first create a new column called yearofrating that
gets the year of the rating and converted into a factor variable. Then we
group the dataset by yearofrating and summarize the dataset such that
we calculate the average ratings of
each year and store averages in a column called averageratingofyear.
To do that, you can use:
mutate and as.factor to coerce the year of date into a
factor variable.
groupby to group by yearofrating.
summarize to summarize the mean of ratings of different year and
store the value into averageratingofyear.
To check your result, averageratingindifferentyears prints to:
A tibble: 20 x 2
yearofrating averageratingofyear
fct dbl
1 1996 3.56
2 1997 3.56
3 1998 3.49
4 1999 3.59
5 2000 3.59
6 2001 3.52
7 2002 3.51
8 2003 3.50
9 2004 3.42
10 2005 3.41
??? with 10 more rows
Do not modify this line!

4. Draw a point plot of averageratingofyear vs. yearofrating.
Name the title as 2004 and 2005 have lower average ratings,
subtitle as While 2014 has highest average ratings.
To do this, you can use:
geompoint to draw a point plot of averageratingofyear
vs. yearofrating.
labs to name the title as
2004 and 2005 have lower average ratings,
the subtitle as While 2014 has highest average ratings,
the xaxis as Year,
the yaxis as Average rating.
Store the plot into a variable g1.
Do not modify this line!

5. Extract the year of movies from title column in the movies dataset.
Store the values into a new column called year and convert values into numeric.
To do this, you can use:
mutate, mapchr and as.numeric to generate a new column called year
and transform values into numeric.
hint: pay attention to substr to extract the characters that we want.
Store the returned dataset into movies.

Optional material: The title column format is relatively clean here. What if
we have 4digit years and 2digit years mix e.g. 1995, 96, 02, 2005, etc.?
Then the method above will not work and how should we modify the code or use other methods?
Way 1: Locate the index ofandthen use substr to extract years.
Way 2: Use regular expression. For instance, you can combine gsub with
pattern.09, replacement1, and xtitle in order to
extract the year.
Here, gsub matches to argument pattern within each element of a character
vector, and then replace the designed pattern.
The pattern is .09.
. means the string starts with a character and . means there are one or more characters.
andmeans there really exist parenthese.is just the syntax for escaping.
09 means there exists a string of characters that can be converted to numeric.
The replacement is 1.
1 means we only need to keep the characters that can be converted to numeric.
If interested, please go to the following website for more detials:
https:stringr.tidyverse.orgarticlesregularexpressions.html.

To check your result, movies prints to:
A tibble: 27,254 x 4
movieId title genres year
dbl chr chr dbl
1 1 Toy Story 1995AdventureAnimationChildrenComedyFa???1995
2 2 Jumanji 1995AdventureChildrenFantasy 1995
3 3 Grumpier Old Men 1995 ComedyRomance 1995
4 4 Waiting to Exhale 1995ComedyDramaRomance 1995
5 5 Father of the Bride Part II 1??? Comedy 1995
6 6 Heat 1995 ActionCrimeThriller1995
7 7 Sabrina 1995ComedyRomance 1995
8 8 Tom and Huck 1995 AdventureChildren 1995
9 9 Sudden Death 1995 Action 1995
1010 GoldenEye 1995ActionAdventureThriller1995
??? with 27,244 more rows
Do not modify this line!

6. Convert the genres column in movies dataset into characters.
Then separate the genres of a same movie such that the same movie is split
into several entries with different kinds of genres.
To do this, you can use:
mutate and as.character to convert genres into characters.
separaterows to split up the genres.
Store the returned dataset into movies.
To check your result, movies prints to:
A tibble: 54,374 x 4
movieId title genres year
dbl chr chr dbl
1 1 Toy Story 1995Adventure1995
2 1 Toy Story 1995Animation1995
3 1 Toy Story 1995Children 1995
4 1 Toy Story 1995Comedy 1995
5 1 Toy Story 1995Fantasy1995
6 2 Jumanji 1995Adventure1995
7 2 Jumanji 1995Children 1995
8 2 Jumanji 1995Fantasy1995
9 3 Grumpier Old Men 1995 Comedy 1995
10 3 Grumpier Old Men 1995 Romance1995
??? with 54,364 more rows
Do not modify this line!

7. Join the movies and ratings together by movieId. Remove movieId, userId
and timestamp columns after join.
To do this, you can use:
innerjoin to join two dataset.
select to remove specified columns.
Store returned dataset to ratingsofmovies.
To check your result, ratingsofmovies prints to:
A tibble: 132,725 x 5
titlegenres year rating date
chrchr dbldbl dttm
1 Toy Story 1995 Adventure19953.5 20050317 20:22:12
2 Toy Story 1995 Adventure19953 19990706 18:41:32
3 Toy Story 1995 Adventure19953 20090818 18:31:29
4 Toy Story 1995 Adventure19954.5 20090820 06:36:02
5 Toy Story 1995 Adventure19955 19970126 14:23:23
6 Toy Story 1995 Adventure19955 20010118 17:22:03
7 Toy Story 1995 Adventure19953.5 20040306 07:59:24
8 Toy Story 1995 Adventure19954 19961024 22:12:42
9 Toy Story 1995 Adventure19954 19971120 12:59:43
10 Toy Story 1995 Adventure19954 20050127 20:49:25
??? with 132,715 more rows
Do not modify this line!

8. Draw a horizontal boxplot of rating vs. genres.
To do this, you can use:
geomboxplot to draw a boxplot of rating vs. genres.
labs to name the title as:
Most movie genres have median ratings from 3.5 to 4,
name the xaxis as Genres,
name the yaxis as Ratings.
coordflip to flip x and y.
Store the plot into a variable g2.
Do not modify this line!

9. Group by the genres columns in ratingsofmovies. Then summrize the mean of
rating and number of rating of different genres. Store average of rating
values into a column averagerating, number of rating
into a column numberofratings.
Finally convert the genres into a factor variable.
To do this, you can use:
groupby to group by genres.
summarize to calculate the mean of
rating and number of rating of different genres. Store average of
rating values into a column averagerating, number of rating
into a column numberofratings.
mutate and as.factor to convert genres into a factor variable.
Store the returned dataset into averageratingofgenres.
To check your result, averageratingofgenres prints to:
A tibble: 20 x 3
genres averagerating numberofratings
fctdbl int
1 no genres listed4.51
2 Action3.43 13970
3 Adventure 3.49 10875
4 Animation 3.592807
5 Children3.384133
6 Comedy3.42 18701
7 Crime 3.688265
8 Documentary 3.74 588
9 Drama 3.67 22331
10 Fantasy 3.495252
??? with 10 more rows
Do not modify this line!

10.Draw a point plot of averagerating vs. numberofratings, colored by genres.
Draw a smooth curve that passes through points using loess method.
Name the title as
When number of ratings is over 5000, the average ratings start to be constant around 3.5.
To do this, you can use:
geompoint to draw a point plot of averagerating vs. numberofratings,
color by genres.
geomsmooth to draw a smooth curve using method as loess.
labs to name the title as
When number of ratings is over 5000, the average ratings start to be constant around 3.5.
the xaxis as Number of ratings,
the yaxis as Average rating,
the color legend as Genres.
Store the plot into a variable g3.
Do not modify this line!

11.Calculate the interval of time passed between the date corresponding the
rating and the release year from ratingsofmovies dataset,
and store this interval into a column called interval. You can assume
that a film released in a given year was actually released on January 1st.
Then, convert the units of interval into years and store its
values into a new column called yearspassed.
Finally, filter out rows with yearspassed less or equal to 60.
To do this, you can use:
mutate to create two required columns.
hint 1: pay attention to makedatetime to convert the release year
into a date and remembers that datesdatetimes can
be substracted to create intervals
hint 2: intervals can be divided by durations, and the dyears
function can help your compute the number of years that have passed
filter to filter out rows with yearspassed less or equal to 60.
Store the returned dataset to ratingsovertime.
To check your result, ratingsovertime prints to:
A tibble: 130,087 x 7
title genres year rating dateintervalyearspassed
chr chr dbldbl dttmdrtn dbl
1 Toy Story 199??? Adventure19953.5 20050317 20:22:12 3728.8488 da???10.2
2 Toy Story 199??? Adventure19953 19990706 18:41:32 1647.7788 da??? 4.51
3 Toy Story 199??? Adventure19953 20090818 18:31:29 5343.7719 da???14.6
4 Toy Story 199??? Adventure19954.5 20090820 06:36:02 5345.2750 da???14.6
5 Toy Story 199??? Adventure19955 19970126 14:23:23756.5996 da??? 2.07
6 Toy Story 199??? Adventure19955 20010118 17:22:03 2209.7236 da??? 6.05
7 Toy Story 199??? Adventure19953.5 20040306 07:59:24 3352.3329 da??? 9.18
8 Toy Story 199??? Adventure19954 19961024 22:12:42662.9255 da??? 1.82
9 Toy Story 199??? Adventure19954 19971120 12:59:43 1054.5415 da??? 2.89
10 Toy Story 199??? Adventure19954 20050127 20:49:25 3679.8677 da???10.1
??? with 130,077 more rows
Do not modify this line!

12.Draw a straight line plot of rating vs. yearspassed, colored by genres.
Name the title as The average rating usually increases with time,
subtitle as One exception is animation movies,
xaxis as Years between release and rating,
yaxis as Average rating,
legend as Genres.
To do this, you can use:
ggplot to setup your plot rating vs. yearspassed, colored by genres.
geomsmooth to draw a straight line plot using method as lm.
labs to name the title The average rating usually increases with time,
subtitle as One exception is animation movies,
xaxis as Years between release and rating,
yaxis as Average rating,
color legend as Genres.
Store the plot into a variable g4.

Do not modify this line!

7
HW7: NYC stock analysis

In this exercise, you will conduct complete data analysis on NYC stock price.

Dataset consists of following files:
prices.csv: raw, asis daily prices. Most of data spans from 2010 to the end 2016,
for companies new on stock market date range is shorter.
There have been approximmately 140 stock splits in that time,
this set doesnt account for that.
securities.csv: general description of each company with division on sectors
fundamentals.csv: metrics extracted from annual SEC 10K fillings 20122016,
should be enough to derive most of popular fundamental indicators.

1. Do the following:
load the readr, dplyr and tidyr package
load coursedataprices.csv using readcsv and save it to raw.
load datasecurities.csv using readcsv and save it to sectors.
load datafundamentals.csv using readcsv and save it to fund.
raw should look like:
A tibble: 851,264 x 7
datesymbolopen close lowhighvolume
dttmchrdbl dbl dbl dbl dbl
1 20160105 00:00:00 WLTW123.126.122.126. 2163600
2 20160106 00:00:00 WLTW125.120.120.126. 2386400
3 20160107 00:00:00 WLTW116.115.115.120. 2489500
4 20160108 00:00:00 WLTW115.117.114.117. 2006300
5 20160111 00:00:00 WLTW117.115.114.117. 1408600
6 20160112 00:00:00 WLTW116.116.114.116. 1098000
7 20160113 00:00:00 WLTW116.113.113.117.949600
8 20160114 00:00:00 WLTW114.114.110.115.785300
9 20160115 00:00:00 WLTW113.113.112.115. 1093700
10 20160119 00:00:00 WLTW114.110.110.116. 1523500
??? with 851,254 more rows
securities should look like:
A tibble: 505 x 8
Ticker symbol Security SEC filings GICS Sector
chr chrchr chr
1 MMM 3M Comp??? reports Industrials
2 ABT Abbott ??? reports Health Care
3 ABBVAbbVie reports Health Care
4 ACN Accentu??? reports Information ???
5 ATVIActivis??? reports Information ???
6 AYI Acuity ??? reports Industrials
7 ADBEAdobe S??? reports Information ???
8 AAP Advance??? reports Consumer Dis???
9 AES AES Corp reports Utilities
10 AET Aetna I??? reports Health Care
??? with 495 more rows, and 4 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr
fund should print to:
A tibble: 1,781 x 79
X1 Ticker Symbol Period Ending Accounts Payab???
dbl chr date dbl
1 0 AAL 201212313068000000
2 1 AAL 201312314975000000
3 2 AAL 201412314668000000
4 3 AAL 201512315102000000
5 4 AAP 201212292409453000
6 5 AAP 201312282609239000
7 6 AAP 201501033616038000
8 7 AAP 201601023757085000
9 8 AAPL20130928 36223000000
10 9 AAPL20140927 48649000000
??? with 1,771 more rows, and 75 more variables: Accounts
Receivable dbl, Addl incomeexpense items dbl, After Tax
ROE dbl, etc
Do not modify this line!

2. Load the stringr and lubridate packages.
Currently, the date in raw are in UTC time. We want to convert them
to New York time zone.
To do so, you can:
use mutate to convert the date column
use forcetz to format the date, with argument
tz set to AmericaNewYork.
Save the generated tibble into rawtime.
rawtimedate should be in this format:
1 20160105 EST 20160106 EST 20160107 EST 20160108 EST..
6 20160112 EST 20160113 EST 20160114 EST 20160115 EST..
Do not modify this line!

3. Load the package forcats.
In securities, keep the companies belonging to the top 6 sectors
by frequency of occurence, as well as those whose GICS Sub Industry
falls into Gold or Real Estate i.e., GICS Sub Industry contains
either Gold or REITs.
You need to do it in three steps:
First, use mutate and factor to convert the GICS Sector
variable of securities from character to a factor. Its levels should
be the unique values of GICS Sector.
Second, create a tibble named securitiessectored that contain
only the companies that do not belong to those that you want see below.
Note that securitiessectored should contain an additional column
GICS Sector truncated that contains the top 6 factors in GICS Sector
and all the others lumped into an additional level Other.
Third, use antijoin on securities and securitiessectored to
create securitiesselected, which contains only the rows that actually
meet the requirements above by deleting the rows from securities
that are in securitiessectored.
To achieve the second step, you can use:
mutate along with fctinfreq and fctlump to
reorder the sectors by frequency of occurence and lump
all except the top 6 into a single level Other.
filter to select the sectors that do not belong to the
top 6 i.e., the ones with the level Other.
filteralong with strdetect to additionally filter out
the observations whose GICS Sub Industry contains neither
Gold nor REITs. In the pattern, you can use a
regular expression with or represented by the alternation
symbolto do that.
To help you, securities is as in part 1, except that the GICS Sector
column is a factor whose levels print to
Levels: Industrials Telecommunications Services.
securitiessectored should print to:
A tibble: 94 x 9
Ticker symbol Security SEC filings GICS Sector
chr chrchr fct
1 AES AES Corp reports Utilities
2 APD Air Pro??? reports Materials
3 ALB Albemar??? reports Materials
4 LNT Alliant??? reports Utilities
5 AEE Ameren ??? reports Utilities
6 AEP America??? reports Utilities
7 AWK America??? reports Utilities
8 APC Anadark??? reports Energy
9 APA Apache ??? reports Energy
10 T ATT Inc reports Telecommunic???
??? with 84 more rows, and 5 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr, GICS Sector truncated fct
securitiesselected should print to:
A tibble: 411 x 8
Ticker symbol Security SEC filings GICS Sector
chr chrchr fct
1 MMM 3M Comp??? reports Industrials
2 ABT Abbott ??? reports Health Care
3 ABBVAbbVie reports Health Care
4 ACN Accentu??? reports Information ???
5 ATVIActivis??? reports Information ???
6 AYI Acuity ??? reports Industrials
7 ADBEAdobe S??? reports Information ???
8 AAP Advance??? reports Consumer Dis???
9 AET Aetna I??? reports Health Care
10 AMG Affilia??? reports Financials
??? with 401 more rows, and 4 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr
Do not modify this line!

4. Convert the column name of fund from Ticker Symbol
to Ticker symbol. This makes sure there is consistency between
column names of the different tables!.
Then create new column Period Ending Year to extract the year from
Period Ending. Then Drop NA values of fund. Select columns Ticker symbol,
Period Ending Year and Gross Margin.
Save the new tibble as fundtime.
To do that, you can use:
dplyr::rename to convert the column name.
mutate to create column Period Ending Year, inside mutate:
use strreplaceall to first convertto .
mdy with tz set to AmericaNewYork to convert the date
to EDT time zone.
year to extract the year.
You can use pipeon Period Ending inside mutate
dropna to drop the rows that contain NA values.
dplyr::select to select the intested columns.
fundtime should look like:
A tibble: 1,299 x 3
Ticker symbol Period Ending Year Gross Margin
chrdbldbl
1 AAL 2012 58
2 AAL 2013 59
3 AAL 2014 63
4 AAL 2015 73
5 AAP 2012 50
6 AAP 2013 50
7 AAP 2015 45
8 AAP 2016 45
9 AAPL2013 38
10 AAPL2014 39
??? with 1,289 more rows
Do not modify this line!

5. Select the following columns from securitiesselected:
Ticker symbol, Security, GICS Sector
Join the two tibbles securitiesselected and fundtime by Ticker symbol.
Drop the fundtime rows with NA if the corresponding Ticker symbol
in securitiesselected is not in fundtime.
To do that, you can use:
dplyr::select to select correponding columns.
innerjoin to automatically drop rows in a tibble when not matched
with other tibble. Set the argument by to Ticker symbol.
Save the concatenated tibble to securitiesfund.
securitiesfund should print to:
A tibble: 988 x 5
Ticker symbol Security GICS Sector Period Ending ???
chr chrfctdbl
1 MMM 3M Comp??? Industrials 2013
2 MMM 3M Comp??? Industrials 2014
3 MMM 3M Comp??? Industrials 2015
4 ABT Abbott ??? Health Care 2012
5 ABT Abbott ??? Health Care 2013
6 ABT Abbott ??? Health Care 2014
7 ABT Abbott ??? Health Care 2015
8 ABBVAbbVie Health Care 2013
9 ABBVAbbVie Health Care 2014
10 ABBVAbbVie Health Care 2015
??? with 978 more rows, and 1 more variable: Gross Margin dbl
Do not modify this line!

6. Load the library ggplot2.
Generate histograms of Gross Margin in different sectors in different periods
and assign the plot to variable grossmargin.
To do so, you can:
create the plot by calling ggplot on securitiesfund,
with mappingaes, in which argument fill should set to Period
Ending Year and x should set to Gross Margin.
adding geomhistogram with binwidth set to 10, color set to
black and fill set to orange.
then facet on GICS Sector using facetwrap.
labs to format the labels such that:
titleGross margin distributed differently in different sectors
xGross Margin
yCount n
then add the light theme using themelight.
Do not modify this line!

7. Join the two tibbles rawtime with securitiesfund to get each
companys stock price trend in each year with its corresponding
Gross Margin in that year.
Note: the column you want to join the tables by is Ticker symbol,
make sure they have the exact same name before joining.
To do that, you can do the following:
first convert the column name symbol of rawtime to Ticker symbol.
Hint: use dplyr::rename.
use mutate, and year to extract the year of the price happening.
use dplyr::select to select only the following columns:
Ticker symbol, close, open, date,year
use leftjoin with argument by set to Ticker symbol.
use filter to delete the rows where the Perior Ending Year is not
corresponding to the date of the price.
Save the generated tibble into fullstock.
fullstock should print to:
A tibble: 246,615 x 9
Ticker symbolclose open date year Security
chrdbldbl dttmdbl chr
1 AAL 5.12 5.220120103 00:00:002012 America???
2 AAP69.1 71.120120103 00:00:002012 Advance???
3 ABT56.7 56.620120103 00:00:002012 Abbott ???
4 ADS 103. 103. 20120103 00:00:002012 Allianc???
5 AKAM 32.9 33.020120103 00:00:002012 Akamai ???
6 ALK73.9 76.420120103 00:00:002012 Alaska ???
7 AME42.2 43.320120103 00:00:002012 AMETEK ???
8 AMT58.8 60.520120103 00:00:002012 America???
9 APH46.0 46.520120103 00:00:002012 Ampheno???
10 ARNC9.23 8.94 20120103 00:00:002012 Arconic???
??? with 246,605 more rows, and 3 more variables: GICS
Sector fct, Period Ending Year dbl, Gross Margin dbl
Do not modify this line!

8. Generate the stock close price trend plot in 20102016 of the
following company:
Aetna Inc, Amazon.com Inc, Facebook, Whole Foods Market,
FedEx Corporation, Boeing Company, The Walt Disney Company.
To do that, please do data manipulations first and create
a tibble filteredcompany. You can use:
filter and in to filter the selected companies.
filteredcompany should print to:
A tibble: 5,292 x 9
Ticker symbol closeopen date year Security
chr dbl dbl dttmdbl chr
1 AMZN257.256.20130102 00:00:002013 Amazon.???
2 BA 77.176.6 20130102 00:00:002013 Boeing ???
3 DIS51.150.8 20130102 00:00:002013 The Wal???
4 FB 2827.4 20130102 00:00:002013 Facebook
5 FDX94.293.5 20130102 00:00:002013 FedEx C???
6 WFM92.093.1 20130102 00:00:002013 Whole F???
7 AMZN258.257.20130103 00:00:002013 Amazon.???
8 BA 77.577.0 20130103 00:00:002013 Boeing ???
9 DIS51.251.0 20130103 00:00:002013 The Wal???
10 FB 27.827.9 20130103 00:00:002013 Facebook
??? with 5,282 more rows, and 3 more variables: GICS Sector fct,
Period Ending Year dbl, Gross Margin dbl
Do not modify this line!

9. Generate the stock close price trend plot in 2016 of the following company:
Aetna Inc, Amazon.com Inc, Facebook, Extra Space Storage,
FedEx Corporation, JPMorgan ChaseCo., Oracle Corp..
To do that, use:
ggplot on filteredcompany, with mappingaes,
in which argument y should set to close and x should set to date.
geomlineaes in which color is set to Security.
labs to format the labels such that:
titleSix company daily stock close price from 20102016
xdate
yDaily close price USD
use themelight to set the theme.
Do not modify this line!

10. Caculate the annual Rate of Return RoR on the securities in fullstock.
Rate of Return is defined as the net gain or loss on an investment
over a specified time period, calculated as a percentage of
the investment???s initial cost.
Namely, RoRcurrent valueinitial valueinitial value
To calculate this index on securities in fullstock, you can:
group the stock prices by Period Ending Year and Ticker symbol
using groupby
select the record of start of the year and end of the year
by using filter to select date equal to mindate or
maxdate
mutate the date to open if it is equal to mindate,
otherwise close. use ifelse inside mutate
use pivotlonger to extract open and close from date.
To do that, inside pivotlonger, set copen, close
as first argument, and then namesto as status, valuesto
as price. This will add two columns recording the opening
and closing price for each row.
use filter to select the right status for each row by
condition datestatus.
usesummarize to calculate the RoR for each stock in
each year. Inside summarize, use diff to calculate the
price difference of open price and close price and divide the
difference by price1 which represents the open price.
use leftjoin to add the annual RoR to securitiesfund
by joining with securitiesfund on Ticker symbol and
Period Ending Year.
Set by to cTicker symbol, Period Ending Year.
use droplevels drop the unselected sectors.
Save the generated tibble into returnstock.
returnstock should print to:
A tibble: 980 x 6
Groups: Period Ending Year 5
Period Ending ??? Ticker symbolReturn Security GICS Sector
dbl chr dbl chrfct
1 2012 AAL1.60 America??? Industrials
2 2012 AAP0.0170 Advance??? Consumer Dis???
3 2012 ABT0.158Abbott ??? Health Care
4 2012 ADS0.406Allianc??? Information ???
5 2012 AKAM 0.241Akamai ??? Information ???
6 2012 ALK 0.436Alaska ??? Industrials
7 2012 AME 0.133AMETEK ??? Industrials
8 2012 AMT0.278America??? Real Estate
9 2012 APH0.390Ampheno??? Information ???
10 2012 ARNC0.0291 Arconic??? Industrials
??? with 970 more rows, and 1 more variable: Gross Margin dbl
Do not modify this line!

11. Calculate the mean, 0.25 quantile and 0.75 quantile of Return for
each GICS Sector.
To do that, use:
groupby to group the stocks by GICS Sector.
summarize to calculate meanreturn using mean,
25 quantile q1 using quantile with probs set to 0.25,
75 quantile q2 using quantile with probs set to 0.75.
mutate to reorder the factor GICS Sector using fctreorder
according to meanreturn.
Save the generated tibble into summarystock.
The first four lines of summarystock should print to:
A tibble: 8 x 4
GICS Sectormeanreturn q1 q2
fctdbldbldbl
1 Industrials 0.1530.0291 0.315
2 Health Care 0.192 0.0325 0.323
3 Information Technology0.1960.006650.350
4 Consumer Discretionary0.1210.0513 0.294
Do not modify this line!

8
HW7: email

In this exercise, you will perform data analysis with emails within 184 people
from year 1998 to 2001.
1. Lets first read in the required datasets.
Load the readr package.
Use readcsv to load the people.csv data set from data folder
and assign it to a tibble people. This data set contains information
about the 184 people who sent emails between each other.
Use readcsv to load the email.csv data set from data folder
and assign it to a tibble email. This data set contains information about
each email sent: time, sender and receiver.
Do not modify this line!

2. onset variable in email is the time when the email is sent, but we can
see that it is shown in a weird way. It turns out that it represents how
many seconds from the start time. Lets convert it to normal
time stamps.
Note: 19980101 is encoded as 883612800 in onset.
Load the dplyr package.
Load the lubridate package
Use mutate to create a new variable time in email dataset.
We can obtain time by following the next three steps:
Substract the onset by 883612800 to get the seconds difference with
19980101.
Use as.POSIXct to change the seconds difference into normal timestamp.
Specify the argument origin19980101.
Use withtzto change the time zone to UTC ,by passing
tzoneUTC.
Save your generated tibble into emailwtime whose first few rows should
look like:
A tibble: 38,131 x 9
onset terminustailhead onset.censored terminus.censor??? duration edge.id
dbldbl dbl dbl lgllgl dbl dbl
9.58e8 9.58e83030 FALSEFALSE 0 1
9.59e8 9.59e83030 FALSEFALSE 0 1
9.59e8 9.59e83030 FALSEFALSE 0 1
9.64e8 9.64e83030 FALSEFALSE 0 1
9.70e8 9.70e83030 FALSEFALSE 0 1
9.70e8 9.70e83030 FALSEFALSE 0 1
9.73e8 9.73e83030 FALSEFALSE 0 1
9.74e8 9.74e83030 FALSEFALSE 0 1
9.79e8 9.79e83030 FALSEFALSE 0 1
9.85e8 9.85e83030 FALSEFALSE 0 1
??? with 38,121 more rows, and 1 more variable: date dttm
Do not modify this line!

3. Now lets take a look into the people dataset.
We can see some missing values in personname column, but we can get a
persons name using hisher emailid.
For example, Albert Meyerss email ID is just albert.meyers.
To fill in missing values with email ID, lets first create a function
emailidtoname to tranform albert.meyers into Albert Meyers.
You need to:
Load the stringr package.
Load the purrr package.
Create a functionfunction emailidtoname that take a string input
named emailid and returns the name extracted from the email id.
To do that, you can use:
strsplit to extract the first and last name from emailid,
split by ..
mapchar and paste0 to combine the first and last name using
collapse .
strtrim to remove whitespace from start and end of string.
strtotitle to capitalize each word.
Name your.
Do not modify this line!

4. Now lets implement the function you just created to our people tibble.
If personname is not missing, do not modify it. If it is missing, change
personname to the output of emailidtoname by taking emailid as
input, you can use emailidemailidtoname.
To do that, you can use:
mutate and ifelse to change the column personname.
is.na to check if the personname is missing.
Save your generated tibble into peoplenew, whose first few rows should
look like:
A tibble: 184 x 5
vertex.names emailidpersonnamerole dept
dblchr chrchrchr
1albert.meyers Albert MeyersEmployee Specialist
2a..martin Thomas MartinVice President NA
3andrea.ring Andrea RingNA NA
4andrew.lewisAndrew Lewis Director NA
5andy.zipper Andy ZipperVice President Enron Online
6a..shankman Jeffrey Shankman PresidentEnron Global Mkts
7barry.tycholizBarry Tycholiz Vice President NA
8benjamin.rogers Benjamin RogersEmployee Associate
9bill.rapp Bill RappNA NA
10 bill.williams Bill WilliamsNA NA
??? with 174 more rows
Do not modify this line!

5. We still have NA in role and dept. This time, we want to fill in the
missing values in role with Employee and missing values in dept as
General. Lets create a new tibble peoplenew2 that fills such gaps
and more!
First, load the tidyr package and create a vector roleorder
cEmployee, Trader, Manager, Managing Director, Director,
In House Lawyer, Vice President, President, CEO
Now, can fill the missing roles and make it a factor using the levels
in roleorder. To do that, you can use:
replacena to fill in the missing values in these two columns.
Remember the missing values in role should now become Employee
and the missing values in dept should now become General.
mutate to change role into a factor with factor and
specify levelsroleorder to change role into a factor according
to our order.
Save your generated tibble into peoplenew2 whose first few rows should
look like:
A tibble: 184 x 5
vertex.names emailidpersonnamerole dept
dblchr chrfctchr
1 albert.meyers Albert MeyersEmployee Specialist
2 a..martin Thomas MartinVice President General
3 andrea.ring Andrea RingEmployee General
4 andrew.lewisAndrew Lewis Director General
5 andy.zipper Andy ZipperVice President Enron Online
6 a..shankman Jeffrey Shankman PresidentEnron Global Mkts
7 barry.tycholizBarry Tycholiz Vice President General
8 benjamin.rogers Benjamin RogersEmployee Associate
9 bill.rapp Bill RappEmployee General
10bill.williams Bill WilliamsEmployee General
??? with 174 more rows
Do not modify this line!

6. Lets combine the two datasets emailwtime and peoplenew2 together,
we want to keep the information about every email and add the name, email ID,
role and department for the sender as well as for the receiver.
The numbers in tail and head represent different people. The key to join
this two tibbles are tail and head from emailwtime, and vertex.names
from peoplenew2. In other words, you need to use two leftjoin to
add first the information of the receiver and then of the sender.
You will also need to update the names of the columns added by the joins.
To add information of a receiver, you can use:
leftjoin to combine emailwtime and peoplenew2, specify
byctailvertex.names because tail represents the receiver
of an email.
rename to change the column names to specify that theyre the
receiver information: change emailid to receiveremail, personname
to receiver, role to receiverrole and dept to receiverdept.
Then, to add information of a sender, you can similarly use:
leftjoin to combine with peoplenew again, this time with
bycheadvertex.names .
rename to change the column names to specify that theyre the
sender information: change emailid to senderemail, personname
to sender, role to senderrole and dept to senderdept.
Finally, you can use:
select to only keep the date, sender information and receiver
information each has 4 columns with email ID, name, role and department.
startswith to select all the four columns of senderreceiver
information.
Save your generated tibble into t1, which should have 9 columns, in the order of
date, senderemail sender, senderrole, senderdept, receiveremail,
receiver, receiverrole, receiverdept.
The first fews rows of t1 should look like:
A tibble: 38,131 x 9
datesenderemail sender senderrole senderdept receiveremail
dttmchrchrfct chr chr
20000515 08:35:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20000518 04:15:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20000524 02:58:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20000719 07:09:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20000928 02:45:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20000928 02:52:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20001027 04:38:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20001110 02:52:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20010105 03:17:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
20010323 02:02:00 debra.perli??? Debra??? EmployeeGeneral debra.perling???
??? with 38,121 more rows, and 3 more variables: receiver chr,
receiverrole fct, receiverdept chr
Do not modify this line!

7. We noticed that there are emails that one person sent to himher self
with the same sendemail and receiveremail. We do not care about these
emails and want to filter them out.
Use filter to filter out rows in t1 whose senderemail is exactly
the same as receiveremail.
To simply our analysis with time the email is sent, we want to further
parse the information in date by creating new columns named year, month,
day and hour. To do that, you can use:
mutate to create the new columns year, month, day and hour.
year to extract year in date.
month to extract month in date.
day to extract day in date.
hour to extract hour in date.
Save your generated tibble into t2. The first few rows should look like:
A tibble: 34,427 x 13
datesenderemail sender senderrole senderdept receiveremail
dttmchrchrfct chr chr
20010315 02:43:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010402 13:44:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010605 22:40:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010611 05:20:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010308 02:52:00 kim.ward Kim W??? EmployeeGeneral jason.williams
20010328 07:40:00 kim.ward Kim W??? EmployeeGeneral jason.williams
20010328 18:40:00 kim.ward Kim W??? EmployeeGeneral jason.williams
20010402 07:53:00 kim.ward Kim W??? EmployeeGeneral jason.williams
20010402 17:53:00 kim.ward Kim W??? EmployeeGeneral jason.williams
20010403 07:24:00 kim.ward Kim W??? EmployeeGeneral jason.williams
??? with 34,417 more rows, and 7 more variables: receiver chr,
receiverrole fct, receiverdept chr, year dbl, month dbl, day int,
hour int
Now we have all the information we want. Lets try to answer some insteresting
questions in the following exercises:
Do not modify this line!

8. Who are the top 3 people who sent emails the most? What are their roles and
departments?
To answer this, you can use:
groupby to group by sender, senderrole and senderdept
summarize by specify countn to count the number of emails
arrange to sort the tibble by count of emails in decreasing order
head to extract the first three rows.
Save your generated tibble into p1 which should have the following structure:
A tibble: 3 x 4
sendersenderrolesenderdeptcount
chr fctchrint
Do not modify this line!

9. During which period of day are people tend to send emails?
Use:
ggplot to initialize a ggplot object.
Set its arguments data and mapping.
geomhistogram to plot a histogram for hour.
Specify bins24 to set the bins
labs to format the labels such that:
titlePeople send more emails during noon.
xHour.
yCountn.
themelight to change the theme of plots.
theme to change the subtitle to the middle of the plot as well.
Set its argument plot.title using elementtexthjust0.5.
Save your plot to g1.
Do not modify this line!

10. What is the trend of using emails? Do people use it more frequently in 1999 or 2001?
Lets first create a tibble to store the information.
Use:
filter to only keep the emails sent before 20020101.
groupby to group the data by year, month sender and
senderrole
summarize such that
datemindate keep record of the earliest date in each group.
countn the number of emails sent in that period.
arrange to order the rows by date.
Save your generated tibble into p2.
Do not modify this line!

11.Then lets visualize it. We want to plot the count against date
with colors spliting by senderrole.
Use:
ggplot to initialize a ggplot object.
Set its arguments data, mapping.
geompoint to plot a histogram for hour.
geomsmooth to add a smoothing regression line.
labs to format the labels such that:
titlePeople are using emails more frequently in 2001 than 1999.
xDate.
yCountn,
colorSender Role
themelight to change the theme of plots.
theme to change the subtitle to the middle of the plot as well.
Set its argument plot.title using elementtexthjust0.5.
Save your plot to g2.

Do not modify this line!

Whatsapp Us

[SOLVED] html graph statistic security 1

Reviews

Whatsapp Us

[SOLVED] html graph statistic security 1

Reviews

Related products

[Solved] BinaryAdd

[Solved] Modify your first program to print a table of the words

[Solved] Program that reads in the file climate_data_2017_numeric.csv

[Solved] Problem 3: Who are the Winners

[SOLVED] Programming Project for TCP Socket Programming

[Solved] Python Assignment-Financial Products and Markets