Data, Project, and Plotting
Reading in data
There are three often-used functions:
scan()
Most primitive and most flexible, since it reads into a vector; fast, so use it for large or very messy data
read.table()
Easiest to use; reads into a data frame
read.csv()
Most useful for reading Excel worksheets saved as .csv, or other comma-separated data
Manual entry of data
Data are nearly always read in from a text file such as a .csv, but they can also be entered manually from the keyboard
> prices <- scan()
1: 316
2: 316.91
3: 317.63
4: 318.46
5:
Read 4 items
> prices
[1] 316.00 316.91 317.63 318.46
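Inside a script there is no keyboard prompt, so the same vector has to be built another way. A minimal sketch, assuming the text= argument of scan() stands in for the typed input (the values are the four prices shown above; prices2 is a name introduced only for this illustration):

```r
# Non-interactive equivalent of typing the four values at the scan() prompt.
# scan() reads whitespace-separated numbers from the string given to text=.
prices <- scan(text = "316 316.91 317.63 318.46")
print(prices)

# c() is the more usual way to enter a handful of values by hand.
prices2 <- c(316, 316.91, 317.63, 318.46)
print(all.equal(prices, prices2))
```

Either form gives a numeric vector of length 4; scan() is mainly useful when the values are pasted in as one block of text.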
Reading in data from a file
read.table() has a number of common options (useful values listed below)
header=T first row has names for columns (the default is header=F)
sep="" how entries are separated (the default "" means white space)
na.strings="NA" which values are treated as NAs
skip=0 the number of lines to skip before reading in data
nrows=-1 number of lines of data to read (-1 means all)
col.names=c("a","b") names for columns
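The options can be tried out without any file on disk. A minimal sketch, assuming a small made-up data set held in a string and passed through the text= argument of read.table(); the data, the "." missing-value code, and the name raw are invented for illustration:

```r
# A small made-up file held in a string (hypothetical data).
raw <- "this line is a note and is skipped
id age sex
31 12 M
62 . F
50 20 F"

dat <- read.table(text = raw,
                  skip = 1,          # skip the note on the first line
                  header = TRUE,     # the next line holds the column names
                  sep = "",          # any white space separates entries
                  na.strings = ".",  # treat "." as a missing value
                  nrows = 2)         # read only the first two data rows
print(dat)
```

The second age comes back as NA, and the third data row is never read because of nrows=2.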
Download data files in varying formats from UMMoodle.
Good format data file
Download the file dat_df1.dat
Put it into your current directory or a subdirectory called data
Need a double backslash for directories
read.table(file="data\\dat_df1.dat", header=T)
Or a single forward slash
read.table(file="data/dat_df1.dat", header=T)
Get and set the current working directory: getwd() and setwd(dir)
Or use the full directory path
read.table(file="D:\\data\\dat_df1.dat", header=T)
You can download data from a website
read.table("http://xxx.xxxx.xxx/dat_df1.dat", header=TRUE)
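Path problems are the most common reason a read fails, so it can help to build the path and check it before reading. A minimal sketch, assuming the file sits in a data subfolder of the working directory as described above (f is a name used only here):

```r
# file.path() joins folder and file names with forward slashes,
# which R accepts on Windows, macOS, and Linux alike.
f <- file.path("data", "dat_df1.dat")
print(f)

# Check that the file is where R expects it before trying to read it.
if (file.exists(f)) {
  dat <- read.table(file = f, header = TRUE)
  print(head(dat))
} else {
  message("Not found: ", f, " (working directory is ", getwd(), ")")
}
```

When the file is missing, the message shows the working directory, which is usually enough to spot the mistake.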
No column names
Read in dat_df2.dat, which is missing column names
> read.table(file="data/dat_df2.dat")
  V1 V2 V3
1 31 12  M
2 62 18  F
> read.table(file="data/dat_df2.dat", col.names=c("id","age","sex"))
  id age sex
1 31  12   M
2 62  18   F
Comments preceding data are ignored
Read in dat_df3.dat, which has comments in the data file before the data
# Comments can precede the data
# Fake patient data
id age sex
31 12 M
> read.table(file="data/dat_df3.dat", header=T)
  id age sex
1 31  12   M
2 62  18   F
Omit part of data file
File dat_df4.dat has comments not preceded by #
> read.table(file="data/dat_df4.dat", header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 8 elements
> read.table(file="data/dat_df4.dat", header=T, skip=3)
  id age sex
1 31  12   M
2 62  18   F
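To decide how many lines to skip, look at the top of the file first. A minimal sketch, assuming a temporary file with three made-up lines of notes above the header; the notes and the name tmp are invented for illustration, and the real dat_df4.dat may differ:

```r
# Write a small stand-in file: three lines of notes, then header and data.
tmp <- tempfile(fileext = ".dat")
writeLines(c("Fake patient data",
             "collected for class",
             "not real",
             "id age sex",
             "31 12 M",
             "62 18 F"), tmp)

# Look at the first few lines to count what has to be skipped.
print(readLines(tmp, n = 5))

# Three lines of notes, so skip = 3 lands on the header row.
dat <- read.table(file = tmp, header = TRUE, skip = 3)
print(dat)
```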
Data not separated by white space
Read in dat_df5.dat
> read.table(file="data/dat_df5.dat", sep="/", header=T)
  id age sex
1 31  12   M
2 62  18   F
Read in .csv file with commas
> read.table(file="data/dat_df1.csv", sep=",", header=T)
Note: by default, read.table() treats white space as the separator.
Data from Excel
Open a fresh Excel workbook with a single sheet, paste your data as values only, and save as Comma delimited (*.csv)
Read the data in using read.csv()
> read.csv(file="data/dat_df1.csv", header=T)
  id age sex
1 31  12   M
2 62  18   F
3 50  20   F
Excel sheet problems
Data should start in the top left corner
All columns to the right and rows below should be empty, and should always have been empty; hence the need for a fresh sheet
Or delete all rows below and all columns to the right in Excel before saving as .csv
Reading in as text not factors
In most cases, we want text to be read in as text rather than as factors
stringsAsFactors controls how text is read in (before R 4.0.0 the default was TRUE, as in the first call below; since R 4.0.0 the default is FALSE)
> x <- read.csv(file="data/dat_df1.csv", header=T)
> x$sex
 [1] M F F M F M M F F M
Levels: F M
> x <- read.csv(file="data/dat_df1.csv", header=T, stringsAsFactors=F)
> x$sex
 [1] "M" "F" "F" "M" "F" "M" "M" "F" "F" "M"
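The difference is easy to confirm by setting the option explicitly both ways. A minimal sketch, assuming a three-row made-up CSV held in a string (raw, as_text, and as_factor are names used only here):

```r
# Made-up CSV content, read from a string instead of a file.
raw <- "id,age,sex
31,12,M
62,18,F
50,20,F"

as_text   <- read.csv(text = raw, header = TRUE, stringsAsFactors = FALSE)
as_factor <- read.csv(text = raw, header = TRUE, stringsAsFactors = TRUE)

print(class(as_text$sex))     # "character"
print(class(as_factor$sex))   # "factor"
print(levels(as_factor$sex))  # "F" "M"
```

Setting the option explicitly keeps a script behaving the same way regardless of the R version it runs under.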
Projects in RStudio
Getting R to figure out where the files are (directories) is key when reading in data from files
You can store different analyses in different projects and quickly switch between them
Each project has a separate set of .r files that are open, and a separate R workspace (saved objects in console)
Housekeeping
Steps you should take before running code
Written as a section at the top of scripts
Remove all variables from memory
Prevents errors such as use of older data
Commenting
The # symbol tells R to ignore the rest of the line
Why comment/document:
annotating someone's script is a good way to learn
remember what you did
tell collaborators what you did
a good step towards reproducible results
Section Headings
You can make a section heading: # Heading Name ####
Allows you to move quickly between sections
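Put together, the top of a script might look like the following. A minimal sketch, assuming a stand-in object old_data that plays the role of something left over from earlier work; the section names are only examples:

```r
# Housekeeping ####
old_data <- c(1, 2, 3)   # stands in for a stale object from an earlier session

# Remove all variables from memory so old objects cannot leak into this run
rm(list = ls())

# Check ####
# old_data should no longer exist after the clean-up above
still_there <- exists("old_data")
print(still_there)
```

In RStudio, the comment lines ending in #### show up as sections that can be jumped to from the editor.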
Working Directory
Tells R where your scripts and data are
Type getwd() in the console to see your working directory
RStudio automatically sets the directory to the folder containing your R project
A / separates folders and the file name
You can also set your working directory in the Session menu
Creating a self-contained project
1. Click the File menu button, then New Project.
2. Click New Directory.
3. Click New Project.
4. Type in the name of the directory to store your project, e.g. prj_testing1.
5. Click the Create Project button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.
Using projects
The project will be saved as a file in that directory prj_testing1, named prj_testing1.Rproj
Opening that file will open the RStudio project
Automatically sets the R working directory to that directory
Creating a project
Create a subfolder data under the directory prj_testing1
Download the file companies.dat from UMMoodle, and save it into the above directory
Create a new R script file, name it, and save it into your project directory
Add R code to the script file to read in the data file
> read.csv(file="./data/companies.csv", header=T)
Best practices for project organization
Treat data as read only
Data is expensive to collect. It is therefore a good idea to treat your data as read-only.
Data Wrangling
In many cases your data will be dirty and thus need to be cleaned. This is also referred to as data cleaning or data munging. Recommendation: store these scripts in a separate folder, and create a second read-only data folder to hold the cleaned data sets.
Treat generated output as disposable
Anything generated by your scripts should be reproducible by rerunning those scripts. It is useful to have an output folder with a separate sub-directory for each analysis, since many analyses are exploratory and will not be adopted in the end.
Good Enough Practices
Put each project in its own directory, which is named after the project.
Put text documents associated with the project in the doc directory.
Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
Put source for the project's scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
Name all files to reflect their content or function.
Source: https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf
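The folder layout above can be created from R itself. A minimal sketch, assuming a hypothetical project name prj_example placed under tempdir() so that nothing is written to your real folders while trying it out:

```r
# Hypothetical project location; replace tempdir() with a real parent folder
# once you are happy with the layout.
project <- file.path(tempdir(), "prj_example")

# Folder names follow the "good enough practices" list above.
subdirs <- c("data", "doc", "results", "src", "bin")

for (d in subdirs) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
print(list.dirs(project, full.names = FALSE, recursive = FALSE))
```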
Sub Directories
You can have subdirectories within your working directory to organize your data, plots, etc.
./ tells R to start from the current working directory
e.g. we have a folder called data in our working directory
./data tells R that you want to access files in that folder
Import data in the data folder into R: read.csv(file="./data/companies.dat", header=T)
Looking at Data
companies                look at the whole dataframe
head(companies)          look at the first few rows
names(companies)         names of the columns in the dataframe
str(companies)           structure of the dataframe
attributes(companies)    attributes of the dataframe
ncol(companies)          number of columns
nrow(companies)          number of rows
summary(companies)       summary statistics
plot(companies)          plot of all variable combinations
Exporting data
write.csv(companies, file="./data/companies_new.csv")
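By default write.csv() also writes the row names as an extra first column, which shows up as an additional column when the file is read back. A minimal sketch of a round trip, assuming a two-row made-up dataframe and a temporary file (the column name name and the objects out and back are invented for this illustration):

```r
# Small made-up dataframe for the round trip.
companies <- data.frame(name = c("Amazon", "Google"),
                        StockPrice = c(10.0, 207.0))

out <- file.path(tempdir(), "companies_new.csv")

# row.names = FALSE leaves out the 1, 2, ... row labels.
write.csv(companies, file = out, row.names = FALSE)

# Read the file back and compare the column names with the original.
back <- read.csv(file = out, header = TRUE)
print(names(back))
```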
Saving your Workspace
Save, clear, reload
# Saving an R workspace file
save.image(file="My Project Data.RData")
# Clear your memory
rm(list = ls())
# Reload your data
load("My Project Data.RData")
head(companies) # looking good!
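When only a few objects matter, save() with explicit object names is a narrower alternative to saving the whole workspace. A minimal sketch, assuming a two-row made-up dataframe and a temporary file (f and loaded are names used only here):

```r
# Made-up dataframe standing in for the real data.
companies <- data.frame(StockPrice = c(10.0, 207.0),
                        NumEmply = c(115, 406))

f <- file.path(tempdir(), "companies.RData")

# Save just this one object, then remove it from memory.
save(companies, file = f)
rm(companies)

# load() restores the object under its original name and
# returns the names of everything it restored.
loaded <- load(f)
print(loaded)
print(head(companies))
```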
Graphics in R: overview
There are several distinct ways of doing graphics in R
lattice (used less now)
base graphics (changes to layout are fairly easy, highly modifiable)
ggplot2 (good for multipanel plots, quick alternative views of data, changes to basic layout can be difficult)
Base graphics: plot()
plot() is the generic function for plotting R objects: points, lines, etc.
Read in the companies.csv data
> companies <- read.csv(file="./data/companies.csv", header=T)
> companies
           X StockPrice NumEmply
1     Amazon       10.0      115
2     Google      207.0      406
3 Telefonica       62.0     1320
4  Citygroup        6.8      179
5  Microsoft       52.2      440
Three ways to plot
Approach 1: specifying variables
> plot(x = companies$StockPrice, y = companies$NumEmply)
Approach 2: specify a formula using ~ operator
> plot(NumEmply ~ StockPrice, data = companies)
Approach 3: use attach()
> attach(companies)
> plot(x = NumEmply, y = StockPrice)
> detach(companies)
In the second line, where did the object NumEmply come from?
The attach() function in R can be used to make objects within dataframes accessible in R with fewer keystrokes.
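A related option, not covered in these slides, is with(), which looks up column names inside the dataframe for a single call and leaves nothing attached afterwards. A minimal sketch, assuming the companies values printed earlier and a null graphics device so it runs without opening a plot window:

```r
# Values as printed in the companies table earlier in these slides.
companies <- data.frame(
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

pdf(NULL)  # null device: nothing is drawn to screen or file

# Column names are resolved inside companies for this one call only.
with(companies, plot(x = StockPrice, y = NumEmply))

invisible(dev.off())

# Unlike after attach(), NumEmply is not visible on its own.
found <- exists("NumEmply")
print(found)
```

Because nothing stays attached, there is no detach() to forget.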
Labels on the axes
By default R uses the names of the variables as axis labels
Use the xlab and ylab options to change the labels
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlab = "Stock price (US$)", ylab = "Number of employees")
Limits of the axes
By default, R chooses x and y limits that extend about 4% beyond the range of your data at each end
May or may not include zero
To change the default x and y limits use xlim and ylim
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400))
Remove space around zero
R adds space between the axis and 0
This makes true zeros look like they are non-zeros
To remove this, use xaxs="i" and yaxs="i" together with xlim and ylim
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400), xaxs="i", yaxs="i")
Using colors in R
Colors of points, lines, text etc. can all be specified
Colors can be applied to various plot parts
col (default color)
col.axis (tick mark labels)
col.lab (x label and y label)
col.main (title of the plot)
Colors can be specified as numbers or text strings: col=1 or col="red"
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400), xaxs="i", yaxs="i", col="blue")
In-class exercise 1
Within the project you have created, copy, paste, and execute the following command into an R script
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim = c(0,300), ylim = c(0,1400),
     cex = 2, pch = 19, col = "blue")
1. Experiment with different color names: col="red"
2. Try different color numbers: col=1, col=2
3. Try a vector of color numbers: col=c(2,4)
4. Experiment with changing the values for cex and pch
Compress your project folder (prj_testing1) into a .rar/.zip package to submit!
Controlling point characteristics
Default is an open circle (pch=1) of size 1 (cex=1)
pch is short for plotting character. It controls the type of symbol: either an integer between 0 and 25, or any single character in quotes, e.g. pch="a"
cex is short for character expansion (a.k.a. size!)
Color naming
There are 657 named colors
colors()
point.colors <- c("red", "orange", "green", "blue", "magenta")
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400), cex=2, pch=19, col=point.colors)
R color charts
Other Base Graphics
# Lines and points
lines(x=seq(50,200,50), y=c(200,450,500,300), type = "b")
points(x=seq(100,250,50), y=seq(100,250,50))
## Make a histogram of island data
hist(islands)
## Specify labels
hist(islands, xlab = "Area of islands", ylab = "Frequency", col = "orange", breaks = 10)
Other Plotting Functions
The remaining slides are optional, because we will learn a much better plotting library instead!
Useful options for points
Useful code and options under ?points
Use pch=21 to pch=25 for filled symbols, for example pch=21 for filled circles or pch=23 for filled diamonds:
Specify outline color with col
Specify fill color with bg
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400),
     cex=2, pch=23, col="black", bg="salmon")
Set or Query Graphical Parameters: par()
par: the current settings associated with plot()
If you ask for help on the plot() command using ?plot, only a handful of options are listed
There are numerous extra options listed under ?par that can be added to all plotting commands, not just plot()
Using par() by itself applies settings to all subsequent graphs
> par()
$xlog
[1] FALSE
Using par() for global changes
Save default par values
old.par <- par(no.readonly = TRUE)  # skip read-only settings so they can be restored without warnings
Change to a new value
par(col.axis="red")
plot(x=companies$StockPrice, y=companies$NumEmply,
     xlim=c(0,300), ylim=c(0,1400),
     cex=2, pch=21, col="black", bg="salmon")
plot(1:3)
Restore defaults
par(old.par)
plot(1:3)
Note: avoid global changes whenever possible
To return to default plotting
Selecting the Clear All command in the plotting window resets all figures and sets par() to the default values. Use this option when you have gone too far and can't get back to a nice simple plotting screen.
Vector options for plotting
Many plotting options can handle vectors; each element applies to one point, and shorter vectors are recycled
Different point characters: pch=1:5
Different point letters: pch=c("a","g","t","c")
Different colors: col=1:5
Different sizes: cex=1:5
Example plots: first letter of each company's name as the point character; circle size proportional to number of employees
Adding legends
Legends do not take anything from the plot automatically
Look up help on the legend function (?legend); note that most options in par() can be used too
legend(x="topright",          # also "bottomleft" etc., and can use x=100, y=100
       legend=companies[,1],  # vector of text strings
       pt.bg=1:5,             # background color of points
       pch=21:25,             # vector of symbol types
       bty="n")               # no box around the legend
If you want the legend to correspond to the plot, you need to specify identical symbols, sizes, and colors for the plot and the legend
Axis properties
Tick mark labelling using yaxp and xaxp
plot(x = companies$StockPrice, y = companies$NumEmply, …
     yaxp = c(0, 1500, 3))    # c(min, max, number of intervals)
Hands-on
Try to replicate this graph as closely as possible
Colors are 1:5
Note the axes
Figure out how to add a title
More advanced axis properties
For more control over axes, use the axis() function
First create the plot but suppress the x or y axis using xaxt="n" and yaxt="n"
Then add axes to whichever side they are needed
plot(x = companies$StockPrice, y = companies$NumEmply, …
     yaxt = "n")
axis(side = 2, at = seq(0,1500,300),
     labels = c(0,300,600,900,1200,">1500"))
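The point about matching the legend to the plot is easiest to see when the symbol and color vectors are stored once and reused. A minimal sketch, assuming the companies values printed earlier, a null graphics device so it runs without a plot window, and helper names sym, fill, and lg introduced only for this illustration:

```r
# Values as printed in the companies table earlier in these slides.
companies <- data.frame(
  X          = c("Amazon", "Google", "Telefonica", "Citygroup", "Microsoft"),
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

sym  <- 21:25   # one filled symbol per company
fill <- 1:5     # one fill color per company

pdf(NULL)       # null device: nothing is drawn to screen or file

# Suppress the default y axis, then draw a custom one with axis().
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim = c(0, 300), ylim = c(0, 1500),
     pch = sym, bg = fill, cex = 2, yaxt = "n")
axis(side = 2, at = seq(0, 1500, 300))

# Reuse the same vectors so the legend matches the plotted points.
lg <- legend(x = "topright", legend = companies$X,
             pch = sym, pt.bg = fill, pt.cex = 2, bty = "n")

invisible(dev.off())
```

Remove the pdf(NULL) and dev.off() lines to see the plot in the RStudio plotting window.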
Adding text using locator()
Interactive function: click on the plot and it returns
the x and y coordinates
> location <- locator(1)
> location$x
[1] 237.7858
> location$y
[1] 1357.598
Omit the 1 for multiple clicks; press Esc to finish (in RStudio)
Add text at those coordinates
text(location$x, location$y, labels="Your choice")
Labeling points using text()
Look up the help on ?text
Can use vectors for x, y, and the text strings
After creating the plot, call text()
pos=1 below
pos=2 to the left
pos=3 above
pos=4 to the right
text(x = companies$StockPrice, y = companies$NumEmply,
     labels = companies[,1], pos=4)
Interactive point labeling
If you don't want to label all your points, but want to label a few outliers
plot(x = companies$StockPrice, y = companies$NumEmply)
identify(x = companies$StockPrice, y = companies$NumEmply,
         labels = companies[,1], n = 2)
Click near n = 2 of the points
Points and lines
?lines gives values for lty, the line types
For line widths use lwd
lwd values
lty values
More plot types
In the plot() command, type specifies the type of plot to be drawn
"p" points
"l" lines
"b" both lines and points
"c" the lines part alone of "b"
"o" both, overplotted
"h" histogram-like (or high-density) vertical lines
"s" stair steps
"n" for no plotting
Adding points or lines
You can add a series of points or lines to the current plot using points() and lines()
lines(x=seq(50,200,50), y=c(200,450,500,300),
      type="b", lwd=3, lty=2)
points(x=seq(100,250,50), y=seq(100,250,50), cex=3, pch=17)
Hands-on
The equation for the standard normal density is exp(-x^2/2)/sqrt(2*pi)
Create the plot on the right to illustrate where 95% of the area falls:
-1.96 ≤ x ≤ 1.96
Hint: use type in two different ways
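Before plotting, it is worth checking the density formula numerically. A minimal sketch, assuming the usual form exp(-x^2/2)/sqrt(2*pi) of the standard normal density and comparing it with R's built-in dnorm() and pnorm(); the names dens, max_diff, and area are used only here, and the plot itself is left as the exercise:

```r
# Grid of x values covering nearly all of the density.
x <- seq(-4, 4, by = 0.01)

# Density written out by hand.
dens <- exp(-x^2 / 2) / sqrt(2 * pi)

# Largest disagreement with the built-in density; should be essentially zero.
max_diff <- max(abs(dens - dnorm(x)))
print(max_diff)

# Area between -1.96 and 1.96 from the cumulative distribution function.
area <- pnorm(1.96) - pnorm(-1.96)
print(round(area, 3))  # 0.95
```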