
Data, Project, and Plotting

Reading in data
• There are three often-used functions
• scan()
– Most primitive and most flexible, since it reads into a vector; it is fast, so use it for large or very messy data
• read.table()
– Easiest to use; reads into a data frame
• read.csv()
– Most useful for reading Excel worksheets saved as .csv, or other comma-separated data

Manual entry of data
Data are nearly always read in from a text file such as a .csv, but they can also be entered manually from the keyboard
> prices <- scan()
1: 316
2: 316.91
3: 317.63
4: 318.46
5:
Read 4 items
> prices
[1] 316.00 316.91 317.63 318.46
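For scripts that run without a keyboard, scan() can also read from a string through its text argument. A minimal sketch, reusing the four prices above:

```r
# Read whitespace-separated numbers from a string instead of the keyboard
prices <- scan(text = "316 316.91 317.63 318.46")
prices          # the four values as a numeric vector
mean(prices)    # any vector function can be applied to the result
```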

Reading in data from a file
• read.table() has a number of common options (typical values listed below; note that header defaults to FALSE)
– header=T first row has names for columns
– sep="" how are entries separated (white space)
– na.strings="NA" which values are treated as NAs
– skip=0 the number of lines to skip before reading in data
– nrows=-1 number of lines of data to read (-1 means all)
– col.names=c("a","b") names for columns
• Download data files in varying formats from UMMoodle.
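The options above can be tried without downloading anything. The sketch below writes a tiny file to a temporary folder and reads it back; the file name "demo.dat" and its contents are made up for illustration.

```r
# Write a small example file (hypothetical contents) to a temporary location
path <- file.path(tempdir(), "demo.dat")
writeLines(c("a note that is not data",
             "id age sex",
             "31 12 M",
             "62 NA F",
             "50 20 F"), path)

# skip=1 drops the note, header=TRUE takes "id age sex" as column names,
# sep="" splits on white space, nrows=2 stops after two data rows
dat <- read.table(file = path, header = TRUE, sep = "", na.strings = "NA",
                  skip = 1, nrows = 2)
dat
```

The second age is read as a missing value because it matches na.strings.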

Good format data file
• Download the file "dat_df1.dat"
• Put it into your current directory or a subdirectory called "data"
On Windows, a backslash in a path must be doubled
read.table(file="data\\dat_df1.dat", header=T)
Or use a single forward slash
read.table(file="data/dat_df1.dat", header=T)
Get and set the current working directory with getwd() and setwd(dir)
Or use the full directory path
read.table(file="D:\\data\\dat_df1.dat", header=T)
You can also read data directly from a website
read.table("http://xxx.xxxx.xxx/dat_df1.dat", header=TRUE)

No column names
• Read in "dat_df2.dat", which is missing column names
> read.table(file="data/dat_df2.dat")
  V1 V2 V3
1 31 12  M
2 62 18  F
…
> read.table(file="data/dat_df2.dat", col.names=c("id","age","sex"))
  id age sex
1 31  12   M
2 62  18   F
…

Comments preceding data are ignored
• Read in "dat_df3.dat", which has comments in the data file before the data
# Comments can precede the data
# Fake patient data
id age sex
31 12 M

> read.table(file="data/dat_df3.dat", header=T)
  id age sex
1 31  12   M
2 62  18   F
…

Omit part of data file
• File "dat_df4.dat" has comments not preceded by #
> read.table(file="data/dat_df4.dat", header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 8 elements
• Use skip to jump over the comment lines
> read.table(file="data/dat_df4.dat", header=T, skip=3)
  id age sex
1 31  12   M
2 62  18   F
…

Data not separated by white space
• Read in "dat_df5.dat"
> read.table(file="data/dat_df5.dat", sep="/", header=T)
  id age sex
1 31  12   M
2 62  18   F
…
• Read in .csv file with commas
> read.table(file="data/dat_df1.csv", sep=",", header=T)
Note: by default read.table() uses sep="", which means entries are separated by any white space.
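To see that only the separator changes, the same two records can be read from inline text with two different sep values. This is a small sketch: the text argument stands in for the data files, which are not reproduced here.

```r
# The same records, once slash-separated and once comma-separated
slash <- read.table(text = "id/age/sex\n31/12/M\n62/18/F", sep = "/", header = TRUE)
comma <- read.table(text = "id,age,sex\n31,12,M\n62,18,F", sep = ",", header = TRUE)

slash
all.equal(slash, comma)   # compares the two resulting data frames
```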

Data from Excel
• Open a fresh Excel workbook with a single sheet, paste your data as values only, and save as "Comma delimited (*.csv)"
• Read the data in using read.csv()
> read.csv(file="data/dat_df1.csv", header=T)
  id age sex
1 31  12   M
2 62  18   F
3 50  20   F

Excel sheet problems
• Data should start in the top left corner
• All columns to the right and rows below should be empty, and should always have been empty; hence the need for a fresh sheet
• Or delete all rows below and all columns to the right in Excel before saving as .csv

Reading in as text not factors
• In most cases, we want the text to be text rather than factors
• stringsAsFactors controls how text is read in (the default was TRUE before R 4.0.0 and has been FALSE since)
> x <- read.csv(file="data/dat_df1.csv", header=T, stringsAsFactors=T)
> x$sex
[1] M F F M F M M F F M
Levels: F M
> x <- read.csv(file="data/dat_df1.csv", header=T, stringsAsFactors=F)
> x$sex
[1] "M" "F" "F" "M" "F" "M" "M" "F" "F" "M"
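The difference is easy to check on inline data. The sketch below uses three made-up rows passed through the text argument instead of the course file.

```r
csv_text <- "id,age,sex\n31,12,M\n62,18,F\n50,20,F"

# Same data, read once with factors and once with plain text
as_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$sex)    # "factor"
levels(as_factor$sex)   # the distinct values, sorted
class(as_text$sex)      # "character"
```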

Projects in RStudio
• Getting R to figure out where the files are (directories) is key when reading in data from files
• You can store different analyses in different projects and quickly switch between them
• Each project has a separate set of .r files that are open, and a separate R workspace (saved objects in console)

Housekeeping
• Steps you should take before running code
– Written as a section at the top of scripts
– Remove all variables from memory
– Prevents errors such as use of older data

Commenting
• The # symbol tells R to ignore the rest of the line
• commenting/documenting
– annotating someone else's script is a good way to learn
– remember what you did
– tell collaborators what you did
– good step towards reproducible results

Section Headings
• You can make a section heading #Heading Name####
• Allows you to move quickly between sections
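Housekeeping, comments, and section headings usually appear together at the top of a script. A minimal sketch of such a script follows; the section names and the ages vector are made up for illustration.

```r
# Housekeeping ####
# Remove all objects from the workspace so old data cannot leak into this run
rm(list = ls())

# Data ####
# Small made-up vector standing in for data read from a file
ages <- c(12, 18, 20)

# Analysis ####
mean_age <- mean(ages)   # average of the three ages
mean_age
```

In RStudio, each comment line ending in #### shows up in the section navigator at the bottom of the editor.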

Working Directory
• Tells R where your scripts and data are
• type “getwd()” in the console to see your working directory
– RStudio automatically sets the directory to the folder containing your R project
• A "/" separates folders and files
• You can also set your working directory in the "Session" menu

Creating a self-contained project
1. Click the "File" menu button, then "New Project".
2. Click "New Directory".
3. Click "New Project".
4. Type in the name of the directory to store your project, e.g. "prj_testing1".
5. Click the "Create Project" button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.

Using projects
• The project will be saved as a file in the directory "prj_testing1", named "prj_testing1.Rproj"
• Opening that file will open the RStudio project
• This automatically sets the R working directory to that directory

Creating a project
• Create a subfolder "data" under the directory "prj_testing1"
• Download the file "companies.dat" from UMMoodle, and save it into the above directory
• Create a new R script file, name it, and save it into your project directory
• Add R code to the script file to read in the data file
> read.csv(file="./data/companies.csv", header=T)

Best practices for project organization
• Treat data as read only
– Data is expensive to collect. It is therefore a good idea to treat your data as "read-only".
• Data Wrangling
– In many cases your data will be "dirty" and thus need to be "cleaned". This is also referred to as "data cleaning" or "data munging". Recommendation: store these scripts in a separate folder, and create a second "read-only" data folder to hold the "cleaned" data sets.
• Treat generated output as disposable
– Anything generated by your scripts should be reproducible from your scripts. It is useful to have an output folder with a separate sub-directory for each analysis, since many analyses are exploratory and will not be adopted in the end.

Good Enough Practices
• Put each project in its own directory, which is named after the project.
• Put text documents associated with the project in the doc directory.
• Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
• Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
• Name all files to reflect their content or function.
Source: https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf
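The directory layout in the list above can be created from R itself. This is a sketch under assumptions: the project name "prj_demo" is hypothetical, and the folders are created under a temporary directory so nothing real is touched.

```r
# Create the suggested sub-directories under a scratch project folder
root <- file.path(tempdir(), "prj_demo")   # hypothetical project name
for (d in c("doc", "data", "results", "src", "bin")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
list.files(root)   # shows the sub-directories that now exist
```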

Sub Directories
• You can have sub directories within your working directory to organize your data, plots, etc.
• "./" tells R to start from the current working directory
• e.g. we have a folder called "data" in our working directory
– "./data" tells R that you want to access files in that folder
• import data in the data folder into R:
read.csv(file="./data/companies.dat", header=T)
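Relative paths such as "./data" are resolved against the working directory. The sketch below tries this in a scratch folder; the file "tiny.csv" and its contents are made up, and the original working directory is restored at the end.

```r
old_dir <- getwd()                 # remember where we started
setwd(tempdir())                   # scratch folder standing in for a project folder
dir.create("data", showWarnings = FALSE)
writeLines(c("id,age", "31,12"), "data/tiny.csv")   # hypothetical data file

# "./data/tiny.csv" is looked up relative to the current working directory
tiny <- read.csv(file = "./data/tiny.csv", header = TRUE)
tiny

setwd(old_dir)                     # go back to the original directory
```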

Looking at Data
companies               # look at the whole dataframe
head(companies)         # look at the first few rows
names(companies)        # names of the columns in the dataframe
str(companies)          # structure of the dataframe
attributes(companies)   # attributes of the dataframe
ncol(companies)         # number of columns
nrow(companies)         # number of rows
summary(companies)      # summary statistics
plot(companies)         # plot of all variable combinations
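These functions can be tried without the data file by typing the table in by hand. The values below are copied from the companies table printed in the base graphics part of these notes; building the data frame with data.frame() is an assumption made only to keep the example self-contained.

```r
# Rebuild the companies table by hand
companies <- data.frame(
  X          = c("Amazon", "Google", "Telefonica", "Citygroup", "Microsoft"),
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

head(companies, 3)      # first three rows only
names(companies)        # column names
str(companies)          # type and first values of each column
nrow(companies)         # number of rows
ncol(companies)         # number of columns
summary(companies$StockPrice)
```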

Exporting data
write.csv(companies, file="./data/companies_new.csv")
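A quick way to check an export is to read the file straight back in. In this sketch the two-row data frame and the temporary output path are assumptions for illustration; row.names = FALSE is used so that the row numbers are not written as an extra first column.

```r
# Two rows are enough to check the round trip
companies <- data.frame(X = c("Amazon", "Google"),
                        StockPrice = c(10.0, 207.0))

out <- file.path(tempdir(), "companies_new.csv")
write.csv(companies, file = out, row.names = FALSE)

back <- read.csv(file = out, header = TRUE)
back
```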

Saving your Workspace
Save
# Saving an R workspace file
save.image(file="My Project Data.RData")
Clear
# Clear your memory
rm(list = ls())
Reload
# Reload your data
load("My Project Data.RData")
head(companies) # looking good!
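The same save, clear, reload cycle can be sketched for a single object with save(), which stores only the objects named rather than the whole workspace as save.image() does. The data frame and the temporary file location are assumptions for illustration.

```r
companies <- data.frame(X = c("Amazon", "Google"),
                        StockPrice = c(10.0, 207.0))

rdata <- file.path(tempdir(), "My Project Data.RData")
save(companies, file = rdata)   # store just this object
rm(companies)                   # remove it from memory
exists("companies")             # check whether it is still visible

load(rdata)                     # restores the object under its original name
head(companies)
```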

Graphics in R: overview
• There are several distinct ways of doing graphics in R
– base graphics (changes to layout are fairly easy, highly modifiable)
– lattice (used less now)
– ggplot2 (good for multipanel plots, quick alternative views of data, changes to basic layout can be difficult)

Base graphics: plot()
• plot() is the generic function for plotting R objects
– points, lines, etc.
• Read in the "companies.csv" data
> companies <- read.csv(file="./data/companies.csv", header=T)
> companies
           X StockPrice NumEmply
1     Amazon       10.0      115
2     Google      207.0      406
3 Telefonica       62.0     1320
4  Citygroup        6.8      179
5  Microsoft       52.2      440

Three ways to plot
Approach 1: specifying variables
> plot(x = companies$StockPrice, y = companies$NumEmply)
Approach 2: specify a "formula" using the "~" operator
> plot(NumEmply ~ StockPrice, data = companies)
In the second approach, where did the object NumEmply come from? The data argument tells plot() to look for it inside companies.
Approach 3: attach the data frame
The attach() function in R can be used to make objects within dataframes accessible in R with fewer keystrokes.
> attach(companies)
> plot(x = StockPrice, y = NumEmply)
> detach(companies)
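The approaches can be compared side by side. In the sketch below the data frame is typed in from the companies table above, the plots go to a temporary PDF so the code also runs without a screen, and with() is swapped in for attach()/detach() as an alternative that needs no clean-up afterwards.

```r
companies <- data.frame(
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

out <- file.path(tempdir(), "three_ways.pdf")
pdf(out)   # omit pdf() and dev.off() to draw on screen instead

plot(x = companies$StockPrice, y = companies$NumEmply)   # explicit vectors
plot(NumEmply ~ StockPrice, data = companies)            # formula plus data
with(companies, plot(x = StockPrice, y = NumEmply))      # columns looked up inside companies

dev.off()
```

All three calls plot the same points; only the default axis labels differ.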

Labels on the axes
• By default R uses the names of the variables as axis labels
• Use the xlab and ylab options to change the labels
plot(x = companies$StockPrice, y = companies$NumEmply,
xlab = "Stock price (US$)", ylab = "Number of employees")

Limits of the axes
• By default, R chooses x and y limits that extend 4% beyond the range of your data at each end
• May or may not include zero
• To change the default x and y limits use xlim and ylim
plot(x = companies$StockPrice, y = companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400))

Remove space around zero
• Even with xlim and ylim set, R adds a little space (4%) between the axis and 0
• This makes true zeros look like they are non-zeros
• To remove this, use xaxs="i" and yaxs="i" together with xlim and ylim
plot(x = companies$StockPrice, y = companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400), xaxs="i", yaxs="i")
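The extra space can be measured with par("usr"), which reports the actual extent of the plotting region as c(x1, x2, y1, y2). A small sketch, with the data typed in from the companies table and a temporary PDF as the output device:

```r
companies <- data.frame(
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

out <- file.path(tempdir(), "axis_limits.pdf")
pdf(out)   # omit pdf() and dev.off() to draw on screen instead

# Default axis style: limits are padded by 4% of their range at each end
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim = c(0, 300), ylim = c(0, 1400))
usr_default <- par("usr")

# Internal axis style: the plotting region ends exactly at the limits
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim = c(0, 300), ylim = c(0, 1400), xaxs = "i", yaxs = "i")
usr_exact <- par("usr")

dev.off()

usr_default   # -12 312 -56 1456
usr_exact     # 0 300 0 1400
```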

Using colors in R
• Colors of points, lines, text etc. can all be specified
• Colors can be applied to various plot parts
– col (default color)
– col.axis (tick mark labels)
– col.lab (x label and y label)
– col.main (title of the plot)
• Colors can be specified as numbers or text strings
– col=1 or col="red"
plot(x = companies$StockPrice, y = companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400), xaxs="i", yaxs="i", col="blue")

In-class exercise 1
Within the project you have created, copy and paste the following command into an R script and execute it
plot(x = companies$StockPrice, y = companies$NumEmply, xlim = c(0,300), ylim = c(0,1400),
cex = 2, pch = 19, col = "blue")
1. Experiment with different color names: col="red"
2. Try different color numbers: col=1, col=2
3. Try a vector of color numbers: col=c(2,4)
4. Experiment with changing the values for cex and pch
Compress your project folder (prj_testing1) into a .rar/.zip package to submit!

Controlling point characteristics
• Default is an open circle (pch=1) of size 1 (cex=1)
• pch is short for plotting character. It controls the type of symbol: either an integer between 0 and 25, or any single character given in quotes, e.g. pch="a"
• cex is short for character expansion (a.k.a. size!)

Color naming
• There are 657 named colors
colors()
point.colors <- c("red", "orange", "green", "blue", "magenta")
plot(x = companies$StockPrice, y = companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400), cex=2, pch=19, col=point.colors)
(Figure: R color charts)

Other Base Graphics
# Lines and points
lines(x=seq(50,200,50), y=c(200,450,500,300), type = "b")
points(x=seq(100,250,50), y=seq(100,250,50))
## Make a histogram of island data
hist(islands)
## Specify labels
hist(islands, xlab = "Area of islands", ylab = "Frequency", col = "orange", breaks = 10)

Other Plotting Functions
• The remaining slides are optional, because we will learn a much better plotting library instead!

Useful options for points
• Useful code and options are listed under ?points
• Use pch=21 for filled circles, for example:
– Specify circle color with col
– Specify fill color with bg
plot(x = companies$StockPrice, y = companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400),
cex=2, pch=21, col="black", bg="salmon")

Set or Query Graphical Parameters: par()
par: the current settings associated with plot()
• If you ask for help on the plot() command using ?plot, only a handful of options are listed
• There are numerous extra options listed under ?par that can be added to all plotting commands, not just plot()
• Setting options with par() itself applies them to all subsequent graphs
> par()
$xlog
[1] FALSE
…
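par() can also be called with a name to query a single setting, and when it changes a setting it returns the previous value, which makes the change easy to undo. A minimal sketch, drawing to a temporary PDF:

```r
out <- file.path(tempdir(), "par_demo.pdf")
pdf(out)   # omit pdf() and dev.off() to draw on screen instead

before <- par("col.axis")          # query one setting
old <- par(col.axis = "red")       # change it; the old value is returned
plot(1:3)                          # tick labels drawn in red
changed <- par("col.axis")

par(old)                           # put the previous value back
plot(1:3)                          # tick labels back to the earlier color
restored <- par("col.axis")

dev.off()
```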

Using par() for global changes
• Save default par values (no.readonly=TRUE keeps only the settings that can be set again, so restoring them gives no warnings)
old.par <- par(no.readonly = TRUE)
• Change to a new value
par(col.axis="red")
plot(x=companies$StockPrice, y=companies$NumEmply,
xlim=c(0,300), ylim=c(0,1400),
cex=2, pch=21, col="black", bg="salmon")
plot(1:3)
• Restore defaults
par(old.par)
plot(1:3)
Note: avoid global changes whenever possible

To return to default plotting
Selecting the Clear All command in the plotting window resets all figures and sets par() to the default values. Use this option when you have gone too far and can't get back to a nice simple plotting screen.

Vector options for plotting
• Many plotting options can handle vectors; each element applies to one point, and shorter vectors are recycled
• Different point characters: pch=1:5
• Different point letters: pch=c("a","g","t","c")
• Different colors: col=1:5
• Different sizes: cex=1:5
(Figure: points drawn as the first letter of each company's name; circle size proportional to number of employees)

Adding legends
• Legends do not pick up anything from the plot
• Look up help on the legend function (?legend); note that most options in par() can be used too
legend(x="topright",       # also "bottomleft" etc., and can use x=100, y=100
legend=companies[,1],      # vector of text strings
pt.bg=1:5,                 # background color of points
pch=21:25,                 # vector of symbol types
bty="n")                   # no box type
If you want the legend to correspond to the plot, you need to specify identical symbols, sizes, and colors for the plot and the legend

Axis properties
• Tick mark labelling using yaxp and xaxp
– c(min, max, number of spaces between intervals)
plot(x = companies$StockPrice, y = companies$NumEmply,
…
yaxp = c(0, 1500, 3))

Hands-on
• Try to replicate this graph as closely as possible
• Colors are 1:5
• Note the axes
• Figure out how to add a title

More advanced axis properties
• For more control over axes, use the axis() function
• First create the plot but suppress the x or y axis using xaxt="n" and yaxt="n"
• Then add axes to whichever side they are needed
plot(x = companies$StockPrice, y = companies$NumEmply, …
yaxt = "n")
axis(side = 2, at = seq(0,1500,300),
labels = c(0,300,600,900,1200,">1500"))
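Vector options, a matching legend, and a hand-made axis can be combined in one plot. This is a sketch under assumptions: the data are typed in from the companies table, the output goes to a temporary PDF, and the symbol and color choices are only one possibility.

```r
companies <- data.frame(
  X          = c("Amazon", "Google", "Telefonica", "Citygroup", "Microsoft"),
  StockPrice = c(10.0, 207.0, 62.0, 6.8, 52.2),
  NumEmply   = c(115, 406, 1320, 179, 440)
)

out <- file.path(tempdir(), "legend_axis.pdf")
pdf(out)   # omit pdf() and dev.off() to draw on screen instead

# One symbol type and one fill color per company; y axis suppressed for now
plot(x = companies$StockPrice, y = companies$NumEmply,
     xlim = c(0, 300), ylim = c(0, 1500),
     pch = 21:25, bg = 1:5, cex = 2, yaxt = "n")

# Add the y axis by hand with ticks every 300
axis(side = 2, at = seq(0, 1500, 300))

# The legend repeats the same symbols and fills so that it matches the plot
legend(x = "topright", legend = companies$X,
       pch = 21:25, pt.bg = 1:5, bty = "n")

dev.off()
```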

Adding text using locator()
• Interactive function: click on the plot and it returns
the x and y coordinates
> location <- locator(1)
> location$x
[1] 237.7858
> location$y
[1] 1357.598
Omit the 1 for multiple clicks; press Esc to exit
Add text at those coordinates
text(location$x, location$y, label="Your choice")

Labeling points using text()
• Look up the help on ?text
• Can use vectors for x, y, and the text strings
• After creating the plot, call text()
– pos=1 below
– pos=2 to the left
– pos=3 above
– pos=4 to the right
text(x = companies$StockPrice, y = companies$NumEmply,
labels = companies[,1], pos=4)

Interactive point labeling
• If you don't want to label all your points but there are a few outliers, use identify() to label only the points you click on
plot(x = companies$StockPrice, y = companies$NumEmply,
…)
identify(x = companies$StockPrice, y = companies$NumEmply,
labels = companies[,1], n = 2)
• Click near n = 2 of the points

Points and lines
• ?lines gives values for lty, the line types
• For line widths use lwd
(Figure: examples of lwd values and lty values)

More plot types
• In the plot() command, type specifies the type of plot to be drawn
– "p" points
– "l" lines
– "b" both lines and points
– "c" lines part alone of "b"
– "o" overplotted
– "h" histogram-like (or high-density) vertical lines
– "s" stair steps
– "n" for no plotting

Adding points or lines
• You can add a series of points or lines to the current plot using points() and lines()
lines(x=seq(50,200,50), y=c(200,450,500,300),
type="b", lwd=3, lty=2)
points(x=seq(100,250,50), y=seq(100,250,50), cex=3, pch=17)
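Because lines() and points() only add to an existing plot, a plot has to be open first. One way, sketched below, is to set up empty axes with type="n" and then add the two series; the axis ranges and the temporary PDF are assumptions for illustration.

```r
out <- file.path(tempdir(), "add_lines_points.pdf")
pdf(out)   # omit pdf() and dev.off() to draw on screen instead

# Empty frame: axes are set up from the ranges, but nothing is drawn
plot(x = c(0, 300), y = c(0, 600), type = "n", xlab = "x", ylab = "y")

# Dashed thick line with points at each vertex
lines(x = seq(50, 200, 50), y = c(200, 450, 500, 300),
      type = "b", lwd = 3, lty = 2)

# Large filled triangles
points(x = seq(100, 250, 50), y = seq(100, 250, 50), cex = 3, pch = 17)

dev.off()
```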

Hands-on
The equation for the standard normal density is exp(-x^2/2)/sqrt(2*pi)
Create the plot on the right to illustrate where 95% of the area falls:
-1.96 ≤ x ≤ 1.96
Hint: use type in two different ways
