Name: [SOLVED] STAT340: Discussion 6: Earthquake frequencies
Brand: Assignment Chef
SKU: 8229826928
Price: 25 USD
Availability: InStock
Rating: 5 (1 reviews)

5/5 - (1 vote)

title:STAT340: Discussion 6: Earthquake frequencies
documentclass: article
classoption: letterpaper
html_document:

highlight: tango
fig_caption: false

table{width:50%!important;margin-left:auto!important;margin-right:auto!important;}
ol[style*=decimal]>li{margin-top:40px!important;}

`{r setup, include=FALSE}
# check packages installed
knitr::opts_chunk$set(echo=T,tidy=F,strip.white=F,fig.align=center,comment= #,message=F,warning=F)
options(width=120)

[link to source](ds06.Rmd)

## XKCD comic

## Exercise

Today, we will be using *real data* from the [US Geological Survey (USGS)](https://www.usgs.gov/natural-hazards/earthquake-hazards/earthquakes) to build a model to **estimate the frequency of earthquakes** across the world. As usual, we encourage you to work in small groups, but this is up to you.

### Background

In seismology (the study of earthquakes), the relationship between the frequency and magnitude of earthquakes (in a certain place and period of time) can be modeled by the [Gutenberg-Richter law](https://en.wikipedia.org/wiki/Gutenberg%E2%80%93Richter_law) (GR law). Let $M$ be the [Richter magnitude](https://en.wikipedia.org/wiki/Richter_magnitude_scale) of a seismic event, and $N$ be the number of events with magnitude ***greater than or equal to $M$***. Then, the GR law states that $$N=10^{a-bM}$$ or in other words $$log_{10}(N)=a-bM$$ where $a$ and $b$ are unknown constants that depend on the particular choice of place/period. Note that this relationship should appear as a line on a plot of $log_{10}(N)$ vs $M$.

### Data import

This dataset contains every earthquake (on Earth) of magnitude 4.5 or greater that was detected by USGS from beginning of 2011 to end of 2020. A detailed description of the columns can be [**found here**](https://earthquake.usgs.gov/data/comcat/data-eventterms.php), but the **main variables** we are interested in **are `time` and `mag.binned`** (magnitude rounded to nearest half integer).

For convenience, much of the data cleaning and preprocessing has already been done for you; [**download the prepared data here**](https://kdlevin-uwstat.github.io/STAT340-Fall2021/discussion/ds06/quakes.csv.gz) and **save into the same directory as this source file**. Then, uncomment and run the lines of code below to import the data.

# library(tidyverse)
# library(lubridate)
# parseTime = function(times){
# as.POSIXct(strptime(times,%Y-%m-%d %H:%M:%OS,tz=UTC))
# quakes = read_csv(quakes.csv.gz, col_types=TddddcddddccTccddddccc) %>%
# mutate_at(c(time,updated),parseTime)

# col_types tells R precisely what the column types are, to avoid wrong type errors
# parseTime here is a function to help convert the datetime strings in this file to datetime format
# mutate_at is used to apply parseTime to both the time and updated columns
# also, recall that .gz means its been compressed using the gz program, which saves a lot of space,
# but R can still directly read it as if its an ordinary csv file (underrated feature)

For our purposes, this data needs to be summarized to obtain $N$ vs $M$ values for each year, where $N$ is the number of earthquakes with magnitude at least as strong as a given value of $M$ (see wiki page on [GR law](https://en.wikipedia.org/wiki/Gutenberg%E2%80%93Richter_law) for more details). Since this is a bit tricky, its also been done for you below.

# quakes.count = quakes %>%
# count(year,mag.binned,name=count) %>%
# group_by(year) %>%
# mutate(N=rev(cumsum(rev(count))), year=as.character(year))

# count(year,mag.binned) counts how many events with that magnitude occurred each year
# (see https://dplyr.tidyverse.org/reference/count.html for more info)
# group_by(year) followed by cumsum() takes the cumulative sum in each year
# the rev(rev()) runs this cumsum in reverse (otherwise we get M instead of M)

Before moving onto the next step, inspect the data frame to **make sure you completely understand what it represents** and check that everything looks right. (Your first and last rows should look something like `2011, 4.5,7430, 10636` and `2020, 7, 10, 10`).

# print(quakes.count,n=Inf)

### Visualization

As usual, the **first step is to visualize the data**. Make a **scatter plot** of $log_{10}(N)$ vs $M$ (note this means $M$ is on the horizontal axis (you say $y$ vs $x$, NOT $x$ vs $y$)). Then, **add a line plot on top**, making sure the **years are correctly grouped** together and distinguished from each other (use something like color or linetype (or both!), or even something else, (use your judgment to determine what looks best)).

_Note: you can either use `log10(N)` as the `y` aesthetic, OR directly use `y=N` and just rescale the axis to be logarithmic using `scale_y_log10()`. I recommend this second method since it makes the axis easier to read ([see here for an example](https://stackoverflow.com/a/9223257))._

Ideally, it might look something like this (dont forget to add nice title/labels!)

# ggplot(quakes.count,aes()) +

### Estimation

Next, we will fit a simple linear regression to the data to estimate $a$ and $b$. Complete the line of code below to fit the model (dont forget the linear relationship is NOT between $N$ and $M$ but rather between $log_{10}(N)$ and $M$, so adjust your model formula accordingly!).

# lm.quakes = lm()

View a summary of the model to see coefficients estimates, $p$-values, and other relevant info.

# summary(lm.quakes)

From your fit, what are your estimates for the constants $a$ and $b$? Pay **careful attention** to the signs here! (hint: remember the GR law uses the convention $a-bM$ whereas R assumes you are fitting `intercept + slope * M`, so therefore your fitted slope = $-b$).

Try to avoid copy-pasting or manually typing in values. The `coef()` function lets you extract coefficients from a model object (e.g. `lm.quakes`). You can then use `[i]` to access the i-th value of this coefficients vector. You may also want to use `unname()` at the end to remove the name from the value (if you dont, it may carry through later calculations and no longer be a correct name for the output value).

# a = ?
# b = ?

(Hint: if you did this correctly, `a+b` should evaluate to approximately `9.68`)

### Checking fit

Its always **nice to visually check your fit** to make sure everything looks right. Plot your **line of best fit along with the data points** in the chunk below.

_Hint: this time, I recommend using `log10(N)` as the `y` aesthetic, then using [`geom_abline()`](https://ggplot2.tidyverse.org/reference/geom_abline.html) to plot your line of regression using your estimated slope and intercept values (this avoids dealing with distorted axis cause by the other methods `scale_y_log10()` which can be non-intuitive to deal with). Note you can use your variables `a` and `b` here that were defined previously to avoid manually typing in numerical estimates (which is almost always bad!)._

# ggplot(quakes.count,aes()) +

You can also check [the residuals of your fit](https://rpubs.com/iabrady/residual-analysis). This is fairly convenient to do in base R. Note there is some **heteroscedasticity** due to the fact that higher magnitude earthquakes occur much less often than lower magnitude earthquakes so its harder to estimate them as precisely. **This is expected** and not a huge problem. Normality is mostly satisfied.

# par(mfrow=c(1,2))
# plot(lm.quakes,which=1:2)

### Confidence & prediction

Give $95%$ confidence intervals for $a$ and $b$ ([help page](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/confint)).

# confint()

Give a brief interpretation of these intervals.