The goal of this project is to redo the linear modeling of only the best model from your Project 1, but on an Apache Spark platform. You should use R's sparklyr package to talk to an Apache Spark instance. The two tasks are:
- Install sparklyr and Apache Spark on your computer
- Run R code that uses Spark to redo the linear modeling
Note that it is not required to redo the exploratory data analysis in Project 1.
Installing sparklyr and Apache Spark:
It is easiest to install from within RStudio (assuming that the tidyverse library is also installed).
- Install package sparklyr
- Load the sparklyr library and install Apache Spark using sparklyr:
library(sparklyr)
spark_install()
This should work on either Windows or Linux. More detailed instructions for installing on Linux from scratch are included at the end.
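To confirm that the installation succeeded, one option is to list the Spark versions that sparklyr has installed locally. This is a minimal sketch, assuming the default install location; the variable name `installed` is only illustrative.

```r
library(sparklyr)

# List the Spark versions that sparklyr has installed locally.
# An empty result suggests that spark_install() did not complete.
installed <- spark_installed_versions()
print(installed)
```

If the table is empty, rerunning spark_install() and watching for download errors is a reasonable first step.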
Linear modeling in Spark
This code will be very similar to the code shown in class. Use the same data file from Project 1.
> library(tidyverse)
> mylocaldata <- read_csv("http://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Bodyfat.csv")
> library(sparklyr)
> sc <- spark_connect(master = "local")
> myremotedata <- copy_to(sc, mylocaldata)
> mymodel <- ml_linear_regression(x = myremotedata, formula = bodyfat ~ Weight + Height)
> summary(mymodel)
You will edit the above code to use the specific variables you used and perform any transformations that you used.
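As an illustration of how a transformation might be applied before fitting, the sketch below computes a derived column inside Spark with dplyr and then fits the model on the transformed table. The log transformation of Weight and the names `mytransformed` and `logWeight` are hypothetical; substitute the variables and transformations from your own Project 1 model.

```r
library(sparklyr)
library(dplyr)
library(readr)

# Read the Project 1 data locally, then copy it to a local Spark instance.
mylocaldata <- read_csv("http://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Bodyfat.csv")
sc <- spark_connect(master = "local")
myremotedata <- copy_to(sc, mylocaldata, overwrite = TRUE)

# Hypothetical transformation: natural log of Weight, computed inside Spark.
# Replace this with whatever your Project 1 model actually used.
mytransformed <- myremotedata %>%
  mutate(logWeight = log(Weight))

# Fit the linear model on the transformed Spark table.
mymodel <- ml_linear_regression(mytransformed, formula = bodyfat ~ logWeight + Height)
summary(mymodel)

# Close the connection when finished.
spark_disconnect(sc)
```

Doing the transformation with mutate() on the remote table keeps the computation in Spark rather than in local R memory, which is the point of the exercise.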
Appendix. Installation on Linux/Tuffix:
You may want to try out Tuffix, the Titan-branded version of Ubuntu 18.04. Instructions on how to install Tuffix or a Tuffix-based VM are in the Tuffix Titanium Community for Students, https://communities.fullerton.edu/course/view.php?id=1547 (also the best venue to receive help with Tuffix). It is easiest to install into a Linux (virtual) machine. Then install R or RStudio, install the sparklyr package inside R/RStudio, and use sparklyr's spark_install() function to do the Spark installation.
- To install R, from the Linux command line:
> sudo apt install r-base
- To install RStudio:
> Download the latest version (as a .deb file) from https://www.rstudio.com/products/rstudio/download/#download
> sudo apt install gdebi
> sudo gdebi <location of downloaded rstudio .deb file>
- Make sure Java 8 is used.
> sudo apt install openjdk-8-jdk
> sudo update-alternatives --config java
This will show all currently installed versions of Java. Select Java 8 (openjdk-8-jdk).
- Install the sparklyr package inside R/RStudio:
> install.packages("tidyverse")
If there are errors during installation of tidyverse, make sure these libraries are installed:
- sudo apt-get install libxml2-dev (for package xml2)
- sudo apt-get install libcurl4-gnutls-dev (for package curl)
> install.packages("sparklyr")
- Install Spark from inside R/RStudio[1]
> library(sparklyr)
> spark_install()
[1] The above instructions install Spark to a local folder. You can also install to a system-wide folder, install a specific version of Spark, or use an existing installation of Spark.
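The alternatives mentioned in the footnote might look roughly like the following sketch. The version string "3.4" and the path "/opt/spark" are placeholders rather than recommendations, so those two calls are left commented out; choose a version from the printed list and a path that matches your own machine.

```r
library(sparklyr)

# See which Spark versions sparklyr is able to download.
available <- spark_available_versions()
print(available)

# Install one specific version. "3.4" is only a placeholder;
# pick a version that appears in the list printed above.
# spark_install(version = "3.4")

# Or connect to an existing Spark installation instead of installing one.
# "/opt/spark" is a hypothetical path; use the folder where Spark lives on your system.
# sc <- spark_connect(master = "local", spark_home = "/opt/spark")
```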