COMP 5070
Statistical Programming for Data Science
Assignment 2. DUE by 11:00pm CST, Sunday, November 3.
This assignment is worth 35% of your overall grade.
It is recommended to submit an R code file (or two files, if you decide to do the two tasks separately) and a Word document containing a nice-looking report. Please think about your reader: make it easy to see that your work has been done well.
If you have more than one file, you can submit your assignment as a compressed file (e.g. .zip, .tar.gz, .gz) using Gradebook. This compressed file should include ALL code needed to run your program and any other files you created yourself. You do NOT need to include the data files provided to you, as it can be safely assumed I have them too.
The assignment is out of 100 marks. To obtain the maximum available marks you should aim to:
Code all requested components (45 marks).
Aim for optimised code in terms of computational overhead (5 marks). It is not always possible to avoid loops; however, you should aim to avoid loops where possible (hint: in R, the apply suite of functions helps us vectorise code as much as possible).
Use a clear coding style (5 marks). Code clarity is an important part of your submission. Thus you should choose meaningful variable names and adopt the use of comments (you don't need to comment every single line, as this will affect readability); however, you should aim to comment at least each section of code.
Have the code run successfully (5 marks).
Output the information in a presentable manner as decided by yourself, and present the requested statistical analyses/discussions (35 marks).
Document code limitations including, but not limited to, the requested functionalities (5 marks).
The deadline for Assignment 2 falls in the last week of the study period; hence, there is no room for extensions. Use your time wisely. Plagiarism is a specific form of academic misconduct. Although the University encourages discussing work with others, and the Social Forum will support this, ultimately this assignment is to represent your individual work. If plagiarism is found, all parties will be penalised. You should retain copies of all assignment computer files used during development of the solution to Assignment 2. These files must remain unchanged after submission, for the purpose of checking if required.
Q1: Detecting Fraudulent Data (40 marks)
How To Detect Fraud in Large Data Sets?
source: businesslife.ba.com
In 1938, a physicist named Benford analysed 20,229 sets of numbers from all sorts of categories, including the areas of rivers, baseball statistics and numbers collected from an issue of Reader's Digest.
The result was somewhat surprising: numbers with a first digit of 1 appeared up to 30% of the time, much more than might be expected.
On the other hand, larger numbers appeared in the first-digit position less frequently. For example, 9 as a first digit occurred less than 5% of the time.
Benford's Law is often called the First Digit Law, and is used today quite widely in the area of detecting fraud, particularly in large data sets.
Background Information
While Benford's Law is not applicable to every single data set, the types of data sets that can be analysed using Benford's Law are widely categorised and include stock prices, electricity bills, street addresses, population records and geographical and cartographic data.
In this question we will investigate the use of Benford's Law on the distribution of a collection of data sets, including 10,000,000 randomly generated passcodes, to test for fraudulent data.
While there is no one foolproof test that can detect fraudulent data, Benford's Law forms an important component of a suite of tests that can be applied to data, and as such is highly useful. We will explore some ways to use Benford's Law in this context. As a matter of interest, the equation for Benford's Law, dictating the probability of the leading digit d (d = 1, ..., 9) of any number, is computed as

P(d) = log10(1 + 1/d)

Although the equation itself will not be used directly in this assignment, you can make use of the probability distribution displayed over:
Figure 1: Benford's Law bar plot of the percentage for each first digit. The percentages are 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1% and 4.6%, respectively.
To help decide whether a data set can be subjected to Benford's Law, a number of rules have been proposed (see Durtschi C, Hillison W, Pacini C (2004) The effective use of Benford's Law to assist in detecting fraud in accounting data. J Forensic Accounting 5: 17-34):
Distributions that can be expected to obey Benford's Law:
When the mean is greater than the median and the skew is positive
Numbers that result from mathematical combination of numbers, e.g. quantity x price
Transaction-level data, e.g. sales
Distributions that would not be expected to obey Benford's Law:
Where numbers are assigned sequentially, e.g. check numbers, invoice numbers
Where numbers are influenced by human thought, e.g. prices set by psychological thresholds ($1.99)
Accounts with a large number of firm-specific numbers, e.g. accounts set up to record $100 refunds
Accounts with a built-in minimum or maximum
These rules are not binding; however, they do provide useful guidelines.
We will assume, for the purpose of the analysis, that if someone is "cooking the books", their data entries would likely be random and they may not necessarily be following Benford's Law.
You are asked to write R code that implements Benford's Law and applies it, as requested below, to the data sets listed, including one you will generate yourself in an attempt to "cook the books".
You will need to install the R package benford.analysis and load it using the command

library(benford.analysis)

to answer some of the questions below. The hints below explain how to use this library.
Your submitted code should:
Output to the user relevant information about the analysis to follow e.g. an informative message.
Create an R function that computes the mean, median, skewness, kurtosis and Mean Absolute Deviation (MAD) statistics for any given data set. Be sure that you get the right statistics; there are many different MAD statistics. See the Hints section below.
For each data set below, output using your function:
The mean, median, skewness, kurtosis and MAD statistics
A bar chart similar to the green Benford's Law bar plot, overlaid with a visual representation of the Benford's Law proportions (or percentages, if you prefer).
Your report should include for each data set:
Descriptive statistics and your initial thoughts about whether the data is likely to conform to Benford's Law.
Results of the Benford analysis and your conclusion, again with relevant statistics and graphs.
At the end of this question, provide your thoughts about other ways to analyse this data using Benford's Law. For example, should you limit yourself to just the first digit? Is there a better way to split up the data sets? In the case of the simulated data set, does the result surprise you?
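As a starting point, the first-digit extraction and the requested descriptive statistics can be sketched in base R. This is an illustrative sketch, not a prescribed solution: the function names are our own, base R has no built-in skewness or kurtosis so the usual moment-based formulas are used, and the Benford-specific MAD is computed from digit proportions rather than raw values (see the Hints), so it is omitted here.

```r
# First significant digit of each value, ignoring sign, zeros and NAs.
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  floor(x / 10^floor(log10(x)))
}

# Descriptive statistics: moment-based sample skewness and excess kurtosis.
describe_data <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x); m <- mean(x); s <- sd(x)
  c(mean     = m,
    median   = median(x),
    skewness = sum((x - m)^3) / (n * s^3),
    kurtosis = sum((x - m)^4) / (n * s^4) - 3)
}
```

For the bar chart, one base-R option is barplot() of the observed first-digit proportions, with points() or lines() overlaying the Benford proportions log10(1 + 1/(1:9)).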
Hints and Data Sets
Full documentation of the R package benford.analysis can be found at
http://cran.r-project.org/web/packages/benford.analysis/index.html
Mean Absolute Deviation (MAD): The MAD is calculated by the benford.analysis library as

MAD = (1/K) * sum_{i=1}^{K} |AP_i - EP_i|

where K is the number of unique digits of interest (e.g. K = 9, thus i = 1, ..., 9, when considering the first digit of a number), AP_i is the actual proportion of each digit observed, while EP_i is the expected proportion of each digit. The table below is helpful for interpreting this statistic:
This abridged table was taken from: Nigrini, Mark. Benford's Law: Applications for Forensic Accounting, Auditing and Fraud Detection. John Wiley & Sons, 237 pages.
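As a sanity check on that MAD formula, it can be mirrored in a few lines of base R for the first-digit case (K = 9). This is only an illustrative sketch; in the assignment itself the benford.analysis package computes the MAD for you.

```r
# Expected first-digit proportions under Benford's Law: EP_i = log10(1 + 1/i)
benford_expected <- log10(1 + 1 / (1:9))

# MAD = (1/K) * sum_i |AP_i - EP_i|, for a vector of observed first digits 1..9
benford_mad <- function(first_digits) {
  actual <- tabulate(first_digits, nbins = 9) / length(first_digits)  # AP_i
  mean(abs(actual - benford_expected))
}
```

A sample whose digit frequencies closely match the Benford proportions yields a MAD near zero, which Nigrini's table would classify as close conformity.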
Data Set 1: The corporate payments data set, provided by benford.analysis. To load it, at the command prompt type

data(corporate.payment)

The data set contains 189,470 rows of data, including the headers, and contains payments for the year 2010 for a division of an American utility company. We are interested in the Amount column.
Data Sets 2 and 3: The expired loan amounts on Kiva.org. The data set contains 13,937 rows of data, including the header names, and can be downloaded from the Data page on the Course Website as the file allexpiredloans.csv. For this analysis we are interested in the variables loanamount (data set 2) and fundedamount (data set 3).
Data Set 4 (Cooking The Books!): You will need to generate your own data set of 5-digit pin codes, as a representation of possible pin codes a person might use for their credit card. You will need to generate 1,000,000 5-digit pin codes, that is, something like 12345 or 11111 or 89898 and so on. In general, a pin code could start with leading zeros, like 00011. However, Benford's Law applies to the first meaningful digit, so the last pin code can be treated as 11 only.
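One possible way to simulate this data set in base R is sketched below; the seed and the uniform sampling over 00000-99999 are our own illustrative choices, not a prescribed method. Pins are kept as strings so leading zeros survive, and the first meaningful digit is the first non-zero digit.

```r
set.seed(2024)  # illustrative seed, chosen arbitrarily
pins <- sprintf("%05d", sample(0:99999, 1e6, replace = TRUE))

# First meaningful digit: strip leading zeros, then take the first character.
# An all-zero pin ("00000") has no meaningful digit and yields NA.
first_meaningful <- function(pin) {
  suppressWarnings(as.integer(substr(sub("^0+", "", pin), 1, 1)))
}

digits <- first_meaningful(pins)
digits <- digits[!is.na(digits)]  # drop any "00000" pins
```

For example, "00011" yields 1 and "89898" yields 8, matching the treatment described in the question.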
Q2: Yelp Reviews (60 marks)
Summarising Online Reviews
source: yelp.com
Rate Us On Yelp!
Yelp is an online business that accepts and publishes reviews by anyone and everyone about local businesses, ranging from dentists to hairstylists, hotels and shopping.
A particularly popular business type to review on Yelp is eating establishments: restaurants, cafes and the like. In this question, we are going to analyse a number of Yelp reviews for restaurants, cafes, etc., and provide our statistical insights.
Data and Background Information
Yelpers have written over 71 million reviews to date, and due to Yelp's astounding popularity as a go-to site for recommendations (or warnings!), Yelp has become a very important site, particularly for small businesses, who can achieve success or close down based on their online reviews.
The data for this question and other useful files can be found on the course data page in the zipped file yelp.zip. The main data file is yelpreviews.csv. It contains 1,569,264 rows of data across 12 columns and is a whopping 122.5MB, according to my file explorer. This is the data you will analyse.
For your interest, the other data files are as follows. The data was generated using the Python script yelpreviewparse.py and took about 7 hours to run. This script was provided online at this very cool blog site, http://minimaxir.com/2014/09/one-star-five-stars/, and I have zipped the contents required to run this script, if you are interested (although there is no need to run it, since I've done the hard work!). The zipped file can be downloaded from the course website on the assignment page.
Most of the column names should be self-explanatory, and the user and business IDs have been de-identified. The column votesfunny and other similarly named columns contain counts of the number of votes a review received for being funny, useful or cool. The variables poswords and negwords contain a count of the number of positive or negative words found in the review (the files positivewords.txt and negativewords.txt contain a list of these words). The final column, netsentiment, is the numerical difference between the counts of poswords and negwords.
Your focus for this question will be on producing appropriate statistical analyses from the data, with a focus on better understanding characteristics of the numerical reviews and, in particular, differentiating between 1- and 2-star reviews and 5-star reviews.
You are asked to write R code that produces the requested output below.
Your submitted code should:
Output to the user relevant information about the analysis to follow e.g. an informative message.
Produce a statistical summary of the data for the columns stars, reviewlength, poswords, negwords and netsentiment. Discuss general findings from this statistical summary. Do you have any concerns taking a statistical summary of netsentiment?
Using the table command, produce one table each of the counts of positive words and the counts of negative words. From these tables, produce a plot of your choosing (bar, line, points, etc.) to display the first 20 entries from these tables. You may either produce one plot per table, or one plot overall; this is entirely your choice. Discuss the trends you see in this data.
Now repeat 3. for the data in netsentiment and share any insights you might have about the output. Is there any tendency for reviews to be geared towards an overall negative or positive sentiment?
Compute the average review length per star category, summarising this data using the mean, standard deviation, median and IQR, and any other information you might think is relevant. See the Hints section below for help with computing these statistics. Produce an appropriate visualisation of your choice (e.g. bar plot, line plot, etc.) of the average (either the mean or median) to display this summarised information. Explain your choice of average and discuss the general behaviour of the data you have presented. Are positive reviews lengthier than negative reviews, on average? For this question, a negative review can be considered to be a 1- or 2-star review, and a positive review a 4- or 5-star review.
Determine the business which yielded the worst and best ratings, and determine the users who gave the worst and best ratings. In both cases, if there are ties for best and/or worst, note how many tied in each category and discuss any relevant information obtained from this analysis.
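The per-star summary requested above can also be sketched in base R with tapply, as an alternative to the dplyr approach in the Hints. The toy data frame here is a hypothetical stand-in for yelpreviews.csv, with the column names stars and reviewlength following the assignment text.

```r
# Toy stand-in for the real data; column names follow the assignment text.
reviews <- data.frame(stars        = c(1, 1, 2, 5, 5, 5),
                      reviewlength = c(120, 80, 150, 40, 60, 50))

# mean, sd, median and IQR of review length for each star category
len_by_star <- tapply(reviews$reviewlength, reviews$stars, function(x)
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x)))
```

len_by_star is then a list indexed by star category, e.g. len_by_star[["1"]] holds the four statistics for 1-star reviews, ready for a bar or line plot of the chosen average.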
Your report should include:
An introduction describing the data and business case.
Results and brief discussion for the listed above analysis.
Conclusion.
Hints
In R you can easily group data using the group_by function. To access this function you'll need to install the dplyr package and load it using the library command. You can summarise the groups produced by the group_by function using the summarise function. E.g. if you have groups stored in grp, then the code below will produce numerical summaries as indicated:

grp.mn <- summarise(grp, mean = mean(reviewlength), sd = sd(reviewlength))

To plot the summaries, you would plot, e.g., grp.mn$mean.