FIT1043 Assignment 3 Semester 2, 2019
Hand-in Requirements:
1 Please hand in a PDF file containing your answers to all the questions, numbered correspondingly.
2 Your report should include the following:
Screenshots/images of the outputs/graphs you generate in order to
justify your answers to all the questions.
Copies of all the bash command lines and R scripts you use. If your answer
is wrong, you may still get half marks if your command line or script is close
to correct.
3 You need to explain what each part of a command does for all your answers. For instance, if the code you use is unzip tutorialdata.zip, you need to explain that the code is used to uncompress the zip file.
4 Please don't include the questions in the assignment. Doing so carries a 5% penalty.
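As an illustration of the part-by-part explanation asked for in requirement 3, here is a minimal sketch. The file scores.csv and its contents are hypothetical and are created only so that the example can be run; they are not part of the assignment data.

```shell
# Hypothetical sample data, created only so that this sketch is runnable.
printf 'alice,12\nbob,40\ncarol,7\n' > scores.csv

# sort          : reorder the lines of scores.csv
#   -t','       : treat the comma as the field separator
#   -k2,2nr     : sort on field 2 only, numerically (n), largest first (r)
# head -n 1     : keep only the first line of the sorted output
top=$(sort -t',' -k2,2nr scores.csv | head -n 1)
echo "$top"
```

Each command and each option gets its own short explanation, which is the level of detail a marker would likely expect.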
Data:
The dataset for this assignment is in the Google shared drive:
https://drive.google.com/file/d/15icdh6UTmJ7OJ1ytMk gp0OT4G8ZnqSm/view?usp=sharing
The dataset contains Facebook posts from 15 of the top mainstream media sources (e.g., ABC, BBC, etc.) from 2012 to 2016.
Note: This is a large file, so your best bet is to download it while in the lab/studio and do the assignment there. You will need to use a Linux machine, a Mac terminal, or Cygwin on a Windows machine for this.
Assignment Tasks:
There are two tasks that you need to complete for this assignment. Students who complete only Tasks A1-A9 and B1 can get a maximum of Distinction. Students who attempt Tasks A10 and B2 (Challenge Questions) will be showing critical analysis skills and a deeper understanding of the task at hand, and can achieve the highest grade. You need to use the Unix shell and R to complete the tasks.
Task A: Investigating Facebook Data using shell commands
Download the file FBDataset.csv.zip from the link above. Use a Unix shell to manipulate the file and answer the following questions.
1 Decompress the file. How big is it?
2 What delimiter is used to separate the columns in the file and how many columns are there?
3 The 2nd column is the unique identifier for a Facebook post. Print out the names of the other columns.
4 How many unique pages are there?
5 What is the date range for Facebook posts in this file? Assume that the data is in order.
6 When was the first mention in the file regarding Italian Dishes and what was the post name?
7 How many times is Donald Trump mentioned in the file? How did you find this? (Do not ignore the case, i.e., lower/upper case.)
8 What about Barack Obama? Who is more popular on Facebook, Obama or Trump? (Do not ignore the case.)
9 Select the posts where Trump (ignore the case) is mentioned in the post content and the number of likes is greater than 100. Then generate a new file containing postid and the sorted likecount, and name it trump.txt. You need to add a screenshot of the first 5 rows and the column headers to your report.
10 Challenge: Find the total number of lovecount and angrycount for Donald Trump and Barack Obama separately. Who has more positive feeling among people? Justify your answer.
Hint 1: you will need to search online to find how to sum a column of numbers using awk.
Hint 2: You will need to consider both love and angry count when justifying your answer.
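The column-summing technique mentioned in Hint 1 can be sketched as follows. This is only an illustration on made-up data: the file counts.csv, its comma delimiter, and its column layout are assumptions, and the real dataset may need a different field separator and field number.

```shell
# Hypothetical sample data: a header row followed by name,count pairs.
printf 'name,count\na,5\nb,10\nc,2\n' > counts.csv

# -F','        : split each input line on commas
# NR > 1       : skip the header (record number 1)
# sum += $2    : add the second field to a running total
# END { ... }  : after the last line has been read, print the total
total=$(awk -F',' 'NR > 1 { sum += $2 } END { print sum }' counts.csv)
echo "$total"   # prints 17
```

The same pattern works on the output of another command if awk reads from a pipe instead of a file.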
Task B: Graphing the Data in R
1 We want to consider how the amount of discussion regarding Donald Trump varies over the time period covered by the data file. To answer this question, you will need to extract the timestamps for all posts referring to Trump using the shell. You will then need to read them into R and generate a histogram.
Hint: To read the data into R, first generate a file containing only the timestamp column as text. Then read the file into R as a CSV. R will not recognise the strings as timestamps automatically, so you'll need to convert them from text values using the strptime function. Instructions on how to use the function are available here: https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strptime
You will need to write a format string, starting with %b, to tell the function how to parse the particular date/time format in your file. What format string do you need to use?
A. Once you have converted the timestamps, use the hist function to plot the data in R.
B. The plot has a bit of an unusual shape. Describe the pattern you see.
2 Challenge: In this question, we want to look at a specific content type that influences engagement on Facebook. To make this task easier, we will specifically look at the number of comments posted against each post type (event, link, photo, status and video) for foxnews.
A. Draw a boxplot to show the distribution of comments made against each type of post (event, link, photo, status and video) created by foxnews. What can you infer from this plot? Which is the most engaging post type?
B. You may have noticed that the presence of outliers affects the readability and interpretation of the data in the boxplot. Redraw the boxplot after filtering out values (commentscount) greater than 10,000.
C. Which type of post (event, link, photo, status or video) has on average been most effective for foxnews? In other words, which post type has the highest median comment count?
Good Luck!