SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTING
BIG DATA & DATA ANALYTICS LAB PROJECT 2
This lab project is based on a dataset about movie success in 2014 and 2015 by Ahmed et al. (2015), which is available on the online platform by Lichman (2013). Download the file movidata.csv from Blackboard and then complete the following exercises.
EXERCISE 1 (1 MARK) [R-CODE]
Use ggplot() to create a box plot that shows the number of screens on which each movie was initially launched in the US on the y-axis separately for 2014 and 2015. Note: Only include those observations that do not have a missing value (NA) for the variable screens (e.g., by using !is.na()).
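A minimal sketch of the pattern this exercise describes, run on a small made-up data frame rather than on moviedata. The object names toy and toy_complete and all values are assumptions for illustration; only the column names year and screens come from the dataset description.

```r
library(ggplot2)

# Toy stand-in for moviedata (values are invented for illustration only)
toy <- data.frame(
  year    = c(2014, 2014, 2014, 2015, 2015, 2015),
  screens = c(3000, NA, 2500, 1800, 3200, NA)
)

# Keep only the rows where screens is observed
toy_complete <- subset(toy, !is.na(screens))

# factor(year) gives one box per year instead of treating year as a continuous axis
p <- ggplot(toy_complete, aes(x = factor(year), y = screens)) +
  geom_boxplot() +
  labs(x = "Year", y = "Screens at US launch")
print(p)
```

Filtering before plotting keeps ggplot from warning about removed rows and makes it explicit which observations enter the plot.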
EXERCISE 2 (2 MARKS) [R-CODE]
Calculate the profit of each movie (profit = gross - budget) and add the results as a new variable profit to the moviedata dataframe. Use ggplot() to create a violin plot that shows the profit on the y-axis separately for ORIGINAL movies and SEQUEL movies (using the sequelcat variable). Use the YlOrRd colour palette from the RColorBrewer library to fill the violin plots (hint for spelling: YlOrRd stands for Yellow / Orange / Red). Add a boxplot on top of the violin plot and add a red point that indicates the mean value. Note: Only include those observations that do not have a missing value (NA) for the variable profit.
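The layering described above can be sketched as follows on simulated data. The data frame toy and its values are assumptions for illustration. scale_fill_brewer() is ggplot2's wrapper around the RColorBrewer palettes, and the fun argument of stat_summary() assumes ggplot2 3.3.0 or later (older versions used fun.y).

```r
library(ggplot2)

# Simulated stand-in for moviedata (values are invented for illustration only)
set.seed(1)
toy <- data.frame(
  sequelcat = rep(c("ORIGINAL", "SEQUEL"), each = 20),
  gross     = c(rnorm(20, 80e6, 30e6), rnorm(20, 150e6, 50e6)),
  budget    = c(rnorm(20, 40e6, 10e6), rnorm(20, 90e6, 20e6))
)
toy$gross[3] <- NA  # one missing value, so the NA filter has something to do

# profit = gross - budget; NA in either input propagates to profit
toy$profit <- toy$gross - toy$budget
toy_complete <- subset(toy, !is.na(profit))

p <- ggplot(toy_complete, aes(x = sequelcat, y = profit, fill = sequelcat)) +
  geom_violin() +                                   # distribution shape per group
  geom_boxplot(width = 0.1, fill = "white") +       # narrow box plot drawn on top
  stat_summary(fun = mean, geom = "point",
               colour = "red", size = 3) +          # red point at the group mean
  scale_fill_brewer(palette = "YlOrRd")
print(p)
```

Layers are drawn in the order they are added, so the box plot and the mean point sit on top of the violin.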
EXERCISE 3 (1 MARK) [R-CODE]
Use the subset() command to create a subset of the dataframe that only includes observations without missing values for budget, screens, and aggregate_followers. Name this data frame moviedatasub. Then, using the newly created data frame moviedatasub, use the custom winsor() function discussed in the lecture slides in week 3 to create a new variable likes_winsor based on the variable likes. Use a multiplier of 1.5.
To make sure that the winsorising worked, compare the two variables by creating simple box plots using the following commands.
with(moviedatasub, boxplot(likes))
with(moviedatasub, boxplot(likes_winsor))
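The lecture's winsor() function is not reproduced in this sheet. The sketch below therefore uses a hypothetical stand-in, winsor_sketch(), which assumes that winsorising means clamping values to the fences Q1 - multiplier * IQR and Q3 + multiplier * IQR. The function from the slides may differ in name, arguments, and definition. The toy data frame and its values are also invented.

```r
# Hypothetical stand-in for the lecture's winsor(); the IQR-fence rule is an assumption
winsor_sketch <- function(x, multiplier = 1.5) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr   <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr
  upper <- q[2] + multiplier * iqr
  pmin(pmax(x, lower), upper)  # pull values outside the fences back to the fences
}

# Toy stand-in for moviedata (values are invented for illustration only)
toy <- data.frame(
  budget              = c(1, 2, 3, 4, 5, 6, 7, 8, NA),
  screens             = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
  aggregate_followers = c(5, 5, 5, 5, 5, 5, 5, 5, 5),
  likes               = c(10, 12, 11, 13, 12, 14, 11, 500, 9999)
)

# Keep only rows that are complete on the three required variables
toysub <- subset(toy, !is.na(budget) & !is.na(screens) & !is.na(aggregate_followers))

toysub$likes_winsor <- winsor_sketch(toysub$likes, multiplier = 1.5)

# The extreme value 500 is pulled down to the upper fence; the rest are unchanged
print(toysub[, c("likes", "likes_winsor")])
```

Comparing the two columns side by side, as the box plots above do graphically, shows that only the outlying observations change.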
EXERCISE 4 (2 MARKS) [R-CODE]
Look up the cut command. Based on the dataset moviedatasub, create a new column ratingscat in the dataframe that describes the ratings category of a movie using the cut command. Distinguish between the following categories:
- negative (0 ≤ rating < 6)
- neutral (6 ≤ rating < 6.8)
- positive (6.8 ≤ rating < 10)
Use ggplot() to create a scatterplot for gross over likes_winsor, which you created in Exercise 3. Indicate the different ratings categories by colouring the points in the scatterplot with the "FantasticFox1" colour palette of the "wesanderson" package.
EXERCISE 5 (1 MARK) [R-CODE]
Based on the dataset moviedatasub, use the ddply() function of the package plyr to create a data frame with the means and standard deviations of profit, gross, and budget for the three different ratings categories (variable: ratingscat, cf. Exercise 4) and for the two different values of sequelcat (ORIGINAL / SEQUEL). Also include the number of observations N for each of the category combinations. The output should look like this:
EXERCISE 6 (2 MARKS) [R-CODE]
Based on the dataset moviedatasub, use Bartlett's test to test for variance homogeneity in the variable profit across the three different ratings categories (variable: ratingscat, cf. Exercise 4). In your own words, interpret the results of the test and decide whether we should assume that the variances are homogeneous.
Then, use a one-way Analysis of Variance (ANOVA) to test whether there is a difference in mean profit across the three different ratings categories and interpret the result in your own words. Conduct a post-hoc analysis to determine which groups are significantly different from each other. How does the result of the test of variance homogeneity affect the post-hoc analysis?
EXERCISE 7 (1 MARK) [R-CODE]
Based on the dataset moviedatasub, compare the mean profits for ORIGINAL and SEQUEL movies (variable: sequelcat). Which test should we use to test whether there is a significant difference and why? Conduct the test in R and interpret the result in your own words.
REFERENCES
Ahmed M, Jahangir M, Afzal H, Majeed A, Siddiqi I. Using Crowd-source based features from social media and Conventional features to predict the movies popularity. In Smart City/SocialCom/SustainCom (SmartCity), 2015 IEEE International Conference on 2015 Dec 19 (pp. 273-278). IEEE. https://ieeexplore.ieee.org/document/7463737
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
DATASET
moviedata: Conventional and Social Media Movies 2014 and 2015
Description
A dataset about the success of movies in 2014 and 2015.
Usage
moviedata
Format
A data frame with 231 observations on the following 14 variables.
movie: Name of the movie
year: Year of movie release
ratings: Rating of the movie (0-10)
genre: Identifier for the genre of the movie (e.g., action, adventure, drama)
gross: Gross world-wide income from the movie (in US$)
budget: Budget for the movie
screens: Number of screens that the movie was initially launched in on the opening weekend in the US
sequel: A number indicating whether the movie is a sequel or an original (individual) movie, where higher numbers indicate later sequels in a series. For instance, for Mission Impossible a sequel value of 5 indicates that this is the fifth movie in the series.
dummy_sequel: 0 = Original movie, 1 = Sequel movie
sentiment: A sentiment score assessed through an analysis of tweets about the movie on Twitter. 0 represents a neutral sentiment, a positive value represents a positive sentiment, and a negative value indicates a negative sentiment. The sentiment score for each movie was calculated by retrieving all tweets related to each movie, assigning the sentiment score to each of them and then aggregating the score.
views: Number of times the movie trailer was viewed on YouTube
likes: Number of likes the movie trailer received on YouTube
dislikes: Number of dislikes the movie trailer received on YouTube
comments: Number of times the movie trailer received a comment on YouTube
aggregate_followers: The aggregate number of actor followers, equal to the sum of followers of the top 3 cast members on Twitter
Source
Ahmed et al. (2015), obtained via Lichman (2013); see REFERENCES.
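As a pointer for the cut command that Exercise 4 asks you to look up: the rating categories there are half-open intervals (lower bound included, upper bound excluded), which corresponds to right = FALSE in base R's cut(). The sketch below uses a made-up vector of ratings on the 0-10 scale described above; it is an illustration of the argument, not a prescribed solution.

```r
# Made-up ratings that sit on and around the category boundaries
ratings <- c(0, 5.9, 6, 6.7, 6.8, 9.5)

# right = FALSE makes each interval closed on the left and open on the right:
# [0, 6), [6, 6.8), [6.8, 10)
ratingscat <- cut(ratings,
                  breaks = c(0, 6, 6.8, 10),
                  labels = c("negative", "neutral", "positive"),
                  right  = FALSE)

print(data.frame(ratings, ratingscat))
```

With these breaks a rating of exactly 10 falls outside the last interval and becomes NA. Adding include.lowest = TRUE would close the last interval at 10, which goes slightly beyond the category definition given in Exercise 4.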