[SOLVED] C python FIT5196-S1-2019 assessment 3

FIT5196-S1-2019 assessment 3
This is an individual assessment and is worth 30% of your total mark for FIT5196.
Due date: 11:55 PM, Wednesday, June 12, 2019
For this assessment, you are required to write Python (Python 2/3) code that integrates several datasets into a single schema and finds and fixes possible problems in the data. The inputs and outputs of this assessment are shown below:
Table 1. The input and output of the task
Inputs: vic_suburb_boundary.zip, gtfs.zip, Crimebylocation.xlsx, .csv, council.txt
Output: _solution.csv
Jupyter notebook: _ass3.ipynb
You are given multiple datasets in various formats, and the goal is to build housing information for Victoria, Australia. You are required to perform the following tasks.
Task 1: Data Integration (60%)
In this task, you are required to integrate these datasets into one with the following schema.
Table 2. Description of the final schema
COLUMN: DESCRIPTION
ID: A unique id for the property
Address: The property address
Suburb (20/100): The property suburb. The suburb must only be calculated using vic_suburb_boundary.zip. Default value: not available
Price: The property price
Type: The type of property
Date: The date the property was sold
Rooms: Number of bedrooms
Bathroom: Number of bathrooms
Car: The number of parking spaces of the property
LandSize: The area of the property
Age: The age of the property at the time of sale
Latitude: The latitude of the property
Longitude: The longitude of the property
train_station_id (15/100): The closest train station to the property that has a direct trip to the Southern Cross Railway Station. A direct trip is one with no connections (transfers) between the origin and the destination. Default value: 0
distance_to_train_station (5/100): The direct distance from the property to the closest train station that has a direct trip to the Southern Cross Railway Station. Default value: 0
travel_min_to_CBD (20/100): The average travel time (in minutes) from the closest train station (regional/metropolitan) that has a direct trip to the Southern Cross Railway Station, on weekdays (i.e. Monday-Friday), for trips departing between 7:00 and 9:30 am. For example, if there are 3 direct trips departing from the closest train station to the Southern Cross Railway Station on weekdays between 7:00 and 9:30 am, and they take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3. Default value: 0
over_priced? (10/100): A boolean feature indicating whether the price of the property is higher than the median price of similar properties (with respect to the bedrooms, bathrooms, parking_space, and property_type attributes) in the same suburb in the year of sale. Default value: -1
crime_A_average (7/100): The average of type A crime for the three years prior to the sale in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of type A crime for 2013, 2014 and 2015. Default value: -1
crime_B_average (7/100): The average of type B crime for the three years prior to the sale in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of type B crime for 2013, 2014 and 2015. Default value: -1
crime_C_average (6/100): The average of type C crime for the three years prior to the sale in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of type C crime for 2013, 2014 and 2015. Default value: -1
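The worked examples in the travel_min_to_CBD and crime_*_average descriptions can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the expected solution: the function names, the input shapes (a list of departure/arrival time strings, a dict mapping year to crime count), the inclusive time window, and the choice to return the default when any of the three years is missing are all assumptions. Whether the property's own price belongs in the list of similar prices is also left open by the specification.

```python
from statistics import mean, median


def to_minutes(hms):
    """Convert an "HH:MM:SS" string to minutes since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


def average_travel_minutes(trips, start="07:00:00", end="09:30:00", default=0):
    """Mean duration (minutes) of trips departing within [start, end].

    trips: (departure, arrival) time strings for weekday direct trips.
    This is a hypothetical input, assumed to be filtered from GTFS already.
    """
    lo, hi = to_minutes(start), to_minutes(end)
    durations = [
        to_minutes(arrival) - to_minutes(departure)
        for departure, arrival in trips
        if lo <= to_minutes(departure) <= hi  # inclusive window (assumption)
    ]
    return mean(durations) if durations else default


def crime_average(crime_by_year, sold_year, default=-1):
    """Average crime count over the three years before sold_year.

    crime_by_year: {year: count} for one crime type in one local
    government area (hypothetical shape). Returns the default if any of
    the three years is missing (assumption).
    """
    years = [sold_year - 3, sold_year - 2, sold_year - 1]
    if not all(year in crime_by_year for year in years):
        return default
    return mean([crime_by_year[year] for year in years])


def over_priced(price, similar_prices, default=-1):
    """True if price exceeds the median of similar_prices, else False.

    similar_prices: prices of similar properties in the same suburb and
    year (hypothetical input). Returns the default when the list is empty.
    """
    if not similar_prices:
        return default
    return price > median(similar_prices)
```

For instance, three trips departing in the window and taking 6, 7, and 8 minutes give an average of 7, and a property sold in 2016 uses the counts for 2013, 2014 and 2015.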
Task 2: Data Reshaping (15%)
In this task, you need to study the effect of different normalization/transformation methods (i.e. standardization, min-max normalization, log, power, and root transformation) on the Rooms, crime_C_average, travel_min_to_CBD, and property_age attributes. You need to observe and explain their effects, assuming that we want to build a linear model on price using these attributes as the predictors, and recommend which method(s) you think would work better on this data. When building the linear model, the same normalization/transformation method can be applied to each of these attributes.
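The five methods named above can be written as small functions over a list of numbers. The following is a minimal sketch using only the standard library; the function names are illustrative, the use of the population standard deviation and an exponent of 2 for the power transformation are assumptions, and constant columns (zero variance or zero range) are not handled.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Natural log; every value must be strictly positive."""
    return [math.log(v) for v in values]


def power_transform(values, exponent=2):
    """Raise each value to the given exponent (2 is an assumption)."""
    return [v ** exponent for v in values]


def root_transform(values):
    """Square root; every value must be non-negative."""
    return [math.sqrt(v) for v in values]
```

Note that rows still holding a default value of 0 or -1 would make the log and root transformations fail or become meaningless, so such rows would need to be handled before transforming.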
Task 3: Documentation and Methodology (25%)
The main focus of the documentation marking is the quality of your explanation of how you completed these tasks. Your notebook file should be well formatted, with proper sections and subsections.
Note 1: the output csv file must have exactly the same columns as specified in the schema. If you decide not to calculate any of the required attributes, then you must still have a column for that attribute in your final data-frame, with the default value as the value of all the rows. Please note that an output file which is not in the correct format, as specified in the integrated schema, will not be marked.
Note 2: the radius of the earth is still 6378 km!
Note 3: In Table 2, the numbers in the format (a/b) next to some of the column names are the marks allocated to that attribute. For example, the suburb attribute carries 20% of the total mark of Task 1. Please note that 10% of the total mark for Task 1 is allocated to any other issues that may occur during the data integration process.
Note 4: You can only use the vic_suburb_boundary.zip file to extract the suburb name of the property. Using other external datasets or packages (e.g., geopy) to directly get the suburb information will be penalized (this will result in 0 marks for the suburb attribute).
Note 5: for more information about GTFS data, please visit here, here, and here.
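Two of the notes above lend themselves to a short sketch: the Earth radius in Note 2 and the suburb restriction in Note 4. The specification does not say how the "direct distance" should be computed, so the haversine formula below is an assumption, as is the ray-casting point-in-polygon test for the suburb lookup. Reading the boundary polygons out of vic_suburb_boundary.zip is not shown; the `suburb_polygons` dict and the `stations` tuples are hypothetical inputs, and polygons with holes or multiple parts are not handled.

```python
import math

EARTH_RADIUS_KM = 6378  # value given in Note 2


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_station(lat, lon, stations):
    """Return (station_id, distance_km) of the nearest station.

    stations: iterable of (station_id, lat, lon) tuples, assumed to be
    already restricted to stations with a direct trip to Southern Cross.
    """
    return min(
        ((sid, haversine_km(lat, lon, s_lat, s_lon))
         for sid, s_lat, s_lon in stations),
        key=lambda pair: pair[1],
    )


def point_in_polygon(lon, lat, polygon):
    """Ray casting (even-odd rule); polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point
        # can be crossed by a ray cast towards increasing longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_suburb(lon, lat, suburb_polygons, default="not available"):
    """Return the first suburb whose polygon contains the point."""
    for name, polygon in suburb_polygons.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return default
```

With this radius, one degree of longitude along the equator comes out at roughly 111.3 km, which is a quick sanity check on the distance function.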
