Homework 2: Mining Time Series – do the top 5 countries with the most cumulative
COVID-19 cases demonstrate similar patterns?

Summary
We will continue to explore the data we used in the Time Series lab from the Johns Hopkins
University CSSE COVID-19 dataset. However, this time, we are interested in the number of
daily new cases exclusively from the top 5 countries that have the most cumulative cases
as of August 21, 2020.
To explore and analyze this dataset, this assignment will focus on extracting the seasonal
component from the countries’ time series, computing the similarity between them, and
calculating the Dynamic Time Warping (DTW) Cost.

Data
For this assignment, we will be reusing the time_series_covid19_confirmed_global.csv file
from the Time Series lab.

Packages
We recommend using the following Python packages in this assignment:
● numpy
● pandas
● matplotlib
● statsmodels
● math

Assignment Structure
This homework is divided into the following parts:
Part 1: Load & Transform the Data
Part 2: Extract Seasonal Components
Part 3: Time Series Similarities
Part 4: Dynamic Time Warping (DTW) Cost

Part 1: Load & Transform the Data
a) [15 points] To begin, create a function called `load_data` that reads in the CSV file and
produces a `pd.DataFrame` that looks like:
[example DataFrame output; column names shown as “?”]
where
● the index of the DataFrame is a `pd.DatetimeIndex`;
● the column names “?” are the top 5 countries with the most cumulative cases as of
August 21, 2020, sorted in descending order from left to right;
● the values of the DataFrame are daily new cases; and
● the DataFrame doesn’t contain any `NaN` values.
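The properties listed above can be obtained with a short chain of pandas operations. The following is only a sketch, assuming the CSV follows the usual JHU CSSE layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date in m/d/yy form). The parameters `csv_source`, `cutoff` and `top_n`, as well as the tiny in-memory CSV with countries A, B and C, are illustrative assumptions and not part of the assignment's required signature.

```python
import io
import pandas as pd

def load_data(csv_source, cutoff="2020-08-21", top_n=5):
    """Sketch: daily new cases for the top_n countries by cumulative cases at cutoff."""
    raw = pd.read_csv(csv_source)
    # Sum provinces/states so that each country has a single cumulative series.
    by_country = (
        raw.drop(columns=["Province/State", "Lat", "Long"])
           .groupby("Country/Region")
           .sum()
    )
    # Transpose so that rows are dates, then parse the m/d/yy labels.
    cumulative = by_country.T
    cumulative.index = pd.to_datetime(cumulative.index, format="%m/%d/%y")
    cumulative = cumulative.loc[:cutoff]
    # Rank countries by cumulative cases on the last retained day (descending).
    top = cumulative.iloc[-1].nlargest(top_n).index
    # Differencing turns cumulative counts into daily new cases; the first
    # row has no predecessor, becomes NaN, and is dropped.
    return cumulative[top].diff().dropna()

# Tiny in-memory stand-in for the real file (made-up numbers).
toy_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",A,0,0,1,3,6\n"
    "X,B,0,0,0,1,1\n"
    "Y,B,0,0,0,2,6\n"
    ",C,0,0,0,0,1\n"
)
demo = load_data(toy_csv, cutoff="2020-01-24", top_n=2)
print(demo)
```

Dropping the first row after differencing is what turns the 213 dates from January 22 to August 21, 2020 into 212 rows.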
This function should return a `pd.DataFrame` of shape (212, 5), whose index is a
`pd.DatetimeIndex` and whose column labels are the top 5 countries.
b) [5 points] Then, using your newly created `load_data` function, plot one line for each of the
top 5 countries, where the x-axis is the date and the y-axis is the number of daily new cases.
Please do so within one figure.

Part 2: Extract Seasonal Components
Recall from lecture and lab that an additive Seasonal Decomposition decomposes a time series
into the following components:
Y(t) = T(t) + S(t) + R(t)
where T(t) represents trends, S(t) represents seasonal patterns and R(t) represents residuals. In
the rest of the assignment, we will work with the seasonal component S(t) to understand the
similarities among the seasonal patterns of the five time series we have, so let’s write a function
that extracts this very seasonal component.
a) [10 points] Complete a function, `sea_decomp`, that accepts a `pd.DataFrame` and returns
another `pd.DataFrame` of the same shape that looks like:
[example DataFrame output; column names shown as “?”]
where
● the index of the DataFrame is a `pd.DatetimeIndex`;
● the column names “?” are the top 5 countries with the most cumulative cases as of
August 21, 2020, sorted in descending order from left to right;
● the values of the DataFrame are the seasonal components S(t) as returned by the
`seasonal_decompose` function from the `statsmodels` package; and
● the DataFrame doesn’t contain any `NaN` values.
This function should return a `pd.DataFrame` of shape (len(df), 5), whose index is a
`pd.DatetimeIndex` and whose column labels are the top 5 countries.
b) [5 points] Then, using this function, please plot one line for each country in the top 5
showing the seasonal component. You should have a total of 5 line graphs, where the x-axis is
the date and the y-axis is the seasonal component.

Part 3: Time Series Similarities
3.1 Euclidean Distance [20 points]
Now, we may start to ask questions like, “which of the top 5 countries is the most
similar to Country A in terms of seasonal patterns?”. In addition to the seasonal components
that reflect seasonal patterns, we also need a measure of similarity between two time series in
order to answer questions like this. One such measure is the good old Euclidean Distance.
Recall that the Euclidean Distance between two vectors x and y is the length of the vector x – y:
$$d(x, y) = \|x - y\| = \sqrt{\sum_i (x_i - y_i)^2}$$
a) [15 points] Complete a function, `calc_euclidean_dist`, that accepts a `pd.DataFrame`,
whose columns are time series for each country, and that returns all pairwise Euclidean
Distances among these time series, similar to the following:
[example DataFrame output; index and column names shown as “?”]
where
● the index and the column names “?” are the top 5 countries with the most cumulative
cases as of August 21, 2020, sorted in descending order from top to bottom and from left
to right; and
● the values of the DataFrame are pairwise Euclidean Distance, for example,
`233760.757213` is the Euclidean Distance between the time series of the Rank 1
country and the Rank 2 country.
This function should return a `pd.DataFrame` of shape (5, 5) whose index and column
labels are the top 5 countries.
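One possible way to build such a matrix, offered as a minimal sketch and not as the required implementation, is NumPy broadcasting: subtract every series from every other series, square, sum over time, and take the square root. The three-column toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def calc_euclidean_dist(df):
    """Sketch: pairwise Euclidean distances between the columns of df."""
    values = df.to_numpy(dtype=float).T             # shape (n_series, n_times)
    diff = values[:, None, :] - values[None, :, :]  # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # length of each difference vector
    return pd.DataFrame(dist, index=df.columns, columns=df.columns)

toy = pd.DataFrame({"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [0.0, 1.0]})
dist = calc_euclidean_dist(toy)
print(dist)
```

The result is symmetric with zeros on the diagonal, since every series is at distance 0 from itself.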
b) [5 points] Then, use this new function to calculate the pairwise Euclidean Distance matrix for
the extracted seasonal components from the top 5 countries with the most cumulative cases.

3.2 Cosine Similarity [20 points]
Another commonly used similarity measure is the Cosine Similarity. Recall that the Cosine
Similarity between two vectors x and y is the cosine of the angle between x and y:
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
a) [15 points] Complete a function, `calc_cos_sim`, that accepts a `pd.DataFrame`, whose
columns are the time series for each country, and that returns all pairwise Cosine Similarities
among these time series, similar to the following:
[example DataFrame output; index and column names shown as “?”]
where
● the index and the column names “?” are the top 5 countries with the most cumulative
cases as of August 21, 2020, sorted in descending order from top to bottom and from left
to right; and
● the values of the DataFrame are pairwise Cosine Similarity, for example, `0.898664` is
the Cosine Similarity between the time series of the Rank 1 country and the Rank 2
country
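A minimal sketch of the pairwise computation, assuming that no series is identically zero (otherwise the norm in the denominator would be 0): the matrix of dot products is divided by the outer product of the column norms. The toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def calc_cos_sim(df):
    """Sketch: pairwise cosine similarities between the columns of df."""
    values = df.to_numpy(dtype=float).T     # shape (n_series, n_times)
    norms = np.linalg.norm(values, axis=1)  # length of each series vector
    # Assumes every norm is non-zero.
    sim = (values @ values.T) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)

toy = pd.DataFrame({"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [2.0, 0.0]})
sim = calc_cos_sim(toy)
print(sim)
```

Here A and C point in the same direction (similarity 1) even though their magnitudes differ, while A and B are orthogonal (similarity 0). This is why Cosine Similarity and Euclidean Distance can rank the same pairs differently.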
This function should return a `pd.DataFrame` of shape (5, 5), whose index and column
labels are the top 5 countries.
b) [5 points] Now, use this new function to calculate the pairwise Cosine Similarity between
seasonal patterns.

Part 4: Dynamic Time Warping (DTW) Cost
4.1 Define a Function to Calculate DTW Cost [10 points]
Last but not least, the cost of aligning two time series can also be used as a similarity measure.
Two time series are more similar if it incurs less cost to align them. One of the commonly used
alignment costs is the Dynamic Time Warping (DTW) cost, which we will explore in this problem.
Recall from lecture that the DTW cost is defined by recursive relations, where we define the
local cost
$$d(x_i, y_j) = (x_i - y_j)^2.$$
a) [10 points] With reference to the demo of the DTW algorithm in the lecture slides, implement
a function, `calc_pairwise_dtw_cost`, below that computes the DTW cost for two time series.
We don’t take the square root of the results just yet, until later when we compare the
DTW costs with the Euclidean Distance.
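For orientation, one commonly used form of the recurrence is D(i, j) = d(x_i, y_j) + min{D(i-1, j), D(i, j-1), D(i-1, j-1)}, with D(0, 0) = 0 and all other border entries set to infinity. The sketch below fills the accumulated-cost table under that assumption; the lecture's formulation may differ in its details, so treat this as an illustration only.

```python
import numpy as np

def calc_pairwise_dtw_cost(x, y, ret_matrix=False):
    """Sketch: DTW cost between x and y with squared-difference local cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Extra first row/column of +inf so the recurrence needs no special cases.
    acc = np.full((len(y) + 1, len(x) + 1), np.inf)
    acc[0, 0] = 0.0
    for j in range(1, len(y) + 1):
        for i in range(1, len(x) + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            # Cheapest way to reach (j, i): from above, from the left, or diagonally.
            acc[j, i] = local + min(acc[j - 1, i], acc[j, i - 1], acc[j - 1, i - 1])
    cost_matrix = acc[1:, 1:]  # shape (len(y), len(x))
    return cost_matrix if ret_matrix else float(cost_matrix[-1, -1])

# y repeats one value of x, so warping aligns them at no cost.
matrix = calc_pairwise_dtw_cost([1, 2, 3], [1, 2, 2, 3], ret_matrix=True)
cost_same_shape = calc_pairwise_dtw_cost([1, 2, 3], [1, 2, 2, 3])
cost_offset = calc_pairwise_dtw_cost([0, 0], [1, 1])
print(matrix.shape, cost_same_shape, cost_offset)
```

The overall cost is the bottom-right entry of the matrix. No square root is taken here.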
This function should return EITHER a `np.ndarray` of shape (len(y), len(x)) which
represents the DTW cost matrix, OR a single `float` that represents the overall DTW cost,
depending on whether the parameter `ret_matrix` is `True`.

4.2 Compute Pairwise DTW Cost [15 points]
Now let’s compute all pairwise DTW costs for our five time series.
a) [10 points] Implement a function, `calc_dtw_cost`, below that accepts a `pd.DataFrame`,
whose columns are the time series for each country, and that returns all pairwise DTW costs
among these time series, similar to the following:
[example DataFrame output; index and column names shown as “?”]
where
● the index and the column names “?” are the top 5 countries with the most cumulative
cases as of August 21, 2020, sorted in descending order from top to bottom and from left
to right; and
● the values of the DataFrame are pairwise DTW costs, for example, `9.575974e+09` is
the DTW cost between the time series of the Rank 1 country and the Rank 2 country
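A possible shape for the pairwise loop is sketched below. To keep the example self-contained it uses a compact helper, `dtw_cost`, which is a hypothetical stand-in for a two-series DTW function and assumes the standard three-step recurrence with squared local cost. Because that recurrence and the local cost are both symmetric, each pair is computed once and mirrored across the diagonal.

```python
import itertools
import numpy as np
import pandas as pd

def dtw_cost(x, y):
    """Hypothetical compact helper: DTW cost with squared-difference local cost."""
    acc = np.full((len(y) + 1, len(x) + 1), np.inf)
    acc[0, 0] = 0.0
    for j, yj in enumerate(y, start=1):
        for i, xi in enumerate(x, start=1):
            acc[j, i] = (xi - yj) ** 2 + min(
                acc[j - 1, i], acc[j, i - 1], acc[j - 1, i - 1]
            )
    return float(acc[-1, -1])

def calc_dtw_cost(df):
    """Sketch: pairwise DTW costs between the columns of df."""
    cols = list(df.columns)
    out = pd.DataFrame(0.0, index=cols, columns=cols)  # diagonal stays 0
    for a, b in itertools.combinations(cols, 2):
        cost = dtw_cost(df[a].to_numpy(), df[b].to_numpy())
        out.loc[a, b] = out.loc[b, a] = cost           # mirror across the diagonal
    return out

toy = pd.DataFrame({"A": [0.0, 0.0, 0.0], "B": [1.0, 1.0, 1.0], "C": [0.0, 1.0, 0.0]})
costs = calc_dtw_cost(toy)
print(np.sqrt(costs))  # square roots, comparable in scale to Euclidean distances
```

For series of equal length, the diagonal alignment is one admissible warping path, so the square root of the DTW cost can never exceed the Euclidean Distance between the same two series.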
This function should return a `pd.DataFrame` of shape (5, 5), whose index and column
labels are the top 5 countries.
b) [5 points] Now, use this function to calculate the pairwise DTW costs between seasonal
patterns. Please take the square root so that we can compare it with the Euclidean
Distance.
What can you say about the similarities among these seasonal patterns? Do the results
of the pairwise Euclidean Distance, Cosine Similarity and DTW Cost calculations tell the
same story?

Submission
All submissions should be made electronically.
Here are the main deliverables:
● A PDF version of your executed Jupyter Notebook
● The actual Jupyter notebook, so that we can check your results
Please make sure to provide appropriate conclusions drawn from the code/results throughout
the notebook.