Machine Learning for Financial Data
December 2020
FEATURE ENGINEERING (CONCEPTS PART 1)

Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 2
Feature Engineering
Contents
Financial Data Sources
What is Feature Engineering
Feature Understanding
Feature Improvement

Financial Data Source: Yahoo Finance

Yahoo Finance is one of the reliable sources of stock market data
Yahoo Finance (hk.finance.yahoo.com) provides market summaries, historical & current quotes, and news feeds about companies
Historical & current stock prices at different frequencies (daily, weekly, monthly)
Calculated metrics
e.g., the beta, a measure of the volatility of an individual asset in comparison to the volatility of the entire market
Financial data of a company since its listing on the stock market
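
The beta mentioned above is commonly computed as the covariance of the asset's returns with the market's returns, divided by the variance of the market's returns. A minimal sketch of that calculation follows; the function name beta and the return values are made up for illustration and are not taken from Yahoo Finance.

```python
import numpy as np

def beta(asset_returns, market_returns):
    """Return cov(asset, market) / var(market) for two equal-length return series."""
    asset = np.asarray(asset_returns, dtype=float)
    market = np.asarray(market_returns, dtype=float)
    # np.cov returns the 2x2 sample covariance matrix of the two series
    cov_matrix = np.cov(asset, market)
    return cov_matrix[0, 1] / cov_matrix[1, 1]

# made-up daily returns, for illustration only
market = [0.010, -0.020, 0.015, 0.005, -0.010]
asset = [2 * r for r in market]   # an asset that moves twice as much as the market

print(beta(asset, market))
```

A beta above 1 suggests the asset has moved more than the market over the sample; a beta below 1 suggests it has moved less.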

Adjusted closing price: the closing price adjusted for both dividends and splits

Python: Programmatic Access to Financial Data
# display the output of plotting commands inline
%matplotlib inline
# use the retina display mode, i.e. render higher-resolution images
%config InlineBackend.figure_format = 'retina'
# import the plotting module of the matplotlib package and bind it to the name plt
import matplotlib.pyplot as plt
# import the warnings module to control which warnings are displayed
import warnings
# customize the display style
# (on matplotlib 3.6+ this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# set the dots per inch (dpi) from the default 100 to 300
plt.rcParams['figure.dpi'] = 300
# suppress warnings related to future versions
warnings.simplefilter(action='ignore', category=FutureWarning)

Python: Downloading Data as DataFrame
# import the relevant packages
import pandas as pd
import yfinance as yf
# download the data from 2010-01-01 to 2020-12-31
# use MLCO as the ticker of Melco Resorts & Entertainment
# disable the showing of the progress bar using progress=False
data = yf.download('MLCO', start='2010-01-01',
                   end='2020-12-31', progress=False)
# inspect the data using formatted print
print(f'Downloaded {data.shape[0]} rows of data.')
data.head()

Financial Data Source: Quandl

Quandl is a provider of alternative data products for investment professionals
Quandl delivers market data from hundreds of sources via API, or directly into Python, R, Excel, and many other tools
Featured data includes
End of Day US Stock Prices, Core US Fundamentals Data, US Equity Historical & Option Implied Volatilities, Continuous Futures, Trading Economics, BNC Digital Currency Indexed EOD, Global Fundamentals Data, Global Index Prices
Before downloading data, create an account (https://www.quandl.com)
Obtain the API key from the account profile (https://www.quandl.com/account/profile)
Search for datasets (https://www.quandl.com/search)

Create a Quandl account

Python: Programmatic Data Access
# import the relevant packages
import pandas as pd
import quandl as qd
# use the API key generated during account creation
qd.ApiConfig.api_key = 'GEM..xwB'
# provide the DATASET/TICKER of the dataset
data = qd.get('HKEX/06883',
              start_date='2010-01-01', end_date='2020-12-31')
# inspect the data
print(f'Downloaded {data.shape[0]} rows of data.')
data.head()

What is Feature Engineering

Feature Engineering
Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

Feature engineering is about making data meaningful to the machine learning model
Raw and Partially Processed Data
Feature engineering can be applied to data at any stage and deals with raw & partially processed data typically in the form of observations (rows) and attributes (columns).
Meaningful Features
A feature is an attribute of data that is meaningful to the machine learning process. Some attributes can be unhelpful or even harmful to the machine learning process.
Better Representation
Data always serve to represent a specific problem in a specific domain.
The rationale is to transform data so that it better represents the bigger problem at hand.
Model Performance Improvement
The eventual goal of feature engineering is to obtain data that the learning algorithms will be able to extract patterns from and use in order to obtain better results.

Raw data are often in a state that cannot be directly consumed by machine learning algorithms
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud? | Membership
-------|-----------|-------------|-----------|---------|------|--------------|--------|-----------
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No     | Silver
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes    | Silver
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes    | Bronze
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No     | Gold
Adams  | 20000     | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No     | Silver
Jones  | 3,250.11  | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No     | Silver
Mary   | 8,156.20  | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes    | Gold
Max    | 7475,11   | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No     | Premium
Peter  | 500.00    | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No     | Bronze
Anson  | 7,475.11  | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes    | Gold
Figure labels: Observations (rows), Feature and Target (columns)

Feature engineering can be carried out in steps, but different schools of thought structure the steps differently
Feature Understanding: interpreting the data and identifying its qualitative and quantitative states
Feature Improvement: cleaning data values and refilling missing data values
Feature Selection: selecting features to reduce the noise in the dataset
Feature Construction: encoding features and creating new features via feature interactions
Feature Transformation: changing the dataset's fundamental structure to prioritize impactful features
Feature Learning: deploying deep learning to help identify the key features in the dataset

Feature Understanding

Correctly identifying numerical & categorical variables involves looking at the data types & inspecting their values
ALL data: UNSTRUCTURED data can be transformed into STRUCTURED data
STRUCTURED data: QUALITATIVE / CATEGORICAL (NOMINAL, ORDINAL) or QUANTITATIVE / NUMERICAL (INTERVAL, RATIO)
Categorical variables
Values are selected from a group of categories, also called labels
Usually of type object or string
Nominal examples: gender (i.e. male and female)
Ordinal examples: school grades (e.g. A+, A, ...)
Discrete numerical variables
The set of possible values is finite
Generally whole numbers (e.g. 1, 2, ...)
Usually of type int
Examples: number of children, number of pets
Continuous numerical variables
Values may take any number within a range
Usually of type float
Examples: price of a product, income, house price, or interest rate
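
As a rough illustration of inspecting data types to separate variable kinds, the sketch below uses a small made-up pandas DataFrame. The column names and values are assumptions for illustration only, and the dtype-based split is a first pass that still needs a look at the actual values.

```python
import pandas as pd

# made-up data covering the variable kinds described above
df = pd.DataFrame({
    "gender": ["male", "female", "female"],     # nominal
    "grade": ["A+", "A", "B"],                  # ordinal
    "children": [0, 2, 1],                      # discrete
    "income": [52000.0, 61000.5, 48000.25],     # continuous
})

# numeric dtypes suggest numerical variables; everything else is a categorical candidate
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# among the numerical columns, integer dtypes suggest discrete variables
discrete_cols = [c for c in numerical_cols if pd.api.types.is_integer_dtype(df[c])]

print(df.dtypes)
print(categorical_cols, numerical_cols, discrete_cols)
```

The data types alone cannot distinguish nominal from ordinal variables, which is why the values themselves also have to be inspected.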

Understanding the Structure of Data

Structured Data
Any data that can be stored, accessed, and processed in a fixed format
The format is known in advance
e.g. data stored in a relational database
Also referred to as Schema-on-Write
Semi-Structured Data
Semi-structured data lacks a fixed, rigid structure
There is no separation between the data and the schema, i.e. the structure is self-describing
e.g. XML files, JSON files, web pages in HTML, RDF files
Unstructured Data
Any data with unknown form
e.g. heterogeneous data sources containing simple text files, images & videos
Also referred to as Schema-on-Read

Data in a relational database or an Excel spreadsheet are considered structured data
Figures: data in a relational database; data in an Excel spreadsheet

Semi-structured data embeds information about the data structure and the data contents in the same document

Web Page in HTML format
Your Title Here
Link Name is a link to another nifty site
This is a Header
This is a Medium Header
Send me mail at [email protected].
This is a new paragraph!
This is a new paragraph!
This is a new sentence without a paragraph break, in bold italics.

Data-Value Pairs in JSON format
{
  "quiz": {
    "sport": {
      "q1": {
        "question": "Which one is correct team name in NBA?",
        "options": [
          "New York Bulls",
          "Huston Rocket"
        ],
        "answer": "Huston Rocket"
      }
    }
  }
}
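
Because a JSON document carries its own field names, it can be read without a schema defined in advance. The sketch below parses the quiz example with Python's standard json module and flattens the nested keys into one row of dotted paths; the flatten helper is a hypothetical illustration, not part of any library.

```python
import json

document = """
{"quiz": {"sport": {"q1": {
    "question": "Which one is correct team name in NBA?",
    "options": ["New York Bulls", "Huston Rocket"],
    "answer": "Huston Rocket"}}}}
"""

record = json.loads(document)   # nested dicts and lists

def flatten(node, prefix=""):
    """Flatten nested dicts into {dotted.path: value} pairs."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

row = flatten(record)
print(row["quiz.sport.q1.answer"])
```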

Unstructured data are not really unstructured but structured in a way that is less convenient to manipulate
Examples: an image in JPEG format; audio in WAVE format

Most unstructured data can be transformed into structured data through a few manipulations
Images are considered unstructured data
Images can be decomposed into 3 color channels: red, green, and blue
After the transformation, images are represented as structured data in the form of layers of matrices containing the color intensity value for each pixel
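
A minimal sketch of this image transformation, assuming the image is already held as a NumPy array of shape (height, width, channel); the 2x2 pixel values are made up for illustration.

```python
import numpy as np

# a made-up 2x2 RGB image: shape (height, width, channel), intensities 0-255
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# one matrix of color intensities per channel
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]

# flatten all pixels into a single row of 2 * 2 * 3 = 12 numeric features
row = image.reshape(-1)

print(red)
print(row.shape)
```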

Training autonomous cars using semantic segmentation of road scenes

CSV (Comma-Separated Values) Format
Text file
Used to represent a tabular structure
Each line is a record
Each record has multiple columns separated by commas
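
The structure described above can be read with Python's standard csv module. This sketch assumes the first line of the file is a header row naming the columns, a common convention rather than a requirement of the format; the sample text is made up.

```python
import csv
import io

# a small in-memory CSV: a header line followed by one record per line
text = "Name,Age,Education\nDaniel,22,Secondary\nVicky,64,Graduate\n"

reader = csv.DictReader(io.StringIO(text))
records = list(reader)

for record in records:
    print(record["Name"], record["Age"], record["Education"])
```

Every value is read as a string, so numeric columns such as Age still need to be converted afterwards.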

Feature Improvement

Feature improvement is about altering data values and removing dataset columns/rows
Feature improvement involves both feature cleaning and removal
Cleaning alters columns and rows in the dataset
Removal takes columns and rows away from the dataset
Possible actions include
Identifying missing values
Removing harmful data
Imputing (filling in) missing values
Normalizing/standardizing data
Z-score normalization, min-max scaling, L1 & L2 normalization
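
The normalization techniques listed above can be written out directly with NumPy. This is a sketch on made-up numbers: z-score and min-max scaling are applied to one feature column, while L1 and L2 normalization are shown on a single observation vector, which is how they are commonly applied.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one feature column (made-up values)

z_score = (x - x.mean()) / x.std()               # mean 0, standard deviation 1
min_max = (x - x.min()) / (x.max() - x.min())    # rescaled to the range [0, 1]

row = np.array([3.0, 4.0])                 # one observation (made-up values)
l1 = row / np.abs(row).sum()               # absolute values sum to 1
l2 = row / np.sqrt((row ** 2).sum())       # Euclidean length 1

print(z_score, min_max, l1, l2)
```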

Feature understanding focuses on data values and data value induced structural changes
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
-------|-----------|-------------|-----------|---------|------|--------------|-------
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | 20000     | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
Jones  | 3,250.11  | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | 8,156.20  | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | 7475,11   | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | 500.00    | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | 7,475.11  | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns)

Missing data is a common problem in datasets and needs to be dealt with before applying any ML model
Missing data refers to the absence of values for certain observations and is an unavoidable problem in most data sources
e.g. with survey data, some observations may not have been recorded
Most scikit-learn estimators do not accept missing values as input, so it is necessary to take one of the following actions
remove observations with missing data
transform them into permitted values
The goal of any imputation technique is to clean the data to produce a complete dataset that can be used to train ML models
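
Before removing or imputing anything, the missing values have to be found. The sketch below counts them per column with pandas; the list of placeholder strings treated as missing ("None", "N/A") is an assumption based on the sample table and would need adjusting for other datasets.

```python
import pandas as pd

# a made-up extract of the sample table, loaded as strings
df = pd.DataFrame({
    "Name": ["Daniel", "Alex", "Mary"],
    "Used In": ["HK", "RUS", "N/A"],
    "Age": ["22", "None", "27"],
})

placeholders = ["None", "N/A"]   # assumed markers for missing data

# a cell counts as missing if it is a real NaN or one of the placeholders
mask = df.isna() | df.isin(placeholders)
missing_per_column = mask.sum()

print(missing_per_column)
```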

Data Type Rectification

When processing data using pandas DataFrames, type mismatches might occur while the data is loaded
Normal practice is to store
Discrete variables as the int type
Continuous variables as the float type
Categorical variables as the object type
However, discrete variables can sometimes be cast (loaded) as float, e.g. when the column contains missing values
To correctly identify variable types, both data types and data values need to be inspected
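
A minimal sketch of rectifying such mismatches with pandas, on made-up values in the style of the sample table. The cleaning rule for Amount (strip "$" and thousands separators) is an assumption and would mis-handle values that use a comma as the decimal mark, such as 7475,11.

```python
import pandas as pd

df = pd.DataFrame({
    "Amount": ["$2,600.45", "$1,003.30", "20000"],   # loaded as strings
    "Age": [22.0, None, 58.0],                       # loaded as float because of the missing value
})
print(df.dtypes)

# strip the currency symbol and thousands separators, then cast to float
df["Amount"] = df["Amount"].str.replace(r"[$,]", "", regex=True).astype(float)

# the nullable integer type keeps Age discrete while still allowing a missing value
df["Age"] = df["Age"].astype("Int64")

print(df.dtypes)
```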

The Date values might be loaded as the string type, but the datetime type is more appropriate
Name   | Amount    | Date       | Issued In | Used In | Age  | Education    | Fraud?
-------|-----------|------------|-----------|---------|------|--------------|-------
Daniel | $2,600.45 | 1-Jul-2020 | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020 | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020 | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020 | JAPAN     | HK      | 64   | Graduate     | No
Adams  | 20000     | 7-Oct-2020 | AUS       | JAP     | 58   | Primary      | No
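
One way to turn such date strings into the datetime type is sketched below. It assumes only the two date styles that appear in the sample table and English month abbreviations; the helper parse_date is hypothetical.

```python
from datetime import datetime
import pandas as pd

# the two styles seen in the Date column: "1-Jul-2020" and "Nov 1, 2020"
FORMATS = ("%d-%b-%Y", "%b %d, %Y")

def parse_date(text):
    """Try each known format in turn; raise if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

raw = pd.Series(["1-Jul-2020", "4-Oct-2020", "Nov 1, 2020"])
dates = pd.to_datetime(raw.map(parse_date))

print(dates.dtype)
print(dates.dt.month.tolist())
```

Once the column has the datetime type, components such as the month or the day of the week can be extracted as new features.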

Missing Value Removal

Complete Case Analysis (CCA)
Discarding those observations where the values in any of the variables are missing
Can be applied to categorical and numerical variables
Preserves the distribution of the variables, provided the data is missing at random and only a small proportion of the data is missing
However, if data is missing across many variables, CCA may lead to the removal of a big portion of the dataset

Missing value removal involves the removal of an entire observation containing the missing value from the dataset
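Name   | Amount    | Date       | Issued In | Used In | Age  | Education    | Fraud?
-------|-----------|------------|-----------|---------|------|--------------|-------
Daniel | $2,600.45 | 1-Jul-2020 | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020 | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020 | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020 | JAPAN     | HK      | 64   | Graduate     | No
Adams  | 20000     | 7-Oct-2020 | AUS       | JAP     | 58   | Primary      | No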
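
Complete case analysis can be sketched with pandas dropna. The small DataFrame below is a made-up extract of the sample table in which the missing values have already been converted to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Daniel", "Alex", "Adrian", "Vicky"],
    "Used In": ["HK", "RUS", np.nan, "HK"],
    "Age": [22, np.nan, 25, 64],
})

# drop every observation that has a missing value in any variable
complete_cases = df.dropna()

# check how much of the dataset survives before committing to this approach
fraction_kept = len(complete_cases) / len(df)

print(complete_cases)
print(fraction_kept)
```

Here half of the observations are lost, which illustrates why CCA is usually reserved for cases where only a small proportion of the data is missing.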

Data Imputation

Imputation
Imputation is the replacement of missing values with statistical estimates of the missing values
There are multiple imputation techniques that can be deployed
The choice of imputation technique will depend on
whether the data is missing at random
the number of missing values
the machine learning model to use

Imputation techniques vary between numerical variables and categorical variables
Numerical Variable
Mean / Median imputation
Arbitrary number imputation
End of distribution imputation
Random sampling imputation
Missing value indicator augmentation
Multivariate imputation by chained equations (MICE)
Categorical Variable
Mode imputation
Random sampling imputation
Bespoke category imputation
Missing value indicator augmentation
Multivariate imputation by chained equations (MICE)
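
Three of the categorical techniques listed above can be sketched with pandas on a made-up Education column. The label "Missing" used for the bespoke category is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

education = pd.Series(["Graduate", np.nan, "Primary", "Graduate"], name="Education")

# missing value indicator: 1 where the original value was missing
indicator = education.isna().astype(int)

# mode imputation: fill with the most frequent category
mode_imputed = education.fillna(education.mode()[0])

# bespoke category imputation: fill with a dedicated label
bespoke_imputed = education.fillna("Missing")

print(indicator.tolist())
print(mode_imputed.tolist())
print(bespoke_imputed.tolist())
```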

Mean or Median Imputation
Replacing missing values with the variable's mean or median
Can only be performed on numerical variables
The mean / median is calculated from the training dataset and is then used to impute missing data in the training, test, and future datasets
Use mean imputation if the variable is normally distributed and median imputation otherwise
Mean and median imputation may distort the distribution of the original variables if there is a high percentage of missing data
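
The key point above, learning the statistic from the training data only and reusing it everywhere, can be sketched with pandas as follows. The Age values and the train/test split are made up for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Age": [22.0, np.nan, 25.0, 64.0, 58.0]})
test = pd.DataFrame({"Age": [np.nan, 43.0]})

# the median is learned from the training data only
median_age = train["Age"].median()

# the same value fills the gaps in both the training and the test data
train_imputed = train.fillna({"Age": median_age})
test_imputed = test.fillna({"Age": median_age})

print(median_age)
print(test_imputed)
```

Swapping median() for mean() gives mean imputation with the same pattern.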

Missing values can be replaced with the mean or median of the non-missing values of the feature
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
-------|-----------|-------------|-----------|---------|------|--------------|-------
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | 20000     | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
Jones  | 3,250.11  | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | 8,156.20  | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | 7475,11   | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | 500.00    | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | 7,475.11  | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns), mean / median (imputed value)

The choice between removal and the imputation techniques is guided by the resulting model accuracy
# | Imputation Technique                    | # of rows in the training dataset | Accuracy
--|-----------------------------------------|-----------------------------------|---------
1 | Dropping rows with missing values       | 392                               | 0.74489
2 | Imputing missing values with zero       | 768                               | 0.7304
3 | Imputing missing values with the mean   | 768                               | 0.7318
4 | Imputing missing values with the median | 768                               | 0.7357

References

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists
Alice Zheng & Amanda Casari, O'Reilly Media, April 2018, ISBN-13: 978-1-491-95324-2

Data Types in Statistics (https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)
Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio (https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)
Measures of Central Tendency (https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php)
Scales of Measurement and Presentation of Statistical Data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206790/)

Financial Datasets
yfinance 0.1.54, 2019 (https://pypi.org/project/yfinance/)
quandl/quandl-python, 2019 (https://github.com/quandl/quandl-python)

Public Datasets
Academic Torrents (https://academictorrents.com/browse.php?cat=6)
Awesome Public Datasets (https://github.com/awesomedata/awesome-public-datasets)
Awesome JSON Datasets (https://github.com/jdorfman/awesome-json-datasets)
Common Crawl (http://commoncrawl.org/the-data/)
DataHub Datasets (https://datahub.io/search)
Kaggle Datasets (https://www.kaggle.com/datasets)
GitHub Archive (http://www.gharchive.org/)
GitHub COCO-Stuff Datasets (https://github.com/nightrome/cocostuff)
Harvard Resources for COVID-19 (https://dataverse.harvard.edu/dataverse/2019ncov)
GitHub COVID-19 Data (https://github.com/owid/covid-19-data/tree/master/public/data/)
Coronavirus Source Data (https://ourworldindata.org/coronavirus-source-data)
OCHA Novel Coronavirus (COVID-19) Cases Data (https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases)
World Bank Open Data (https://data.worldbank.org/)
Hong Kong Government Open Data (https://data.gov.hk/en/)
US Government Open Data (https://www.data.gov/open-gov/)
Taiwan Government Open Data (https://data.gov.tw/)
Dataquest 18 Places to Find Free Data Sets for Data Science Projects (https://www.dataquest.io/blog/free-datasets-for-projects/)
Google Dataset Search (https://datasetsearch.research.google.com/)
UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php)
KDnuggets Datasets for Data Mining and Data Science (https://www.kdnuggets.com/datasets/index.html)
Google BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data/)
Google Research Datasets (https://research.google/tools/datasets/)
Microsoft Public Data Sets for Testing and Prototyping (https://docs.microsoft.com/en-us/azure/sql-database/sql-database-public-data-sets)
Amazon Web Services Registry of Open Data (https://registry.opendata.aws/)
Pathmind Open Datasets (https://pathmind.com/wiki/open-datasets)
Lionbridge 15 Best Audio Datasets for Machine Learning (https://lionbridge.ai/datasets/12-best-audio-datasets-for-machine-learning/)
Google COVID-19 Public Datasets (https://console.cloud.google.com/marketplace/details/bigquery-public-datasets/covid19-dataset-list?preview=bigquery-public-datasets)
Google Public Data (https://www.google.com/publicdata/directory?hl=en_US&dl=en_US#!)
Tableau Coronavirus (COVID-19) Global Data Tracker (https://www.tableau.com/covid-19-coronavirus-data-resources)

THANK YOU
