Lead Research Scientist, Financial Risk Quantitative Research, SS&C Algorithmics
Adjunct Professor, University of Toronto
MIE1624H Introduction to Data Science and Analytics
Lecture 1: Introduction
University of Toronto
January 11, 2022
Lead Research Scientist, Financial Risk Quantitative Research at SS&C Algorithmics, formerly with Watson Financial Services, IBM
Ph.D. in Computer Science from McMaster University
Author of over 20 papers and reports
Adjunct professor at University of Toronto and lecturer at McMaster University
Research areas:
business analytics, operational research, optimization, finance portfolio optimization, multi-objective optimization
market and credit risk modeling and optimization
numerical methods for risk management
design of numerical algorithms and their software implementation
Profession
Choose a job you love,
and you will never have to
work a day in your life.
The only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle.
Data Science
Diagram labels: Machine Learning; Business & Domain Expertise; Analytics
What is analytics?
Analytics is the scientific process of deriving insights from data in order to make decisions
Descriptive Analytics
What has happened?
Predictive Analytics
What will happen?
Prescriptive Analytics and Artificial Intelligence
What should we do?
Business Value
Operations research
Operations Research (O.R.) is the discipline of applying advanced analytical methods to help make better decisions
Analytical techniques:
Simulation: giving you the ability to try out approaches and test ideas for improvement
Optimization: narrowing your choices to the very best when there are virtually innumerable feasible options and comparing them is difficult
Probability and Statistics: helping you measure risk, mine data to find valuable connections and insights, test conclusions, and make reliable forecasts
Mathematical Modeling: algorithms and software
Our planet is a complex, dynamic, highly interconnected $54 Trillion system-of-systems (OECD-based analysis)
This chart shows systems (not industries)
Global system-of-systems: $54 Trillion (100% of WW 2008 GDP)
Systems labeled on the chart: Communication, Transportation, Leisure / Recreation / Clothing, Electricity, Infrastructure, Healthcare, Govt. & Safety
Bubble value shown: $12.54 Tn
Legend for system inputs: Same Industry, Business Support, IT Systems, Energy Resources, Machinery, Materials
1. Size of bubbles represents systems' economic values
2. Arrows represent the strength of systems' interaction
Source: IBV analysis based on OECD
Economists estimate that all systems carry inefficiencies of up to $15 Tn, of which $4 Tn could be eliminated
Analysis of inefficiencies in the planet's system-of-systems
Global economic value of system-of-systems: $54 Trillion (100% of WW 2008 GDP)
Inefficiencies: $15 Trillion (28% of WW 2008 GDP)
Improvement potential: $4 Trillion (7% of WW 2008 GDP)
Systems labeled on the chart: Healthcare, Building & Transport Infrastructure, Education, Electricity, Food & Water, Communication, Government & Safety, Financial, Transportation (Goods & Passenger), Leisure / Recreation / Clothing
Bubble values shown (USD billions): 12,540; 4,580; 3,960; 7,800
Axes: System inefficiency as % of total economic value; Improvement potential as % of system inefficiency
How to read the chart:
For example, the Healthcare system's value is $4,270B. It carries an estimated inefficiency of 42%. From that level of 42% inefficiency, economists estimate that ~34% can be eliminated (= 34% x 42%).
Note: Size of the bubble indicates the absolute value of the system in USD billions
Source: IBM economists survey 2009; n = 480
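The Healthcare reading of the chart can be checked with a few lines of Python. This is a minimal sketch using only the three numbers quoted in the example ($4,270B, 42%, 34%); the function name and its return layout are illustrative choices, not part of the original analysis.

```python
def improvement_potential(system_value, inefficiency_rate, eliminable_share):
    """Return (inefficiency, savings, savings as a share of system value)."""
    # Inefficiency is a fraction of the system's total economic value
    inefficiency = system_value * inefficiency_rate
    # Only part of that inefficiency is estimated to be eliminable
    savings = inefficiency * eliminable_share
    return inefficiency, savings, savings / system_value


# Healthcare example from the chart: value in USD billions
inefficiency, savings, share = improvement_potential(4270.0, 0.42, 0.34)
print(f"Inefficiency: ${inefficiency:,.1f}B")
print(f"Eliminable:   ${savings:,.1f}B ({share:.1%} of system value)")
```

Under these inputs, 34% x 42% works out to roughly 14% of the Healthcare system's value.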
History of analytics
Course Outline
Course summary
Course title: Introduction to Data Science and Analytics
Course summary: The objective of the course is to learn analytical models and to survey quantitative algorithms for solving engineering and business problems. Data science, or analytics, is the process of deriving insights from data in order to make optimal decisions. It allows hundreds of companies and governments to save lives, increase profits and minimize resource usage. Considerable attention in the course is devoted to applications of computational and modeling algorithms to finance, risk management, marketing, health care, smart city projects, crime prevention, predictive maintenance, web and social media analytics, personal analytics, etc. We will show how various data science and analytics techniques, such as basic statistics, regressions, uncertainty modeling, simulation and optimization modeling, data mining and machine learning, text analytics, artificial intelligence and visualizations, can be implemented and applied using Python. Python, Tableau and Power BI are the modeling and visualization software used in this course. Practical aspects of computational models and case studies in Interactive Python are emphasized.
Course outline
Introduction to data science and analytics
Data science concepts
Application areas of quantitative modeling
Python programming, data science software
Introduction to Python
Comparison of Python, R and Matlab usage in data science
Basic statistics
Random variables, sampling
Distributions and statistical measures
Hypothesis testing
Statistics case studies in IPython
Overview of linear algebra
Linear algebra and matrix computations
Functions, derivatives, convexity
Modeling techniques, regression
Mathematical modeling process
Linear regression
Logistic regression
Regression case studies in IPython
Data visualization and visual analytics
Visual analytics
Visualizations in Python
Simulation modeling
Random number generation
Monte Carlo simulations
Simulation case studies in IPython
Optimization
Unconstrained non-linear optimization algorithms
Overview of constrained optimization algorithms
Optimization case studies in IPython
Advanced machine learning
Decision trees
Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM)
Intro to ensemble learning algorithms (Random Forests, Gradient Boosting)
Intro to neural networks
Text analytics and natural language processing
Clustering (K-means, Fuzzy C-means, Hierarchical Clustering, DBSCAN)
Dimensionality reduction
Association rules
Overview of reinforcement learning
Machine learning case studies in IPython
Introduction to Deep Learning
Mathematics of neural networks
Introduction to Deep Learning
Convolutional Neural Networks (CNN)
Assignments, projects and grading (tentative)
Assignment #1: Solving an analytics problem in Python (12%). Individual assignment.
Assignment #2: Solving an analytics problem in Python (16%). Individual assignment.
Assignment #3: Solving an analytics problem in Python (16%). Individual assignment.
Final Exam Project (24%)
Individual project.
For the final exam project you may be responsible for analyzing, computing and writing up a solution to a practical data science problem in Python. Each project must be completed individually.
Course Project: Personalized learning and course curriculum design via machine learning and data analytics in Python (20%)
Group project (groups of 7 students); the same groups as for the In-Class Presentations.
In-Class Group Presentation (12%)
Group presentations of up to 10-12 minutes are required to cover topics related to additional course materials and the course project.
Presentations need to be recorded and uploaded to Quercus. Presentations will be played during lectures and followed by online Q&A.
All assignments, projects and presentations need to be completed remotely. Presence on the UofT campus is not required for this course. You are encouraged to use online collaboration tools for group project and group presentation preparation.
Course materials and readings
Course slides by O. Romanko and D. Rosu, 2022, Quercus
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by W. McKinney, 2017 https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/
Getting Started with Data Science: Making Sense of Data with Analytics by M. Haider, 2015
Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Instagram, GitHub, and More by M. Russell and M. Klassen, 2019 https://www.amazon.ca/Mining-Social-Web-Facebook-Instagram/dp/1491985046/
Recommended Literature
Sources of Data
Data sources
Data files (demo in Python)
CSV (comma-separated values) files
Spreadsheet files, e.g., Excel or Google Spreadsheet
Databases (SQL)
Internet (demo in Python)
Web scraping
APIs
Big Data platforms and Cloud
Hadoop
Cloud (AWS, Google Cloud, Microsoft Azure, IBM Cloud)
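Two of the data sources listed above, CSV files and SQL databases, can be tried with nothing but the Python standard library. The sketch below is not the in-class demo: the CSV text, the table name `thefts` and its columns are made-up illustration data, and an in-memory SQLite database stands in for a real database server.

```python
import csv
import io
import sqlite3

# Made-up CSV content for illustration (a real file would be opened with open())
CSV_TEXT = """city,year,thefts
Toronto,2020,120
Toronto,2021,95
Hamilton,2021,40
"""


def read_csv_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"city": row["city"],
                     "year": int(row["year"]),
                     "thefts": int(row["thefts"])})
    return rows


def load_into_sqlite(rows):
    """Load rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE thefts (city TEXT, year INTEGER, thefts INTEGER)")
    conn.executemany("INSERT INTO thefts VALUES (:city, :year, :thefts)", rows)
    return conn


rows = read_csv_rows(CSV_TEXT)
conn = load_into_sqlite(rows)
# Aggregate with SQL: total thefts per city
totals = conn.execute(
    "SELECT city, SUM(thefts) FROM thefts GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

In practice the same two steps are often done with pandas (`read_csv`, `read_sql`), which the course readings cover.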
Use of data globally and in the financial sector
Use of camera phones at the Papal inauguration in 2005 and 2013
We can collect information from almost everything to make better decisions
30 billion RFID tags embedded into our world and across entire ecosystems
1 billion camera phones in existence, able to document accidents, damage, and crimes
Of new automobiles
will contain event data recorders collecting travel information
Instrumented Interconnected Intelligent
What is big data?
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.
Difficulties include capture, storage, search, sharing, analytics, and visualizing.
Source: Wikipedia
Big social data
Analytics Examples
Data reveals hidden city dynamics
Applications of big data analytics
Smarter Healthcare
Homeland Security
Manufacturing
Multi-channel sales
Log Analysis
Search Quality
Retail: Churn, NBO
Traffic Control
Trading Analytics
Fraud and Risk
Marketing analytics
Fitting room analytics
Source: Adme
IBM Debater
Modeling
Questions that we can try to answer with models
Statistics: exploratory analysis and hypothesis testing
Decisions made from samples
Hypothesis testing
Machine learning: learning from examples
Supervised learning (prediction, classification)
Unsupervised learning (clustering, dimensionality reduction, associations)
Artificial intelligence: advanced analytics
Text analytics, social media analytics, NLP
Spatio-temporal analytics
Image and visual recognition (deep learning)
Reinforcement learning and autonomous systems
Modeling uncertainty: what would happen in the future?
Monte Carlo simulations
Optimizing decisions: what's best?
Optimization
Finding connections: is FB related to Cambridge Analytica?
Graph/network models
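To make the "modeling uncertainty" question concrete, here is a minimal Monte Carlo sketch. It is an illustration under stated assumptions, not a model from the lecture: daily returns are assumed to be independent normal draws, and the mean, volatility, horizon and starting value are invented parameters.

```python
import random
import statistics


def simulate_final_values(n_paths, n_days, daily_mean, daily_vol,
                          start_value=100.0, seed=0):
    """Simulate final portfolio values under i.i.d. normal daily returns."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    finals = []
    for _ in range(n_paths):
        value = start_value
        for _ in range(n_days):
            # Compound one random daily return
            value *= 1.0 + rng.gauss(daily_mean, daily_vol)
        finals.append(value)
    return finals


# Hypothetical parameters: 0.05% mean daily return, 1% daily volatility
finals = simulate_final_values(n_paths=2000, n_days=20,
                               daily_mean=0.0005, daily_vol=0.01)
# Share of simulated scenarios that end below the starting value
prob_loss = sum(v < 100.0 for v in finals) / len(finals)
print(f"Mean final value: {statistics.mean(finals):.2f}")
print(f"Estimated probability of a loss: {prob_loss:.1%}")
```

The answer to "what would happen?" is a distribution of outcomes rather than a single number; summary statistics such as the loss probability are read off the simulated scenarios.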
Models and reality
Simplified abstraction of reality
Capture essence of problem
Calculations
From Monahan, G., Management Decision Making, Cambridge University Press, 2000
Interpretation
Real World
Analyst's World
Artificial Intelligence
Text analytics and sentiment analysis
Sentiment analysis of tweets
Natural Language Processing: features and target variable in sentiment analysis
Features (words): bear, tea, love, bad, drink
Target: sentiment
Examples (news articles) and their target values:
  All bears are lovely           56%
  Our tea was bad               -35%
  That bear drinks with bear     -5%
  The bear drinks tea             4%
  We love bears                  63%
Stop words that were removed: are, was
Natural Language Processing: bag of words based on Word Frequency (WF) and sentiment analysis
Bag of words: features (WF) and target for each example (news article)
  Example                      bear  tea  love  bad  drink  sentiment
  All bears are lovely           1    0    1     0     0       56%
  Our tea was bad                0    1    0     1     0      -35%
  That bear drinks with bear     2    0    0     0     1       -5%
  The bear drinks tea            1    1    0     0     1        4%
  We love bears                  1    0    1     0     0       63%
Supervised machine learning algorithm:
Linear regression
Decision trees
SVM regression
k-NN regression
Ensembles (random forests, XGBoost)
Artificial neural nets (deep learning)
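The word-frequency features for the five example sentences can be reproduced in a few lines. This is a rough sketch: prefix matching ("bears" counts toward "bear", "lovely" toward "love") is a crude stand-in for proper stemming and is an assumption of this example, not necessarily the preprocessing used on the slide.

```python
# Feature words taken from the slide
VOCABULARY = ["bear", "tea", "love", "bad", "drink"]

EXAMPLES = [
    "All bears are lovely",
    "Our tea was bad",
    "That bear drinks with bear",
    "The bear drinks tea",
    "We love bears",
]


def bag_of_words(sentence, vocabulary=VOCABULARY):
    """Count how many tokens match each vocabulary word (prefix match)."""
    tokens = sentence.lower().split()
    # Words outside the vocabulary, including stop words, are simply ignored
    return [sum(token.startswith(word) for token in tokens)
            for word in vocabulary]


matrix = [bag_of_words(sentence) for sentence in EXAMPLES]
for sentence, counts in zip(EXAMPLES, matrix):
    print(counts, sentence)
```

Each row of `matrix` is the feature vector for one example; paired with the sentiment targets, these rows are what a supervised algorithm such as linear regression would be trained on.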
Natural Language Processing: word frequency (Word Cloud)
Word Cloud about Toyota
Neural networks and deep learning
Based loosely on computer models of how brains work
Model is an assembly of inter-connected neurons (nodes) and weighted links
Each neuron applies a nonlinear function to its inputs to produce an output
The output node sums its input values, weighted by the weights of its links
Used for classification, pattern recognition, speech recognition
The model has no explanatory power; the results are very hard to interpret
Training an ANN means learning the weights of the neurons
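The description above (weighted links, a nonlinear function in each neuron, a weighted sum at the output node) corresponds to a forward pass like the one sketched below. The sigmoid activation, the network size and the weight values are assumptions chosen for illustration; training, that is, learning the weights, is not shown.

```python
import math


def sigmoid(z):
    """A common nonlinear activation; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0):
    """Apply a nonlinear function to the weighted sum of the inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_weights, output_weights):
    """One hidden layer, then an output node that sums its weighted inputs."""
    hidden = [neuron(inputs, weights) for weights in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))


# Hypothetical weights: two inputs, two hidden neurons, one output
hidden_weights = [[0.3, -0.2], [0.5, 0.1]]
output_weights = [1.0, 1.0]
y = forward([0.0, 0.0], hidden_weights, output_weights)
print(y)
```

With all-zero inputs each hidden neuron outputs sigmoid(0) = 0.5, so the output is 0.5 + 0.5 = 1.0 regardless of the hidden weights, which is a handy sanity check.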
Diagram: a network with an Input Layer, a Hidden Layer, and an Output Layer producing y
Spatio-temporal analytics: car theft hotspots
Source: Booz Allen Hamilton, Field Guide to Data Science
IBM Cloud: all services
http://cloud.ibm.com
Click Catalog at the top of the dashboard
IBM Cloud: Watson AI / ML services
IBM Cloud: Speech to Text service
Copy API Key to your Python code
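A sketch of how the API key might be used from Python follows. It assumes the `ibm-watson` package is installed and that you have the service URL from your Speech to Text instance page. Instead of pasting the key into the source file, this example reads it from an environment variable; the variable name `IBM_STT_APIKEY` and both function names are hypothetical choices for this example.

```python
import os


def load_api_key(env_var="IBM_STT_APIKEY"):
    """Read the API key from an environment variable (name is hypothetical)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def transcribe(audio_path, api_key, service_url):
    """Send a WAV file to Watson Speech to Text and return the transcript."""
    # Imported here so the rest of the script works without the SDK installed
    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    stt = SpeechToTextV1(authenticator=IAMAuthenticator(api_key))
    stt.set_service_url(service_url)
    with open(audio_path, "rb") as audio:
        result = stt.recognize(audio=audio,
                               content_type="audio/wav").get_result()
    # Join the best alternative of every recognized segment
    return " ".join(segment["alternatives"][0]["transcript"].strip()
                    for segment in result["results"])
```

A call would look like `transcribe("sample.wav", load_api_key(), service_url)`, where `sample.wav` and `service_url` are placeholders for your own file and instance URL.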
Case study: Watson Speech-to-Text, Natural Language Understanding and Text-to-Speech prototype on IBM Cloud
Crowd-Sourced Analytics
Bellingcat open source investigations
Source: Bellingcat, Russia's War in Ukraine: The Medals and Treacherous Numbers
The number of "For Distinction in Combat" medals awarded between 07.11.2014 and 18.02.2016 is 4300, which strongly suggests larger combat operations with active Russian military involvement in this period. In sum, the data suggest that more than 10000 medals of all four considered types were awarded in the considered period.
Online and In-Person Education
Coursera (coursera.org)
EdX (edx.org)
CognitiveClass MOOC (http://CognitiveClass.ai)
Free courses, free study materials
Cloud-based sandbox for exercises
2,000,000+ registered students
60+ courses
TED and TEDx
Register for free 6-month access with your email @mail.utoronto.ca
https://www.datacamp.com/groups/shared_links/9213435974cadaa4336fc2d5728cb94d5a0958d02a5b935ec2e56fa3544838a7
To Do before Lecture 2
Run IPython examples provided in class
Install Python on your laptop
Python version 3.X is recommended
You may use your own Python distribution; the Anaconda distribution is recommended: https://www.anaconda.com/products/individual
Use Python on cloud via Google Colab
You can use Python on Google cloud via https://colab.research.google.com
Use Python on cloud via IBM CognitiveClass.ai Virtual Lab (optional)
Register for the CognitiveClass.ai MOOC portal https://cognitiveclass.ai to access 60+ free data science courses and to use Python on the CC cloud
You can use Python on CC cloud via https://labs.cognitiveclass.ai
Get access to IBM Cloud (optional)
Sign in to the IBM Academic Initiative and register for access to IBM Cloud, or get free lite access to IBM Cloud directly at https://cloud.ibm.com/registration
Check class web-page on Quercus
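Once Python is installed, a quick self-check can confirm that the setup matches the steps above. This is a minimal sketch: the package list (`numpy`, `pandas`, `matplotlib`) is an assumption based on the course readings, not an official requirements list.

```python
import importlib.util
import sys


def python_is_3x(version_info=sys.version_info):
    """True when the interpreter is some Python 3.X release."""
    return version_info[0] == 3


def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0],
      "OK" if python_is_3x() else "is not 3.X, please upgrade")
# Assumed package list; Anaconda ships all three by default
missing = missing_packages(["numpy", "pandas", "matplotlib"])
print("Missing packages:", ", ".join(missing) if missing else "none")
```

The same script can be pasted into a Google Colab cell to compare the cloud environment with a local install.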