Information Management
Data Warehousing (thanks to Wolf- )
Decision support systems
The amount of data generated and collected every day is increasing
Decisions are taken based on information extracted from data
Bad decisions can lead to disaster
A decision support system is a set of IT solutions and tools used for extracting information from electronically stored data
Past: describe the past, describe problems, reduce costs
Future: anticipate the future, suggest corrections, increase profits
Issues in realizing a DSS: large datasets, heterogeneous information sources, history of data,
Data warehouse
Data warehousing is at the base of decision support systems
Questions we need to answer [Kimball 1996]
We have a lot of data, but we cannot access it
How is it possible that different people in the same role have substantially different results?
We need to select, group, and manipulate data in any possible manner
Show me only what is important
Everybody knows that some data are not correct
Data warehouse: requirements
Accessibility also by users with limited technological background
Integration of different data sources
Flexibility of query operations
Synthesis for easy analysis
Multidimensional representation to provide an intuitive view of the data
Correctness and completeness of represented data
Data warehouse: definition (1)
Intuitively, a data warehouse is a large database
Not all large DBs are DWs, but DWs are large DBs (from a TB to several PB)
Distributed over several servers and requiring high computational power
A data warehouse is a collective data repository
Contains snapshots of the operational data (history)
Populated through the Extract-Transform-Load (ETL) process
Useful for data analytics
Data warehouse: definition (2)
Compared to other solutions, it
Is suitable for a tactical/strategic focus
Implies a small number of transactions
Implies large transactions spanning a long period of time
Data warehouse: Inmon definition
A data warehouse is a:
Subject oriented
Integrated
Non-volatile
Time variant
collection of data in support of management's decisions (Inmon)
Subject oriented
Data are organized and centered around the concepts of interest for the company
Data elements relating to the same real-world event or object are linked together
Traditional operational DBs are organized according to applications
Examples of subjects: customers, products, orders
Integrated
The DW contains data from most or all of the organization's operational systems, and these data are made consistent
Examples: gender encodings, date formats, units of measure (see the sketch below)
The realization of a DW does not imply the insertion of new data, but the reorganization of existing ones
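A minimal sketch of such integration in SQL, assuming two hypothetical source tables (crm_customers and billing_customers) that encode gender and birth dates differently; the conversion functions depend on the actual DBMS:

  -- Map both sources to a single convention for gender and dates before loading the DW
  SELECT customer_id,
         CASE gender WHEN 'M' THEN 'MALE' WHEN 'F' THEN 'FEMALE' END AS gender,
         CAST(birth_date AS DATE)                                    AS birth_date
  FROM   crm_customers                 -- stores 'M'/'F' and ISO dates
  UNION ALL
  SELECT customer_id,
         CASE sex WHEN 0 THEN 'MALE' WHEN 1 THEN 'FEMALE' END        AS gender,
         TO_DATE(dob, 'DD/MM/YYYY')                                  AS birth_date
  FROM   billing_customers;            -- stores 0/1 and day/month/year strings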
Non-volatile
Data are never over-written nor deleted
Data are read-only
Time variant
A DW represents a sequence of snapshots of operational data, repeated periodically
Changes over the data are never lost
Time horizon
Operational system: 60-90 days
DW: 5-10 years
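A minimal sketch of how non-volatility and time variance might be realized, assuming hypothetical tables operational_account and dw_account_balance; every periodic load appends a dated snapshot instead of overwriting previous data:

  -- Operational side: balances are updated in place, so history is lost there
  -- DW side: each load appends a snapshot tagged with its date; rows are never
  -- updated or deleted afterwards
  INSERT INTO dw_account_balance (snapshot_date, account_id, balance)
  SELECT CURRENT_DATE, account_id, balance
  FROM   operational_account;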
OLTP: On Line Transaction Processing
Traditional processing of transactions, as performed by operational systems
Predefined simple operations
Each operation involves a limited amount of data
Detailed and always up-to-date information
ACID properties are crucial
OLAP: On Line Analytical Processing
Processing of operations for decision support
Complex, non-predefined operations
Each operation involves a huge amount of data
Aggregated and historical data, possibly outdated
ACID properties are not relevant (typically read-only)
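An illustrative contrast between the two kinds of operations, using hypothetical tables (account and dw_sales):

  -- OLTP: a small, predefined transaction touching a single row
  UPDATE account SET balance = balance - 100 WHERE account_id = 42;

  -- OLAP: a complex, ad hoc read scanning years of history and aggregating many rows
  SELECT product_id,
         EXTRACT(YEAR FROM sale_date) AS sale_year,
         SUM(amount)                  AS revenue
  FROM   dw_sales
  GROUP  BY product_id, EXTRACT(YEAR FROM sale_date);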
OLTP and OLAP: separate environments
[Diagram: final users issue transactions through the OLTP application against the OLTP database; analysts issue complex queries through the OLAP application against the data warehouse]
OLTP vs. OLAP (1)
Operational Data Stores (OLTP) vs. Data Warehouse (OLAP):
Mostly updates vs. mostly reads
Many small transactions vs. few, but complex queries
MB-TB of data vs. TB-PB of data
Detailed, raw data vs. summarized data
Final users / employees vs. decision makers and analysts
Up-to-date data vs. possibly slightly outdated data
OLTP vs. OLAP (2)
Typical questions to be answered by OLAP
How much did sales unit A earn in January?
How much did sales unit B earn in February?
What was their combined sales amount for the first quarter?
Answering these questions with SQL queries is difficult
Complex query formulation necessary
Processing will be slow due to complex joins and multiple scans
Why can such questions be answered faster with a DW?
DW data is a rearranged and pre-aggregated data collection (see the example query below)
The table arrangement is subject oriented
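For illustration, over a hypothetical pre-aggregated DW table dw_monthly_sales(sales_unit, sale_month, quarter, revenue), all three questions reduce to one simple query:

  -- Earnings per sales unit and month for the first quarter
  SELECT sales_unit, sale_month, SUM(revenue) AS revenue
  FROM   dw_monthly_sales
  WHERE  quarter = 'Q1'
  GROUP  BY sales_unit, sale_month;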
OLAP: FASMI requirements
Fast: response times should be low, even if dealing with huge amounts of data
Analytical: should provide support for the final user to analyze data in a simple manner
Shared: data should be shared among multiple final users
Multidimensional: data should be organized according to a multidimensional model
Informational: the system should store all the information of interest
DW architectures
A DW architecture should be
Independent: separated from OLTP systems
Scalable: support a growing size of data and number of final users/analysts
Extensible: able to integrate novel applications whenever needed
Secure: enforce access control
Easy to manage
1-level architecture
The DW is virtual
Implemented as a multidimensional view over the data in the OLTP systems (sketched below)
Generated through a middleware
Does not guarantee separation between OLTP and OLAP systems
Not used in practice
[Diagram: the operational DBs are accessed through a middleware that exposes a virtual multidimensional view to analysis and data mining applications]
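A minimal sketch of such a virtual view, assuming hypothetical operational tables orders, products and stores; the view is computed on demand, so every query still hits the OLTP system:

  -- Virtual "DW": a multidimensional view defined directly over operational tables
  CREATE VIEW v_sales_cube AS
  SELECT p.category,
         s.city,
         o.order_date,
         SUM(o.quantity) AS quantity
  FROM   orders o
         JOIN products p ON p.product_id = o.product_id
         JOIN stores   s ON s.store_id   = o.store_id
  GROUP  BY p.category, s.city, o.order_date;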
2-level architecture (1)
[Diagram: operational DBs and external sources feed an Extract-Clean-Transform-Load process that populates the data warehouse, which in turn serves multidimensional analysis and data mining]
2-level architecture (2)
Data sources: heterogeneous information sources both internal and external
Usually operational data stores that rely on relational database technology
ETL: data from different and heterogeneous sources should be
extracted
cleaned to remove inconsistencies and to fill in missing parts
integrated according to a common schema (sketched below)
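A compressed sketch of the ETL flow in plain SQL, assuming hypothetical source, staging and DW tables (src_orders, stg_orders, dw_sales_fact and its dimensions); real systems typically use dedicated ETL tools:

  -- Extract: copy source rows into the staging area
  INSERT INTO stg_orders SELECT * FROM src_orders;

  -- Clean: remove inconsistencies and fill in missing parts
  UPDATE stg_orders SET country = 'UNKNOWN' WHERE country IS NULL;
  DELETE FROM stg_orders WHERE quantity <= 0;

  -- Transform and load: integrate under the common DW schema and load the fact table
  INSERT INTO dw_sales_fact (product_key, store_key, date_key, quantity, gain)
  SELECT p.product_key, s.store_key, d.date_key, o.quantity, o.revenue - o.cost
  FROM   stg_orders o
         JOIN dim_product p ON p.source_id     = o.product_id
         JOIN dim_store   s ON s.source_id     = o.store_id
         JOIN dim_date    d ON d.calendar_date = o.order_date;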
2-level architecture (3)
Staging area: contains data that should be loaded in the DW from sources
Copied data are prepared through ETL
These data are not used for analysis but only for populating the DW
DW server: the DW is the logical container of collected data, which can be directly queried or used to build data marts
Cleaned raw data
Derived (aggregated) data
Metadata
Analysis: analyzes data and builds reports
2-level architecture (4)
Data mart: subset or result of aggregations of the data in the DW
Includes the subset of data of interest for a specific business area (sketched below)
Provides better performance
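A minimal sketch of a data mart derived from the DW, assuming the hypothetical dw_sales_fact, dim_date and dim_product tables; the mart keeps only the aggregated view one business area needs:

  -- Monthly sales per product category, derived from the DW for one business area
  CREATE TABLE dm_monthly_sales AS
  SELECT d.cal_year, d.cal_month, p.category,
         SUM(f.quantity) AS quantity,
         SUM(f.gain)     AS gain
  FROM   dw_sales_fact f
         JOIN dim_date    d ON d.date_key    = f.date_key
         JOIN dim_product p ON p.product_key = f.product_key
  GROUP  BY d.cal_year, d.cal_month, p.category;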
Advantages of 2-level architectures
DW is always available, even when the data sources are temporarily unavailable
Queries operating on the DW do not interfere with OLTP transactions
Data in DW and OLTP are kept independent (different data model, different data granularity, different time window)
Specific optimization techniques to support analysis
3-level architecture (1)
[Diagram: operational DBs and external sources are extracted, cleaned, and transformed into a reconciled data layer, which is then loaded into the data warehouse serving multidimensional analysis and data mining]
3-level architecture (2)
The third (additional) level is represented by reconciled data
Reconciled data (operational data store) is the materialization of the result of the ETL process
In 2-level architectures the result of ETL is not materialized
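A minimal sketch of the difference, reusing the hypothetical staging table from the ETL sketch above:

  -- 3-level architecture: the cleaned and integrated data are kept as their own layer
  CREATE TABLE reconciled_orders AS
  SELECT * FROM stg_orders WHERE quantity > 0;
  -- The DW load then reads from reconciled_orders instead of directly from the
  -- staging/source data, and the reconciled layer remains available on its own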
Applications for DWs
A DW is the base repository for front-end analytics (or business cockpits)
Knowledge discovery in databases (KDD) and data mining
Results are used for:
Data visualization
Reporting
Decision support system (DSS) analysts
usually have a business background
aim to define and discover information used in corporate decision-making
do not know exactly what they need
operate in an explorative manner (by subsequent refinement steps)
DW design and implementation
Top-down: analyze the overall needs of the company and design the DW in its entirety
Long design and implementation times
Complex task
Bottom-up: the DW is built incrementally by iteratively integrating different data marts
Reduces times
Reduces costs
Most widely used solution
Bottom-up DW development
Define objectives and plan
preliminary phase including feasibility analysis
Infrastructure design
analyze and compare architectural solutions and available technologies
Data mart design and development
create data mart and applications
integration with the existing system
Data mart design
Data mart modeling is based on the schemas of the underlying operational DBs
1. Analysis and reconciliation of data sources
2. Requirements analysis
3. Conceptual modeling
4. Refinement of workload and conceptual model validation
5. Logical modeling
6. Loading modeling
7. Physical modeling
Analysis and reconciliation of data sources
INPUT: operational DB schemas
OUTPUT: reconciled schema
Analyzes operational data to determine:
the schemas of data sources
useful correlations among data
portions of data of interest for the decision process
data quality
Makes different data sources homogeneous and enables their integration
Requirements analysis
INPUT: reconciled schema
OUTPUT: facts, preliminary workload
Facts are concepts of primary interest for the decisional process
For the workload, it is important to determine the granularity of facts (see the sketch below)
Fine-grained: lower performance, more precision
Coarse-grained: higher performance, lower precision
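A sketch of the trade-off with two hypothetical fact tables at different granularities:

  -- Fine-grained: one row per receipt line (more precise, larger, slower to query)
  CREATE TABLE sales_fact_by_receipt_line (
      receipt_id  INTEGER,
      product_key INTEGER,
      store_key   INTEGER,
      date_key    INTEGER,
      quantity    INTEGER,
      gain        DECIMAL(10,2)
  );

  -- Coarse-grained: one row per product, store and day (smaller, faster, less precise)
  CREATE TABLE sales_fact_by_day (
      product_key INTEGER,
      store_key   INTEGER,
      date_key    INTEGER,
      quantity    INTEGER,
      gain        DECIMAL(10,2)
  );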
Conceptual modeling
INPUT: reconciled schema, facts, preliminary workload
OUTPUT: fact schema
The fact schema represents all the relevant information about facts that need to be considered
Measures: atomic properties of the fact that can be analyzed
Dimensions: perspectives for the analysis
Fact: sales
Measures: sold quantities, gain
Dimensions: product, time, place
Workload and validation
INPUT: fact schema, preliminary workload
OUTPUT: validated fact schema
Formulate queries representing the workload to determine whether the fact schema supports all of them
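For instance, a representative workload query ("gain per product category and month for a given year") can be written against a hypothetical relational rendering of the sales fact schema to check that all the measures and dimensions it needs are present:

  -- Example workload query used to validate the fact schema
  SELECT p.category, d.cal_month,
         SUM(f.quantity) AS quantity,
         SUM(f.gain)     AS gain
  FROM   dw_sales_fact f
         JOIN dim_product p ON p.product_key = f.product_key
         JOIN dim_date    d ON d.date_key    = f.date_key
  WHERE  d.cal_year = 2023              -- example parameter
  GROUP  BY p.category, d.cal_month;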
Logical modeling (1)
INPUT: fact schema, logical model, workload
OUTPUT: logical model of the data mart
Identify views that should be materialized (an example follows this slide)
Design vertical and/or horizontal fragmentation to maximize performance
Choose the logical model between ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP)
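A minimal sketch of a materialized view over the hypothetical sales fact table; the exact syntax and refresh options vary by DBMS (the form below follows PostgreSQL/Oracle-style CREATE MATERIALIZED VIEW):

  -- Pre-computed monthly aggregate, materialized so that frequent workload queries
  -- do not need to scan the full fact table
  CREATE MATERIALIZED VIEW mv_sales_by_month AS
  SELECT d.cal_year, d.cal_month, f.product_key,
         SUM(f.quantity) AS quantity,
         SUM(f.gain)     AS gain
  FROM   dw_sales_fact f
         JOIN dim_date d ON d.date_key = f.date_key
  GROUP  BY d.cal_year, d.cal_month, f.product_key;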
Logical modeling (2)
ROLAP (Relational OLAP)
Relational database model, extended and adapted
Data organized in tables and analysis operations expressed as SQL instructions (see the star-schema sketch below)
Access structures specifically designed to optimize analysis operations
MOLAP (Multidimensional OLAP)
Stores data directly in multidimensional form
Proprietary data structures
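A minimal ROLAP sketch of the sales example as a star schema (hypothetical names); the fact table references one dimension table per analysis perspective:

  CREATE TABLE dim_product (
      product_key   INTEGER PRIMARY KEY,
      name          VARCHAR(100),
      category      VARCHAR(50)
  );

  CREATE TABLE dim_store (
      store_key     INTEGER PRIMARY KEY,
      city          VARCHAR(50),
      country       VARCHAR(50)
  );

  CREATE TABLE dim_date (
      date_key      INTEGER PRIMARY KEY,
      calendar_date DATE,
      cal_month     INTEGER,
      cal_year      INTEGER
  );

  -- Fact table: one row per fact occurrence, with the measures and one foreign key
  -- per dimension
  CREATE TABLE dw_sales_fact (
      product_key   INTEGER REFERENCES dim_product,
      store_key     INTEGER REFERENCES dim_store,
      date_key      INTEGER REFERENCES dim_date,
      quantity      INTEGER,
      gain          DECIMAL(10,2)
  );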
Loading modeling
INPUT: operational DBs schemas, reconciled schema, logical model of the data mart
OUTPUT: loading procedures
Physical modeling
INPUT: logical model of the data mart, target DBMS, workload
OUTPUT: physical schema of the data mart
Identify indexes, based also on the workload and data volumes
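A minimal sketch of such physical design choices, assuming the hypothetical star schema above; plain B-tree indexes use standard syntax, while bitmap indexes (often preferred on low-cardinality dimension keys in DWs) require DBMS-specific syntax such as Oracle's CREATE BITMAP INDEX:

  -- Indexes on the fact table's dimension keys, chosen from the expected workload
  CREATE INDEX ix_sales_date    ON dw_sales_fact (date_key);
  CREATE INDEX ix_sales_product ON dw_sales_fact (product_key, store_key);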