Information Management
Data Warehousing (thanks to Wolf- )
Decision support systems
The amount of data generated and collected every day is increasing
Decisions are taken based on information extracted from data
Bad decisions can lead to disaster
A decision support system is a set of IT solutions and tools used for extracting information from electronically stored data
Past: describe the past, describe problems, reduce costs
Future: anticipate the future, suggest corrections, increase profits
Issues in realizing a DSS: large datasets, heterogeneous information sources, history of data,
Data warehouse
Data warehousing is at the base of decision support systems
Questions we need to answer [Kimball 1996]
We have a lot of data but we cannot access it
How is it possible that different people in the same role have substantially different results?
We need to select, group, and manipulate data in any possible manner
Show me only what is important
Everybody knows that some data are not correct
Data warehouse: requirements
Accessibility also by users with limited technological background
Integration of different data sources
Flexibility of query operations
Synthesis for easy analysis
Multidimensional representation to provide an intuitive view of the data
Correctness and completeness of represented data
Data warehouse: definition (1)
Intuitively, a data warehouse is a large database
Not all large DBs are DWs, but DWs are large DBs (from a TB to several PB)
Distributed over several servers and requires high computational power
A data warehouse is a collective data repository
Contains snapshots of the operational data (history)
Populated through the Extract-Transform-Load (ETL) process
Useful for data analytics
Data warehouse: definition (2)
Compared to other solutions, a DW:
Is suitable for tactical/strategic focus
Implies a small number of transactions
Implies large transactions spanning a long period of time
Data warehouse: Inmon definition
A data warehouse is a:
Subject oriented
Integrated
Non-volatile
Time variant
collection of data in support of management's decisions (Inmon)
Subject oriented
Data are organized and centered around the concepts of interest for the company
Data elements relating to the same real-world event or object are linked together
Traditional operational DBs are organized according to applications
Examples of subjects: customers, products, orders, ...
Integrated
The DW contains data from most or all of the organization's operational systems and these data are made consistent
Example: gender, dates, measures
The realization of a DW does not imply the insertion of new data, but the reorganization of existing ones
Non-volatile
Data are never over-written nor deleted
Data are read-only
Time variant
A DW represents a sequence of snapshots of operational data, repeated periodically
Changes over the data are never lost
Time horizon
Operational system: 60-90 days
DW: 5-10 years
OLTP: On Line Transaction Processing
Traditional processing of transactions, which implements operational systems
Predefined simple operations
Each operation involves a limited amount of data
Detailed and always up-to-date information
ACID properties are crucial
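As a concrete contrast with OLAP, below is a minimal sketch of a typical OLTP operation in SQL, assuming hypothetical ORDERS and STOCK tables: a short transaction that touches a few rows and relies on the ACID guarantees just mentioned.

  -- register an order line and keep the stock consistent within one ACID transaction
  START TRANSACTION;
  INSERT INTO orders (order_id, customer_id, product_id, quantity, order_date)
  VALUES (1001, 42, 7, 3, CURRENT_DATE);
  UPDATE stock SET on_hand = on_hand - 3 WHERE product_id = 7;
  COMMIT;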
OLAP: On Line Analytical Processing
Elaboration of operations for decision support
Complex and non pre-defined operations
Each operation involves a huge amount of data
Aggregated and historical data, possibly outdated
ACID properties are not relevant (typically read-only)
OLTP and OLAP: separate environments
[Diagram] Final users run transactions through the OLTP application against the OLTP database; analysts run complex queries through the OLAP application against the data warehouse
OLTP vs. OLAP (1)
Operational Data Stores (OLTP)    Data Warehouse (OLAP)
Mostly updates                    Mostly reads
Many small transactions           Few, but complex queries
MB-TB of data                     TB-PB of data
Detailed data                     Summarized data
Final users / employees           Decision makers and analysts
Up-to-date data                   May be slightly outdated data
OLTP vs. OLAP (2)
Typical questions to be answered by OLAP
How much did sales unit A earn in January?
How much did sales unit B earn in February?
What was their combined sales amount for the first quarter?
Answering these questions with SQL queries is difficult
Complex query formulation necessary
Processing will be slow due to complex joins and multiple scans
Why can such questions be answered faster with a DW?
The DW data is a rearranged and pre-aggregated data collection
The table arrangement is subject oriented
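As an illustration, a minimal sketch of how the questions above could be expressed against the DW, assuming hypothetical SALES, DATE_DIM and STORE_DIM tables: thanks to the subject-oriented, pre-integrated arrangement, a single aggregate query with a couple of joins is enough.

  -- earnings per sales unit in the first quarter of a given year
  SELECT st.sales_unit, SUM(s.amount) AS earnings
  FROM sales s
    JOIN date_dim  d  ON s.date_id  = d.date_id
    JOIN store_dim st ON s.store_id = st.store_id
  WHERE d.year = 2019 AND d.quarter = 1
  GROUP BY st.sales_unit;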
OLAP: FASMI requirements
Fast: response times should be low, even if dealing with huge amounts of data
Analytical: should provide support for the final user to analyze data in a simple manner
Shared: data should be shared among multiple final users
Multidimensional: data should be organized according to a multidimensional model
Informational: the system should store all the information of interest
DW architectures
A DW architecture should be
Independent: separated from OLTP systems
Scalable: support a growing size of data and number of final users/analysts
Extensible: able to integrate novel applications whenever needed
Secure: enforce access control
Easy to manage
1-level architecture
The DW is virtual
Implemented as a multidimensional view over the data in the OLTP system
Generated through a middleware
Does not guarantee separation between OLTP and OLAP systems
Not used in practice
[Diagram] Operational DBs accessed through the middleware for multidimensional analysis and data mining
2-level architecture (1)
[Diagram] Operational DBs and external sources → Extract, Clean, Transform, Load → Data Warehouse → multidimensional analysis and data mining
2-level architecture (2)
Data sources: heterogeneous information sources both internal and external
Usually operational data stores that rely on relational database technology
ETL: data from different and heterogeneous sources should be
extracted
cleaned to remove inconsistencies and completed with missing parts
integrated according to a common schema
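A minimal sketch of the transform/load step in SQL, assuming hypothetical source tables SRC_CUSTOMERS_EU and SRC_CUSTOMERS_US and a target CUSTOMER_DIM table: the statement extracts from both sources, cleans the inconsistent gender encoding, and integrates the rows under a common schema.

  INSERT INTO customer_dim (customer_key, name, gender, source_region)
  SELECT customer_id, TRIM(name),
         CASE WHEN UPPER(gender) IN ('M', 'MALE')   THEN 'M'
              WHEN UPPER(gender) IN ('F', 'FEMALE') THEN 'F'
              ELSE 'U' END,                 -- 'U' marks a missing value
         'EU'
  FROM src_customers_eu
  UNION ALL
  SELECT cust_no, TRIM(full_name),
         CASE WHEN sex = '1' THEN 'M'
              WHEN sex = '2' THEN 'F'
              ELSE 'U' END,
         'US'
  FROM src_customers_us;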
2-level architecture (3)
Staging area: contains data that should be loaded in the DW from sources
Copied data are prepared through ETL
These data are not used for analysis but only for populating the DW
DW server: the DW is the logical container of collected data, which can be directly queried or used to build data marts
Cleaned raw data
Derived (aggregated) data
Metadata
Analysis: analyzes data and builds reports
2-level architecture (4)
Data mart: subset or result of aggregations of the data in the DW
Includes the subset of data of interest for a specific business area
Provides better performance
Advantages of 2-level architectures
DW is always available, even when the data sources are temporarily unavailable
Queries operating on the DW do not interfere with OLTP transactions
Data in DW and OLTP are kept independent (different data model, different data granularity, different time window)
Specific optimization techniques to support analysis
3-level architecture (1)
[Diagram] Operational DBs and external sources → Extract, Clean, Transform, Load → Reconciled Data → Data Warehouse → multidimensional analysis and data mining
3-level architecture (2)
The third (additional) level is represented by reconciled data
Reconciled data (operational data store) is the materialization of the result of the ETL process
In 2-level architectures the result of ETL is not materialized
Applications for DWs
A DW is the base repository for front-end analytics (or business cockpits)
Knowledge discovery in databases (KDD) and data mining
Results are used for
Data visualization
Reporting
Decision support system (DSS) analysts
usually have a business background
aim to define and discover information used in corporate decision-making
do not know exactly what they need
operate in an explorative manner (by subsequent refinement steps)
DW design and implementation
Top-down: analyze the whole needs of the company and design the DW in its entirety
Long design and implementation times
Complex task
Bottom-up: the DW is built incrementally by iteratively integrating different data marts
Reduces times
Reduces costs
Most widely used solution
Bottom-up DW development
Define objectives and plan
preliminary phase including feasibility analysis
Infrastructure design
analyze and compare architectural solutions and available technologies
Data mart design and development
create data mart and applications
integration with the existing system
Data mart design
Data mart modeling is based on the schemas of the underlying operational DBs
1. Analysis and reconciliation of data sources
2. Requirements analysis
3. Conceptual modeling
4. Refinement of workload and conceptual model validation
5. Logical modeling
6. Loading modeling
7. Physical modeling
Analysis and reconciliation of data sources
INPUT: operational DB schemas
OUTPUT: reconciled schema
Analyzes operational data to determine:
the schemas of data sources
useful correlations among data
portions of data of interest for the decision process
data quality
Makes different data sources homogeneous and enables their integration
Requirements analysis
INPUT: reconciled schema
OUTPUT: facts, preliminary workload
Facts are concepts of primary interest for the decisional process
For the workload, it is important to determine the granularity of facts
Fine-grained: lower performance, more precision
Coarse-grained: higher performance, lower precision
Conceptual modeling
INPUT: reconciled schema, facts, preliminary workload
OUTPUT: fact schema
The fact schema represents all the relevant information about facts that need to be considered
Measures: atomic properties of the fact that can be analyzed
Dimension: perspective for the analysis
Fact: sales
Measures: sold quantities, gain
Dimensions: product, time, place
Workload and validation
INPUT: fact schema, preliminary workload
OUTPUT: validated fact schema
Formulate queries representing the workload to determine whether the fact schema supports all of them
Logical modeling (1)
INPUT: fact schema, logical model, workload
OUTPUT: logical model of the data mart
Identify views that should be materialized
Design vertical and/or horizontal fragmentation to maximize performance
Choose the logical model between ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP)
Logical modeling (2)
ROLAP (Relational OLAP)
Relational database model, extended and adapted
Data organized in tables and analysis operations expressed as SQL statements
Access structures specifically designed to optimize analysis operations
MOLAP (Multidimensional OLAP)
Stores data directly in multidimensional form
Proprietary data structures
Loading modeling
INPUT: operational DB schemas, reconciled schema, logical model of the data mart
OUTPUT: loading procedures
Physical modeling
INPUT: logical model of the data mart, target DBMS, workload
OUTPUT: physical schema of the data mart
Identify indexes, based also on the workload and data volumes
Conceptual Modeling
Dimensional Fact Model (DFM)
The Dimensional Fact Model (DFM) is a graphical conceptual model for data mart design, intended to
support conceptual modeling
create an environment where queries can be formulated in an intuitive manner
enable requirement specification
represent a reference point for logical modeling
provide project documentation
DFM conceptual schema is a set of fact schemas that model facts, measures, dimensions, and hierarchies
A fact is a concept of interest for the decisional process
Usually models a set of events that happen within a company
Is dynamic and evolves over time
A measure is a numeric property of a fact and describes a quantitative aspect of interest for the analysis
A dimension is a property of a fact that has a finite domain and describes an aspect of the fact for the analysis
The relationship between measures and dimensions is expressed, at the extensional level (i.e., instance level) through (primary) events
A primary event is a specific occurrence of a fact
It is identified through an n-tuple with a value for each dimension
It has a value for each measure
A fact is a many-to-many relationship among dimensions
Event for fact SALE:
on 3/4/2019, store X sold 10 units of product Y with a receipt of 25 USD
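Anticipating the relational (ROLAP) representation discussed under logical modeling, this primary event could be stored as a single fact table row; a minimal sketch, with hypothetical surrogate keys for the date, store and product:

  -- one primary event = one row: a value for each dimension, a value for each measure
  INSERT INTO sale_fact (date_id, store_id, product_id, quantity, receipt_usd)
  VALUES (20190403, 1, 55, 10, 25.00);   -- 3/4/2019, store X, product Y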
Dimensional attributes are the dimensions and the discrete-valued attributes that describe them
Hierarchy is a directed tree whose nodes are dimensional attributes and whose edges model many-to-one relationships between pairs of dimensional attributes
The dimension is the root of the tree
The other nodes are its attributes
product → type: expresses the type of a product (beer is a drink)
product → brand: expresses the brand of a product
type → category: each type belongs to a category (drink is food)
Hierarchy: example
[Figure: example hierarchy of dimensional attributes for the product dimension]
Each DFM schema can be modelled as an ER schema
The fact is a relationship
Each measure is an attribute of the relationship
Each dimension is an entity
Secondary event
Given a set of dimensional attributes, each n-tuple of values identifies a secondary event
A secondary event aggregates the corresponding primary events
A secondary event has a value for each measure, summarizing the values of the same measure in the corresponding primary events
Example: aggregate all the sales in Milan for product X in January 2021
The quantity will be the sum over all the primary events with these values
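A minimal sketch of this aggregation in SQL, assuming a hypothetical SALE_FACT table joined to PRODUCT_DIM, STORE_DIM and DATE_DIM: the single returned row is the secondary event for (product X, Milan, January 2021).

  SELECT p.product, st.store_city, d.month, d.year,
         SUM(f.quantity) AS quantity      -- summarizes the corresponding primary events
  FROM sale_fact f
    JOIN product_dim p  ON f.product_id = p.product_id
    JOIN store_dim   st ON f.store_id   = st.store_id
    JOIN date_dim    d  ON f.date_id    = d.date_id
  WHERE p.product = 'X' AND st.store_city = 'Milan' AND d.month = 1 AND d.year = 2021
  GROUP BY p.product, st.store_city, d.month, d.year;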
Descriptive and optional attributes
Descriptive attribute: property of a dimensional attribute or hierarchy
Not used for aggregation
Leaf node
Optional attributes:
For some instances the attribute may not have a value
Telephone number and address are descriptive of a store
Diet is optional for a product
Convergence
Two or more attributes are connected by two or more distinct paths, all representing functional dependencies
store → storeCity → state → country
store → salesDistrict → country
Conceptual design
Starts from documentation including ER schema, relational schema,
Identify facts
For each fact
1. Build the attribute tree
2. Identify dimensions
3. Identify measures
4. Draw the fact schema
Identify facts
A fact is a concept of primary interest for the decisional process
Typically events that dynamically happen in the company
Considering an ER schema, entities that represent frequently updated archives are good candidate facts
Sales is a good candidate fact
Shop is not a good candidate fact
Build the attribute tree
Considering an ER schema and an entity F modeling a fact, the attribute tree is such that
Each vertex is an attribute of the ER schema
The root corresponds to the identifier of F
For each vertex v, the corresponding attribute functionally determines all the attributes represented by descendants of v
Build the attribute tree: example (1)
Build the attribute tree: example (2)
Prune the attribute tree
Remove dimension attributes (single nodes or entire subtrees) that are not relevant for the DW
Identify dimensions
Dimensions determine how events can be aggregated for the decisional process
Dimensions are chosen among the children of the root
The choice of dimensions is crucial as it defines the granularity of
primary events
Time is usually a dimension
Identify measures
Measures are chosen among the numeric attributes that are children of the root, or obtained by applying aggregate functions to other attributes in the tree
A fact can be empty (i.e., with no measure)
Draw the fact schema
The attribute tree is translated into a DFM schema
Logical Modeling
Multidimensional model: Cube (1)
A cube represents a fact
Each axis of the cube represents a dimension of the fact
Each cell in the cube represents a primary event
A slice of the cube represents a secondary event
[Figure: example sales cube with storeCity as one of the dimensions and quantity as the measure stored in each cell]
Multidimensional model: Cube (2)
A cube can have an arbitrary number of dimensions
Cells in a cube can be empty
Sparse cube: many empty cells
Dense cube: few empty cells
ROLAP logical modeling does not suffer from the sparsity issue
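In ROLAP only the non-empty cells exist, as rows of the fact table. A minimal sketch, assuming a hypothetical SALE_FACT table, of how the cells of a two-dimensional cube (including the aggregated ones) can be computed with the SQL:1999 GROUP BY CUBE construct supported by several relational DBMSs:

  -- each output row is a cell of the cube; empty cells simply do not appear
  SELECT product_id, store_id, SUM(quantity) AS quantity
  FROM sale_fact
  GROUP BY CUBE (product_id, store_id);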
Star schema
It includes:
A dimension table DTi for each dimension, with a primary key di and different attributes for each aggregation level
A fact table FT with a foreign key di for each dimension and an additional attribute for each measure
Star schema: example
[Figure: star schema for sales with dimension tables PRODUCT (category, department, marketingGroup, ...), DATE (day, week, month, quarter, year, holiday), STORE (storeCity, state, country, salesManager, salesDistrict), and a fact table with measures such as unitPrice and numberOfCustomers]
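A minimal sketch of this star schema in SQL DDL, under hypothetical table and key names and keeping only some of the attributes listed above:

  CREATE TABLE date_dim (
    date_id INTEGER PRIMARY KEY,
    day DATE, week INTEGER, month INTEGER, quarter INTEGER, year INTEGER, holiday CHAR(1));

  CREATE TABLE product_dim (
    product_id INTEGER PRIMARY KEY,
    product VARCHAR(50), category VARCHAR(50), department VARCHAR(50), marketing_group VARCHAR(50));

  CREATE TABLE store_dim (
    store_id INTEGER PRIMARY KEY,
    store VARCHAR(50), store_city VARCHAR(50), state VARCHAR(50), country VARCHAR(50),
    sales_district VARCHAR(50), sales_manager VARCHAR(50));

  -- fact table: one foreign key per dimension, one attribute per measure
  CREATE TABLE sale_fact (
    date_id    INTEGER REFERENCES date_dim(date_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    quantity   INTEGER,
    unit_price DECIMAL(10,2),
    number_of_customers INTEGER,
    PRIMARY KEY (date_id, product_id, store_id));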
Snowflake schema
Results from a (partial) normalization of a star schema
Prevents or reduces redundancies in dimension tables
Given a star schema, decompose one or more dimension tables DTi into multiple tables DTi,1, ..., DTi,n to remove functional dependencies
Each dimension table DTi,j includes
A primary key di,j
A subset of the attributes that functionally depend on di,j
Possibly foreign keys referencing other dimension tables DTi,k to enable the reconstruction of DTi
Snowflake schema: example
[Figure: snowflake schema for sales; STORE (salesManager, salesDistrict, cityId) references CITY (storeCity, state, country), PRODUCT (brand, brandCity, typeId) references TYPE (categoryId) which references CATEGORY (category, department, marketingGroup); the DATE dimension (day, week, month, quarter, year, holiday) and the measures (unitPrice, numberOfCustomers) are unchanged]
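A minimal sketch, under the same hypothetical naming, of how the STORE dimension of the star schema above could be decomposed in the snowflake variant: the attributes depending on the city move to a separate CITY_DIM table.

  CREATE TABLE city_dim (
    city_id INTEGER PRIMARY KEY,
    store_city VARCHAR(50), state VARCHAR(50), country VARCHAR(50));

  -- the foreign key city_id replaces the storeCity/state/country attributes of the star schema
  CREATE TABLE store_dim (
    store_id INTEGER PRIMARY KEY,
    store VARCHAR(50), sales_district VARCHAR(50), sales_manager VARCHAR(50),
    city_id INTEGER REFERENCES city_dim(city_id));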
Snowflake schema vs star schema
Snowflake schema
reduces the storage space
requires the insertion of additional foreign keys
facilitates joins including only attributes in primary dimension tables (dimension tables are smaller)
complicates joins involving attributes in secondary dimension tables (higher number of joins)
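For instance, grouping sales by country needs one join in the star schema but two in the snowflake schema; a sketch under the hypothetical tables defined above:

  -- star schema: country is an attribute of the primary dimension table STORE_DIM
  SELECT st.country, SUM(f.quantity) AS quantity
  FROM sale_fact f JOIN store_dim st ON f.store_id = st.store_id
  GROUP BY st.country;

  -- snowflake schema: country sits in the secondary table CITY_DIM, one extra join
  SELECT c.country, SUM(f.quantity) AS quantity
  FROM sale_fact f
    JOIN store_dim st ON f.store_id = st.store_id
    JOIN city_dim  c  ON st.city_id = c.city_id
  GROUP BY c.country;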
Dimensions (1)
Dimensions are the core of the multidimensional model
Dimensions are used for
Selecting data
Grouping data
Dimensions consist of dimension values
Product: milk, cream
Time: 1/1/2019, 2/1/2019
Values are possibly organized in order
Used for comparing values (e.g., income of different months)
Dimensions (2)
Dimensions encode hierarchies with different levels
Dimension values are organized in a tree or in a lattice, whose top element is ALL
Product > Type > Category
Day > Month > Quarter > Year
Granularity of the bottom level is important and has an impact on scalability
Measures (1)
A measure represents a property of a fact that users want to study and analyze
A measure has
Numerical value
Aggregation formula
A measure is additive along a dimension if the SUM operator can be used to aggregate along that dimension
Measures (2)
Additive measures
Additive along all dimensions
Refer to a timeframe, at the end of which they can be evaluated cumulatively
Example: number of products sold in a day
Semi-additive measures
Additive only along some dimensions
Evaluated at a particular time
Example: number of products in inventory (not additive on time)
Non-additive measures
Additive along none of the dimensions
Are evaluated at particular times and are relative
Example: unit price of a product
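A minimal sketch contrasting the three cases, under the hypothetical star schema above plus a hypothetical INVENTORY_FACT table: the additive quantity can be summed along any dimension, the inventory level can be summed over stores but not over time, and the non-additive unit price must be aggregated with a different operator (e.g., AVG).

  -- additive: quantities sold can be summed along every dimension
  SELECT d.month, SUM(f.quantity) AS quantity
  FROM sale_fact f JOIN date_dim d ON f.date_id = d.date_id
  GROUP BY d.month;

  -- semi-additive: inventory levels can be summed over stores for a given day ...
  SELECT i.date_id, SUM(i.units_on_hand) AS units_on_hand
  FROM inventory_fact i
  GROUP BY i.date_id;
  -- ... but summing the same measure along the time dimension would be meaningless

  -- non-additive: unit prices are never summed; use AVG (or MIN/MAX) instead
  SELECT d.month, AVG(f.unit_price) AS avg_unit_price
  FROM sale_fact f JOIN date_dim d ON f.date_id = d.date_id
  GROUP BY d.month;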
Logical design
1. Translate DFM schema into a logical (star or snowflake) schema
2. Materialize views
3. Horizontal or vertical fragmentation of fact table