Tree-based indexes (3)
B-trees and B+-trees support searches along one dimension only
Not suited for the multi-dimensional data stored in DWs
Multidimensional trees are an evolution of B-trees for multidimensional data
Describe sets of points through geometric regions containing clusters of points
Clusters are used for search operations
Clusters are organized in a hierarchical structure
R-tree (1)
Each data item is seen as a point in an n-dimensional space
A dimension for each dimensional attribute
The coordinates are the values of the data item for each dimension
Data are organized in Minimum Bounding Rectangles (MBRs) depending on their position in the n-dimensional space
Each node corresponds to an MBR bounding its children
Leaf nodes point to database objects
Note: MBRs may overlap
R-tree (2)
R-tree of order (m, M), with m <= M/2
Each internal node has between m and M children
Each leaf node includes between m and M (pointers to) data items
The root has at least two children (unless it is a leaf node)
All the leaf nodes are at the same level
R-tree: example
R-tree Search
The search operates similarly to B+-trees
Given a query rectangle Q, starting from the root node:
Find the child (children, resp.) of the current node intersecting with Q
Recursively visit the selected child (children, resp.)
When reaching a leaf node, return the data items of interest
Could visit the whole tree
No performance guarantee
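A minimal sketch of this recursive search, assuming illustrative node objects with is_leaf, mbr, children and entries attributes (these names are not from the slides):

```python
# Minimal sketch of an R-tree range search; Node, mbr, children and entries
# are illustrative names, not part of any specific library.

def intersects(a, b):
    # Each MBR is ((xmin, ymin), (xmax, ymax)); rectangles overlap iff
    # they overlap on every axis.
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def search(node, query):
    """Return all data items whose MBR intersects the query rectangle."""
    results = []
    if node.is_leaf:
        # Leaf entries point to database objects together with their MBRs
        results.extend(item for item, mbr in node.entries if intersects(mbr, query))
    else:
        # Several children may intersect Q, so every matching subtree is visited
        for child in node.children:
            if intersects(child.mbr, query):
                results.extend(search(child, query))
    return results
```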
R-tree Search: example (1)
R-tree Search: example (2)
R-tree Insert (1)
When inserting a new data item d, it is necessary to
choose an MBR in a leaf where d can be inserted
If d falls in an MBR in a leaf, insert d into that MBR
If d does not fall in any MBR in a leaf
insert d into the MBR that causes the smallest volume growth
reduces dead space
minimizes the probability of unnecessary intersections in query evaluation
heuristic approach to choose such an MBR (e.g., choose the MBR with the smallest overall volume)
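A minimal sketch of the smallest-volume-growth heuristic in two dimensions (area plays the role of volume); node and attribute names are illustrative, not from the slides:

```python
# Sketch of the choose-leaf step for insertion: descend into the child whose
# MBR needs the smallest area (volume) enlargement to cover the new point;
# ties are broken by the smallest current area.

def area(mbr):
    (x1, y1), (x2, y2) = mbr
    return (x2 - x1) * (y2 - y1)

def enlarge(mbr, point):
    # Smallest MBR covering both the original MBR and the new point
    (x1, y1), (x2, y2) = mbr
    px, py = point
    return (min(x1, px), min(y1, py)), (max(x2, px), max(y2, py))

def choose_leaf(node, point):
    while not node.is_leaf:
        node = min(
            node.children,
            key=lambda c: (area(enlarge(c.mbr, point)) - area(c.mbr), area(c.mbr)),
        )
    return node
```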
R-tree Insert (2)
If the MBR is full (i.e., already includes M data items)
the rectangle must be split
the parent node must be updated
splits can propagate recursively up to the root, possibly causing a growth in the height of the tree
R-tree Insert: example
Splitting a node
The goal of splitting an MBR is to limit overlaps, so as to minimize the need to traverse both of the resulting nodes
Not a simple task!
[Figure: example of a good split vs. a bad split]
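As a rough illustration, a candidate split can be scored by the overlap area of the two resulting MBRs (smaller is better); this is only a sketch of the criterion, not a complete split algorithm:

```python
# Sketch: compare two candidate ways of partitioning the entries of a full
# node by the overlap area of the two resulting MBRs (less overlap = better).

def mbr_of(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def overlap_area(a, b):
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    w = min(ax2, bx2) - max(ax1, bx1)
    h = min(ay2, by2) - max(ay1, by1)
    return w * h if w > 0 and h > 0 else 0.0

def split_quality(group1, group2):
    # Lower value = better split: the two resulting MBRs overlap less
    return overlap_area(mbr_of(group1), mbr_of(group2))
```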
R-tree Delete
Search the leaf storing the data item d to be deleted
Delete d from the MBR
If the MBR includes fewer than m data items, condense the MBR (i.e., erase the node and re-insert its data items)
R-tree Delete: example
R-tree Update
If a data item d is updated, the boundaries of the MBR where it is located might change
The data item should be removed and re-inserted
Notes on R-tree
The R-tree that maximizes search efficiency is the one minimizing overlaps and dead spaces
Improvements over R-trees
R+-trees: no overlap between MBRs
R*-trees: improve the split algorithm
X-trees: scalability up to 20 dimensions
Bitmap indexes
A bitmap is an array data structure where each cell stores a bit (0 or 1)
Bitmap indexes are mainly used in data warehouses because data are not frequently updated
Considering an attribute:
Different binary configurations of the bitmap represent different domain values
Smaller domains imply fewer bitmap vectors, hence a smaller index
Space occupied by the index
(number of distinct values) × (number of rows) × 1 bit
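For instance, with illustrative figures, an attribute with 100 distinct values over 1,000,000 fact rows occupies 100 × 1,000,000 bits = 100 Mbit ≈ 12.5 MB before compression.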
Bitmap indexes: example
[Figure: Geography (dimension) table, Sales (fact) table, and bitmap index on Sales]
Bitmap on the fact table for dimension Geography over attribute Shop
6 bits, one per row of table Sales
The bit is 1 if the row of Sales refers to that Shop, 0 otherwise
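A minimal sketch of how such an index can be built, one bit vector per distinct Shop value; the column contents below are invented for illustration and do not reproduce the table of the slide:

```python
# Sketch: building a bitmap index over one attribute. Each distinct value of
# the (hypothetical) Shop column gets a bit vector with one bit per fact row.

shop_column = ["P&C", "Saturn", "P&C", "MediaWorld", "Saturn", "P&C"]

bitmap_index = {}
for row, value in enumerate(shop_column):
    bitmap_index.setdefault(value, [0] * len(shop_column))[row] = 1

# e.g., bitmap_index["P&C"] == [1, 0, 1, 0, 0, 1]
```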
Bitmap indexes: removal
If a record is removed from the fact table, the number of bits in the bitmap does not change
Its value is set to 0 in all the bitmaps
Bitmap indexes: insertion
If a record is inserted into the fact table, the number of bits in the bitmap increases by one
Its bit is set to 0 in all the bitmaps already in the index, and to 1 in the bitmap for its value (created if not yet present)
Bitmap indexes: query evaluation
Bitmap indexes are efficient for the evaluation of AND, OR, XOR operations (and the corresponding set intersection/union/difference)
Useful especially when selectivity is high
Example: select the sum spent at P&C or Saturn
Bitmap = 001000 ∨ 100010 = 101010
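A minimal sketch of this evaluation as a bitwise OR of the two bit vectors from the example:

```python
# Sketch: evaluating "Shop = P&C OR Shop = Saturn" as a bitwise OR of the two
# bitmaps shown above (001000 and 100010).

pc     = [0, 0, 1, 0, 0, 0]
saturn = [1, 0, 0, 0, 1, 0]

selected = [a | b for a, b in zip(pc, saturn)]   # -> [1, 0, 1, 0, 1, 0], i.e., 101010

# Rows of the fact table whose bit is 1 are the ones to aggregate (sum the sales)
matching_rows = [i for i, bit in enumerate(selected) if bit]
```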
Bitmap indexes: pros and cons
Advantages
Operations are efficient and easy to implement
Disadvantages
Bitmaps are sparse (many 0s)
Solution: adopt compression techniques
For each new attribute value, a new bitmap vector is needed
Solution: multi-component bitmaps
Not useful for range queries
Solution: range-encoded bitmaps
Join optimization
Due to the structure of data warehouses, the evaluation of queries often requires the execution of join operations
Mainly join between the fact table and dimension tables
Joins are expensive operations and choosing a different order or algorithm for their evaluation can make the difference
In OLTP, try to avoid joins between tables with no common attributes
In DWs cross products are quite frequent
Use adequate index structures to reduce their cost
Use of materialized views
Materialized views can reduce query evaluation costs (e.g., pre- computing aggregates or join results)
To this aim, it is necessary to integrate a materialized view M into a query Q, obtaining a query Q', in such a way that Q' is equivalent to Q (i.e., is guaranteed to produce the same result)
Selection conditions on M cannot be more restrictive than in Q
The projection in Q should be a subset of the projection in M
Selection conditions in Q should be possible on M
Aggregates in Q should be computable over M
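A deliberately simplified sketch of these checks, assuming a query or view is described only by a set of projected attributes and a set of equality selection predicates (real view matching and rewriting are considerably more involved):

```python
# Very simplified sketch of the compatibility checks listed above; the
# dictionary representation of queries/views is purely illustrative.

def can_answer(query, view):
    # 1) every attribute the query projects must be available in the view
    if not query["projection"] <= view["projection"]:
        return False
    # 2) the view must not filter out rows the query needs: every selection
    #    applied by the view must also appear in the query
    if not view["selection"] <= query["selection"]:
        return False
    # 3) the remaining query selections must be expressible over the view,
    #    i.e., refer only to attributes the view keeps
    remaining = query["selection"] - view["selection"]
    return {attr for attr, _ in remaining} <= view["projection"]

q = {"projection": {"shop"}, "selection": {("city", "Milan")}}
v = {"projection": {"shop", "city"}, "selection": set()}
print(can_answer(q, v))  # True under these simplifying assumptions
```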
Exercise on bitmap indexes
Build the bitmap index for attribute PRODUCT and the bitmap index for attribute CITY.
Write the condition operating on bitmap indexes to filter sales in Milan for products P1 and P3.
[Figure: Sales (fact) table]
ETL: Extract Transform Load (1)
Three database functions that are combined into one tool to pull data out of operational DBs and place it into the DW
Migrate data from one database to another, to form data marts and data warehouses
Convert databases from one format or type to another
When should we ETL?
Periodically (e.g., every night, every week) or after significant events
Refresh policy set by the administrator based on user needs and traffic
Possibly different policies for different sources
Rarely, on every update (real-time DW)
ETL: Extract Transform Load (2)
Integrate heterogeneous systems (different DBMS, operating system, hardware, communication protocols, etc.)
ETL challenges
Getting the data from the source to target as fast as possible
Allow recovery from failure without restarting the whole process
Data managed by ETL tools are in the staging area
Cannot be analyzed and used for reporting
Can only be used to feed the DW
ETL is often underestimated and is time-consuming
Data staging area
Area where data are stored during the ETL process
Data in the staging area cannot be accessed/queried/analyzed
Supports sequential operations on large data collections
Cleaned and elaborated data are then inserted into data marts
Structures for the staging area
Flat files
No overhead due to the presence of a DBMS infrastructure
No functionalities and indexes offered by DBMS
Used for persistent storage of data
Script-based ETL enables efficient searching, filtering, and replacing
XML datasets
Useful standard for input/output of the ETL system
Can rely on XPath, XQuery
Relational tables
Can be easily used also in absence of dedicated ETL tools
Advantages of RDBMS: integrity, SQL interface, clear metadata
May be slower
ETL process
Create a high-level schema of the flow of information
Choose the most adequate ETL tool
Identify complex data transformations and identify keys for tables
Construct dimension tables
Construct one static dimension
Construct update mechanisms for one dimension
Construct the remaining dimensions
Construct fact tables
Construct the fact table
Define incremental updates
Construct aggregates
Design and construct ETL automation
High-level diagram
Highlight tables including data that will be aggregated to feed the data warehouse
[Diagram with steps: Raw-product, Extract time, Add product type, Check ref. integrity, Aggregate sales per product per day]
Building dimensions
Static dimension table
Identify and assign keys to dimension tables
Maintain a mapping between production keys and data warehouse keys using a table
Handle changes in dimensions
Maintain updates in the mapping between production and data warehouse
Load dimension tables
Replace -> small tables
Load only changes -> large tables
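A minimal sketch of the mapping between production keys and data warehouse (surrogate) keys; the names (key_map, dw_key) are illustrative:

```python
# Sketch: maintaining a mapping table between production keys and surrogate
# data-warehouse keys for a dimension.

key_map = {}          # production key -> DW surrogate key
next_key = 1

def dw_key(production_key):
    """Return the surrogate key for a production key, creating it if needed."""
    global next_key
    if production_key not in key_map:
        key_map[production_key] = next_key
        next_key += 1
    return key_map[production_key]

# During the load, every incoming dimension/fact row replaces its production
# key with dw_key(production_key) before being written to the DW tables.
```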
Building fact tables
Initial load
Done when the DW is first created
ETL for all data in production files
Very heavy
Incremental updates
Move only changes since the last load
Performed periodically
Less heavy
Dimensions must be updated before facts
Relevant dimension rows for new facts must already be in the dimension tables
Extract (1)
Data are read from a data source
Internal scripts/tools at the data source, which export the data to be used
External programs, which extract the data from the source
Extracted data are stored in the staging area
Data sources can be of two types
Non-cooperative (e.g., snapshot, specific, logged, queryable sources)
Cooperative (e.g., replicated, call-back, internal action)
Extract (2)
Different kinds of extraction methods, including:
SQL-based applications
DB unload tools
Specific applications
Extraction is a time-consuming operation
Usually only changes are extracted periodically
Extract: computing deltas
Delta: changes in the data source since the last load
Store complete extracts (snapshots) in the staging area and compare them with the previous extract
Always possible
Handles deletions
High extraction times
Include an update timestamp in all rows at the data sources
Requires revision of the source systems
Reduces extract times
Cannot (alone) handle deletion
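A minimal sketch of the snapshot-comparison approach, assuming the previous and current extracts are kept in the staging area as dictionaries keyed by the production key:

```python
# Sketch: computing the delta by comparing the previous and current snapshots
# stored in the staging area. The dict-based representation is illustrative.

def compute_delta(previous, current):
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted  = {k: previous[k] for k in previous.keys() - current.keys()}
    updated  = {k: current[k] for k in current.keys() & previous.keys()
                if current[k] != previous[k]}
    return inserted, updated, deleted
```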
Transform
Translates the original format of the data into a desired format/state for the DW to be populated
Transform implies: data type conversion, normalization/de-normalization, building keys
Includes two sub-phases:
Data cleaning
Data integration
Transform: data quality
Data quality is important for data warehouses
Data should be
Precise
Complete
Consistent
Unique
Data cleaning (1)
Often data extracted from different sources are not of adequate quality
Data cleaning aims at increasing data quality before inserting data in the DW
Data cleaning can precede, follow, or work in parallel to data integration
Manually cleaning data takes too long
Data cleaning uses semi-automatic tools and anomaly detection techniques
e.g., thesauri, regular expressions, geographical information
Data cleaning (2)
The errors that data cleaning usually aims to correct include
Incomplete data: define a coding for missing information, try to reconstruct the missing data or to foresee it, ask input to the operator
Incorrect data (e.g., impossible values): syntactic rules and dictionaries to identify typos or different representation conventions
Inconsistent data: identify rules that establish a correlation, and use them to identify errors
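A minimal sketch of rules of the kinds just listed, coding missing values and using a regular expression plus a small dictionary to flag incorrect or inconsistent values; the patterns and the dictionary are purely illustrative:

```python
import re

# Sketch of simple cleaning rules; the date convention and the dictionary of
# known shop names are invented for illustration.

KNOWN_SHOPS = {"P&C", "Saturn", "MediaWorld"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # expected date convention

def clean_row(row):
    issues = []
    # Incomplete data: code missing values explicitly instead of leaving blanks
    if not row.get("shop"):
        row["shop"] = "UNKNOWN"
        issues.append("missing shop")
    # Incorrect data: impossible values (quantity assumed numeric)
    if row.get("quantity", 0) < 0:
        issues.append("impossible quantity")
    # Incorrect data: different representation conventions caught by a syntactic rule
    if not DATE_RE.match(row.get("date", "")):
        issues.append("malformed date")
    # Dictionary lookup to spot typos in categorical values
    if row["shop"] not in KNOWN_SHOPS | {"UNKNOWN"}:
        issues.append("unknown shop (possible typo)")
    return row, issues
```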
Integration
Several conceptual schemas need to be combined into a unified global schema
Differences in terminology must be resolved
Redundancy must be removed
Load
Fast loading is not an easy task
SQL-based loading is not efficient
DB load tools are faster
Rebuilding indexes as a consequence of load operations is an expensive task
Parallelization is important for improving efficiency
Dimensions can be loaded concurrently
Fact tables can be loaded concurrently
Partitions can be loaded concurrently
Initial load
Deliver dimension tables
Deliver fact tables
Continuous load
Must be scheduled and proceed in a specific order to guarantee that integrity and completeness constraints are not violated
Should be carefully planned