Information Management
Universita degli Studi di Databases (slides partially provided by Prof. Samarati)
Copyright By Assignmentchef assignmentchef
Paradigms for data distribution
Client-server architectures
Separation between the database server and the client
Distributed databases
Several database servers used by the same application
Parallel databases
Several storage devices and processors that operate in parallel to improve
performance
Replicated databases
Same data physically stored at different servers
Data warehouses
servers specialized for the management of data dedicated to decision support
Kinds of architectures
Two kinds of systems
OLTP (On-Line Transaction Processing)
for the optimized management and reliable transactions on database server
specialized on the management of hundreds or even thousands of transactions per second
OLAP (On-Line Analytical Processing)
for data analysis
operate on data warehouse servers, specialized on the management of data for decision support systems
Server typically include both OLAP and OLTP functionalities
Properties of distributed systems
Portability
possibility of transporting programs from an environment to another established at compile time
facilitated by language standards (e.g., SQL-2, SQL-3)
Interoperability
ability of interacting among heterogeneous systems
established at compile time
facilitated by standard protocols for data access (e.g., Database Connectivity (ODBC) e X-Open Distributed Transaction Processing (DTP))
Client-server architectures (1)
Interacting software processes are divided between client (request services) and server (provide services)
Requires a precise definition of service interface, listing precisely the services offered by the server
The client has an active role (makes the request)
The server has a reactive role (replies)
Usually, a client process requests a few services in a sequence to one or more server processes
Usually, a server process replies to multiple requests by different client processes
Client-server architectures (2)
The machine acting as client should
be adequate for interacting with the final user
support productivity tools (email, word processing, Internet access, and workflow management)
The machine acting as server should
have considerable memory capacity (to support buffer management) have considerable disk capacity (to store the whole database)
Client-server architectures (3)
Widely used for databases:
Client and server functions are well defined
Provides a good separation of the management activities client suitable for interaction with the user
server suitable for data management
SQL offers an ideal programming paradigm for the identification of service interfaces
Client-server architectures (4)
SQL offers an ideal programming paradigm for the identification of services interfaces
SQL queries formulated by the client and sent to the server
query results are computed by the server and sent to the client
SQL standardization, portability, and interoperability enable the development of client applications that involve different servers
Client-server architectures (5)
Often the server is multi-threaded:
Behaves as a single process that dynamically operates for different
transactions
Each execution unit of the server process for a given transaction is a thread
Client-server architectures (6)
Servers are processes that are permanently active and check: an input queue for clients requests
an output queue for query results
Often, a dispatcher process distributes requests among servers and returns replies to clients
When dispatchers can dynamically determine the number of active server processes depending on the number of requests, we say that we have a class of servers
Client-server architectures (7)
Input queue
Database Server
Output queue
Server Process
Two-tiered architectures (1)
There are two machines: a client and a server
There are different solutions for deciding to which machine each layer
is assigned
Simple approach:
the client has the user-interface level
the server has the processing and data levels
This is not the only solution
Two-tiered architectures (2)
Fat vs thin client
Fat Clients
+ More efficient for users
+ Provide higher system scalability
More complex and hard to manage
Client software is more error prone and depends on the client machine and its
operating system Thin Clients
+ Easier to manage
Trend to move computation to the client
Multi-tiered architectures
A single server is replaced by multiple servers running on different machines
A server may act as a client
Different programs in the processing level reside on different servers
Distributed databases
System of databases where at least one client interacts with more servers for the execution of an application
Distributed databases: Pros
Respond to applications requirements
organizations have a distributed structure
data distribution permits the management of data where they are generated and used
Flexibility and modularity
can be configured with continuous additions and modifications of
components
Reliability
can react to failures reducing performance instead than stopping operability
Distributed databases: classification (1)
Kind of involved DBMSs
Homogeneous DDB: all the servers use the same DBMS Heterogeneous DDB: the servers use different DBMSs
Kind of network
Local Area Network (LAN) Wide Area Network (WAN)
Distributed databases: classification (2)
Kind of DBMS
Kind of network
Homogeneous
Management and financial applications
Booking systems and financial applications
Heterogeneous
Management and inter-functional applications
Integrated booking systems and inter- banking systems
Local independency and cooperation
The distributed database can be considered, from an abstract point of view, as a single database
It should be designed in such a way to have applications that are executed on a single server, minimizing
the need of interaction
the need of data exchange
Data fragmentation (1)
Adopts algebraic operations on a relation R to divide it into fragments R1, , Rn
Horizontal fragmentation
each Ri has a subset of the tuples in R
each Ri can be interpreted as the result of a selection over R
Vertical fragmentation
each Ri has, in its schema, a subset of the attributes in R
each Ri can be interpreted as the result of a projection over R
Data fragmentation (2)
Correctness property
completeness: each data in R must be represented in one of its fragments Ri
reconstructability: R should be fully reconstructable starting from its fragments
Horizontal fragments are disjoint
do not have common tuples
Vertical fragments include the primary key of R guarantees reconstructablility
Horizontal fragmentation: example (1)
EMPLOYEE(Empnum,Name,Deptnum,Salary,Taxes)
Fragments:
EMPLOYEE1 = Empnum3 EMPLOYEE EMPLOYEE2 = Empnum>3 EMPLOYEE
To reconstruct the relation:
EMPLOYEE = EMPLOYEE1 E EMPLOYEE2
Horizontal fragmentation: example (2)
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
EMPLOYEE1 (Empnum3 EMPLOYEE)
EMPLOYEE2 (Empnum>3 EMPLOYEE)
Administration Production
3.7M 3.5M 5.3M
1.2M 1.1M 2.1M
Vertical fragmentation: example (1)
EMPLOYEE(Empnum,Name,Deptnum,Salary,Taxes) Fragments:
EMPLOYEE1 = OEmpnum,Name(EMPLOYEE)
EMPLOYEE2 = OEmpnum,Deptnum,Salary,Taxes(EMPLOYEE)
To reconstruct the relation:
EMPLOYEE = EMPLOYEE1 EMPLOYEE2
Vertical fragmentation: example (2)
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
EMPLOYEE1 = OEmpnum,Name(EMPLOYEE) EMPLOYEE2 = OEmpnum,Deptnum,Salary,Taxes(EMPLOYEE)
EMPLOYEE1 EMPLOYEE2
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
Allocation schema
Describe the mapping of relations or fragments to servers where they are stored
Each fragment corresponds, at the physical level, to a file and is allocated to a specific server
fragments are stored
the original relation is a view over fragments (virtual)
Allocation can be
non-redundant: each fragment or relation is allocated to one server only
redundant: at least one fragment or relation is allocated to multiple servers
Transparency levels (1)
Distinguishing between fragmentation and allocation permits to write applications operating at different levels
From the most abstract and independent from data fragmentation To the most concrete and dependent on their physical allocation
Transparency levels (2)
fragmentation: the programmer
does not need to know the fragmentation does not need to know the allocation
allocation: the programmer
needs to know the structure of fragments does not need to know the allocation
language: the programmer
needs to know the structure of fragments needs to know the allocation
no transparency
each DBMS accepts its own SQL dialect: the system is heterogeneous and the DBMSs do not support a common interoperability standard
Transparency levels: example
SUPPLIER(Snum,Name,City)
horizontal fragments:
SUPPLIER1 = city=Milan (SUPPLIER) SUPPLIER2 = city=Rome (SUPPLIER)
allocation of horizontal fragments (with replication):
write a procedure that, given a supplier number, returns its name
Fragmentation transparency: example
The programmer
does not need to know the fragmentation does not need to know the allocation
procedure Query1(:snum,:name); select Name into :name from Supplier
where Snum = :snum;
end procedure;
Allocation transparency: example
The programmer
needs to know the structure of fragments
does not need to know the allocation
in case of redundancy does not need to indicate which copy to use for access (replication transparency)
procedure Query2(:snum,:name); select Name into :name from Supplier1
where Snum = :snum;
if :empty then
select Name into :name
from Supplier2
where Snum = :snum;
end procedure;
Language transparency: example
the programmer
needs to know the structure of fragments
needs to know the allocation
in case of replication, needs to indicate which copy to use for access
procedure Query3(:snum,:name); select Name into :name
from where Snum = :snum;
if :empty then
select Name into :name
from where Snum = :snum;
end procedure;
Query optimization
The application can be optimized through:
parallelism
submit requests in parallel in contrast to in sequence reduce the overall response time
knowledge of the logical properties of fragments query the fragment where data reside
increase efficiency but reduce flexibility
Query optimization: example
procedure Query4(:snum,:name,:city); case :city of
select Name into :name from Supplier1
where Snum = :snum;
select Name into :name from Supplier2
where Snum = :snum;
end procedure;
Fragmentation: exercise
SHOP (ID, Name, Kind, Tel, City, IDWarehouse) WAREHOUSE (ID, Street, City, Tel)
WAREHOUSE is split in 4 fragments based on the city where warehouses are located (Milan, Florence, Rome, Venice), allocated at the three corresponding shipment centres in the same cities
SHOP is split in 3 fragments based on the kind of shop (clothes, shoes, bath) allocated at three coordination centres, one for each kind of shop, in Cagliari, Siena, and Bari, respectively
Fragmentation: exercise (continuation)
Write a query that retrieves the name, kind, and telephone number of the shop with ID=123 according to fragmentation, allocation, and language transparency.
Transaction classification
Progressively increasing complexity levels
Remote requests
Remote transactions
Distributed transactions Distributed requests
Remote requests
Read-only transactions including an arbitrary number of SQL queries All the queries are addressed to a single remote DBMS
The remote DBMS can only be queried
Remote transactions
Composed of an arbitrary number of SQL commands (select, insert, delete, update)
All the commands are addressed to a single remote DBMS Each transaction writes one DBMS only
Distributed transactions
Composed of an arbitrary number of SQL commands (select, insert, delete, update) addressed to an arbitrary number of remote DBMSs
Each SQL command refers to a single DBMS
Each transaction can update different DBMSs requires two-phase-commit protocol
Distributed requests
Arbitrary transactions composed of an arbitrary number of SQL commands (select, insert, delete, update) addressed to an arbitrary number of remote DBMSs
Each command can refer to any DBMS
Requires a distributed query optimizer
Transaction: example (1)
CC(Num,Name,Balance)
CC1: Num1000(CC)
CC2: Num>1000(CC)
assume allocation transparency
Distributed transaction
Transfer 100 Euros from account 354 to account 1487
it is necessary to guarantee atomicity: either both updates are executed or none of them
Transaction: example (2)
begin transaction
update CC1
set Balance = Balance 100 where CCNum = 354;
update CC2
set Balance = Balance + 100 where CCNum = 1487;
end transaction
Technology of distributed databases (1)
Data distribution does not affect
consistency: integrity constraints describe only local properties
it is a limit of current DBMS technology
persistency: each system guarantees persistency to locally stored data
local recovery (log, checkpoint, dump) mechanisms Data distribution affects
isolation
atomicity
Technology of distributed databases (2)
Data distribution requires to modify
Query optimization Concurrency control
isolation
Reliability control atomicity
Distributed query optimizer (1)
Under the responsibility of the DBMS that receives the query
Decides how to divide the query in sub-queries, each addressed to a
specific DBMS
Defines a strategy (plan) for distributed execution:
Coordinated execution of different programs at different DBMSs Data exchange among DBMSs
Guarantees global optimization
Distributed query optimizer (2)
In the computation of distributed queries cost, the amount of data transmitted over the network is particularly important
Ctot=CI/OnI/O +CCPUnCPU +Ctrntr
ntr : amount of data transmitted over the network Ctr : transmission cost
Concurrency control
In a distributed system, a transaction ti can execute multiple sub- transactions at different nodes:
tij execution of ti at node j
t1 : r11(x) w11(x) r12(y) w12(y) t2 : r22(y) w22(y) r21(x) w21(x)
Concurrency control: example
S1 : r11(x) w11(x) r21(x) w21(x) S2 : r22(y) w22(y) r12(y) w12(y)
Locally serializable (serial) Globally non serializable
The conflict graph includes a cycle:
on node 1, t1 precedes t2 and is in conflict with t2 on node 2, t2 precedes t1 and is in conflict with t1
of distributed transactions can be compromised by failures/malfunctions
node failure (software/hardware)
message lost: the execution of a protocol is left in a bad state
each message of the protocol (msg) is followed by an acknowledgement of receipt message (ack)
the loss of a message leaves the sender not sure about its reception
failure of connection links: may cause network partitioning
a transaction can be active simultaneously in different sub-networks
Global serializability
Local serializability does not guarantee global serializability
S1 : r11(x) w11(x) r21(x) w21(x) S2 : r22(y) w22(y) r12(y) w12(y)
Locally serializable (serial) Globally non serializable
Global serializability requires the presence of a serial schedule S equivalent to all the local schedules Si resulting at each node
The projection of S on node i must be equal to Si
Global serializability: properties
Conflict-serializable global schedule
guaranteed if each scheduler uses strict 2PL and executes atomic commit when all the sub-transactions at the different nodes have all the resources
Serial global schedule
guaranteed if each distributed transaction acquires a single timestamp and uses it in all its requests to all the schedulers that perform concurrency control based on timestamp
requires the assignment of a global timestamp
Logical clocks
Need to assign timestamps that reflect precedence among events in a distributed system
If two processes do not interact it is not necessary that they are synchronized
No matter that all processes agree on what time it is, but on the order in which events occur
Lamport clocks and vector clocks are based on these observations
Happened-before
aab means that all transactions agree that first a occurs and then b occurs
a and b are in the same transaction and a occurs before b
a is the event of sending a message and b is the event of receiving the
The relationship is transitive
Events in different transactions that do not exchange messages are concurrent
it is a partial order relationship
Lamport clocks
How do we maintain a global view on the systems behavior that is consistent with the happened-before relationship?
Attach a timestamp C(e) to each event e s.t.:
If a and b are two events in the same transaction, and aab, then C(a)
Reviews
There are no reviews yet.