Information Management
Giovanni degli Studi di databases
Characteristics
Schema oriented
Copyright By Assignmentchef assignmentchef
ACID properties
Keep large amounts of persistent data
Thanks to transactions, they limit the complexity of the management of concurrency
Integration among multiple applications made possible through a shared database
(Mostly) standard relational model
Relational algebra for accessing data
Semi-structured data
(thanks to P. Samara5)
(some examples taken from Database Systems Concepts, Languages and Architectures, P. Atzeni, S. Ceri, S. Paraboschi and R. Torlone, McGraw-Hill)
Semi-structured data
Do not have a schema, or can respect it partly
XML: eXtensible Markup Language
Data properties expressed through markup of XML documents content
A model for the data can be specified (DTD, XML Schema) There are standardized query languages
Data separated from metadata
JSON: JavaScript Object Notation
Derived from JavaScript
Key-value pairs
Data are organized in batches and nested objects
XML syntax
XML files are composed of nested tags Each XML document has a root tag
All elements are enclosed in tags
Elements
Can be extended, can include children elements Have a content (can be empty)
Attributes
Are included in elements, cannot have children
Associate a value to elements without being part of their contents Have limitations
XML example (well-formed)
Namespaces and Schemas
Namespaces avoid ambigui0es for elements/a7ributes names Collec0ons of names iden0fied by URI
Schemas permit to define a model for documents
Tags and their proper0es, cardinali0es, domains, default values A document can be validated w.r.t. a schema
Schemas can be defined with DTD and XML Schema DTD simpler, but XML Schema more flexible
Permits to define elements and their cardinalities, and attributes
Can specify whether it contains children and their number
Types and constraints (required, implied, fixed)
DTD example
format (paperback|hardback) paperback
XML Schema complex elements
Complex elements
Can contain attributes and other elements Can have indicators
any, all, choice, sequence: ordering among elements {min,max}Occurs: min/max number of occurrences, grouping
XML Schema attributes
Attributes
Can have default and fixed values
Can be optional
Can have restrictions (e.g., min/max value)
XML Schema example
Querying XML documents
Idea: XML as a tree structure to be visited
Queries are based on XPath and FLWOR expression
For, Let, Where, Order, Return
Defines path expressions selecAng the objects in a path
Can return a set of nodes, a Boolean value, numbers, strings Paths are evaluated w.r.t. a context node
Current node (.), parent node (..)
Child of a node (/), descendant of a node (//), any node (*) ANribute of current node predicate ([]), posiAon ([n])
XPath examples
doc(books.xml)/List/Book[Publisher= Feltrinelli AND @availability=Y] /Title
doc(books.xml)//Author doc(books.xml)/List/Book[1]/*
XQuery FLWOR expressions
FOR, LET
They declare variables, difference in bindings
It expresses conditions filtering objects produced by LET and FOR
It orders objects produced by LET and FOR
It generates the result of the query (node, forest, or value, and can include element constructors and references to variables)
XQuery example (1)
FOR $lib IN doc(books.xml)//Book
FOR $auth IN $lib/Author
RETURN $auth
LET $lib := doc(books.xml)//Book
RETURN $lib
FOR $lib IN doc(books.xml)//Book
WHERE $lib/Publisher=Bompiani
RETURN $lib
XQuery example (2)
FOR $lib IN doc(books.xml)//Book
WHERE $lib/Publisher=Bompiani
RETURN
ORDER BY(Title ASCENDING)
XQuery example (3)
let $author := doc(catalog.xml)//author
for $name in distinct-values($author/text())
order by $name
return
for $book in doc(catalog.xml)//libro
where $book/author/text()=$name
} return $book/title
XML Vs. DBMS
RDBMS with extended capabilities
Interfaces for translating XML to/from relational
Interfaces for translating XML queries to relational queries (Loose) mapping between schemas
Native XML databases
They lose RDBMS functionalities but natively manage XML structures (e.g.,
element inclusion, ordering among siblings)
Appealing when data are poorly structured (e.g., schelamess)
A bit of history
1970/1980: Relational databases Storage is expensive
Dara are normalized
Data are stored regardless of how they will be used RDBMS become popular
Client/server model
SQL becomes a standard for querying databases
1990: WWW and Internet
2000: Web 2.0
Storage is less expensive
e-Commerce, social mediaadata growaneed to scale with data!
A necessary introduction
NoSQL does not mean Not SQL but it is more likely to stand for not relational
NoSQL is now interpreted as Not Only SQL It permits to use SQL-like queries
NoSQL is not a single product but a collection of diverse, and sometimes related, concepts about data storage and manipulation
NoSQL: definition
There is not a generally accepted definition in the literature
Characteristics of NoSQL
schemaless
not using only SQL
generally open-source (even though the NoSQL notion is also applied to
closed-source systems)
generally driven by the need to run on clusters (but graph databases do not typically fall in this class)
generally not handling consistency through ACID transactions (but graph databases instead do it)
NoSQL models
There exist different kinds of NoSQL systems, and each family presents different variations
The most important families of NoSQL databases are: Key-value
Document-based Column-oriented Graph-based
Aggregate-oriented Vs. graph-based
Aggregate orienta+on
(partly taken from NoSQL Dis*lled: A Brief Guide to the Emerging World of Polyglot Persistence, P. J. Sadalage, M.Fowler, Addison-Wesley)
Impedance mismatch (1)
Difference between the relational model and the in-memory data structures
In-memory structures are more flexible (e.g., they can be nested)
To use more flexible in-memory structures, it is necessary to translate them
in a relational representation
Impedance mismatch more relevant with the development of object-
oriented programming languages
Introduction of object oriented databasesafailure
Impedance mismatch easier to deal thanks to object-relational mapping frameworks
Not a real solution!
Impedance mismatch (2)
Example of single aggregate structure mapped in many relational tables
Scalability issues
Due to the considerable increase in the amount of data to which we assisted in 2000s, scalability is paramount
Vertical scalability: more powerful machinesaexpensive
Horizontal scalability: clustersaless costly and more reliable
Scalability issue: clusters
Clusters are more suitable to the emerging scenarios (e.g., data generated by social networks)
RDBMSs have not been designed to operate on clusters Designed as single-server
Need to think of an alternaDve to RDBMS for data management
Aggregate orientation
Intuition: operate on data in units with a more complex structure than a tuple
Aggregate: a collection of related objects to be treated as a unit goal 1: update aggregates atomically
goal 2: communicate with storage in terms of aggregates
Aggregates fit distributed scenarios
natural unit for sharding and replication (more on this later)
Aggregate orientation (example)
Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
Relational modeling
No data is replicated (normalization)
Referential integrity (foreign key constraints)
DBMS cannot use knowledge of the aggregates for storage
Aggregate orientation (example)
Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
JSON format (excerpt)
Two aggregates: customer and order Customer contains addresses
Order contains payments that contains addresses
To aggregate or not to aggregate?
Data are organized depending on how they will be accessed
Aggregation is not a property of the data, but of how data will be
used by applications
focus on the unit of interaction with the storage
Not always a good idea:
A given aggregate structure can be an obstacle with a given application (-) Fits well with operations on clusters (+)
CAP theorem (1)
Aggregate-oriented databases and ACID properties are not a good match
CAP theorem (E. Brewer, 2000)
Of three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions) only two can be achieved at any given moment in time.
CAP theorem (2)
Consistency
Availability
Par11on Tolerance
CAP theorem (3)
Consistency
Every request receives the correct response
Once data has been written, all future read requests operate on this version of the data
Availability
The data are available and responsive
Each request eventually receives a response
If you can access a node in the cluster, it can read and write data
Tolerance to network partitions
The cluster can survive to partitioning of the network that break the cluster in multiple partitions that cannot communicate with each other
CAP theorem: example (1)
Two nodes with a replica of the data, having value V0 initially
N1 runs algorithm A writing a new value V1 N2 runs algorithm B reading the data value
CAP theorem: example (2)
Ideal immediate propagation
CAP theorem: example (3)
In real world scenarios, propaga1on is not immediate!
CAP theorem: example (4)
If we want a system
highly available
composed of a large number of nodes
where each node resists to network problems
then we have to accept that sometimes N1 might see V1
N2 might see V2
CAP theorem: how to deal with it? (1)
Renounce to tolerance to network partitions
Single server CA system
Resistance to network partitioning not requested A single machine cannot partition
CA cluster
If a partition ever happens, all the nodes in the cluster go down (no impact on
the CAP theorems definition of Availability) Hard to guarantee
Operate on a single node
CAP theorem: how to deal with it? (2)
Renounce to consistency or availability
Solution adopted by NoSQL systems
They trade-off between consistency and availability
Not always a Boolean decision, trade a little consistency for a little availability How much you can trade depends on the specific application domain
Book a room via hotel booking web site replicated at two locations C:bothnodesagreeontheserializationofrequests
NetworkpartitioningwouldcompromiseA
IncreaseA:master-slaveapproach
FurtherincreaseA:stillacceptbookinglocally,resolveincaseoflastroom
CAP theorem: how to deal with it? (3)
NoSQL is a varied world
In general aggregate-oriented databases support atomic manipula:on of a
single aggregate at a :me
Atomic manipula:on of mul:ple aggregatesamanaged in the applica:on
Considera:ons on where atomicity is wanted is part of the strategy to divide data into aggregates
CAP theorem: conclusions
Better think about the tradeoff consistency and latency
We can always improve consistency by involving more nodes, but
each node increases the response time
We can then think of availability as the limit of latency that we can
tolerate: once latency gets too high, we consider the data unavailable
Every system should be designed to ensure both C and A in normal situations
When a partition occurs, the decision between C and A can be taken
When the partition is resolved the system takes corrective actions coming back to work in normal situation
From ACID to BASE properties
BASE: Basically Available, Soft state, Eventual consistency
Basically available: the system is available most of the time and there
could exist a subsystem temporarily unavailable
Soft state: data are volatile in the sense that their persistence is in the hand of the user that must take care of refreshing them
Eventual consistency: the system eventually converges to a consistent state (lazy updates, already discussed for distributed databases)
Key-Value databases (1)
Used when all accesses to the database are via a primary key, associative arrays
Think of a relational table with two attributes: ID: has unique values
NAME: attribute of data type string
The client can:
Get the value of a key
Set a value for a key
Delete a key (and its value)
Simple hash tableaCost of operations O(1)
Examples: Riak, Redis,B, Amazon DynamiDB, Voldemort (used by LinkedIn), MemcacheDB,
Key-Value databases (2)
Using a simple hash table, keys are stored in buckets
The value associated with a key is an obscure blob, which can hardly be analyzed by the DBMS
Can be any kind of data structure (e.g., sets, hashes, etc.) Can be elaborated by applicaDons
Very widely used for their simplicity and scalability
CAP: some systems permit to set values for N: number of nodes for replica storage
R: number of nodes to be read for a successful read
W: number of nodes to be wriLen for a successful write (relevant for consistency)
Key-Value features
Consistency: applies only for operations on a single key Riak implements eventual consistency
Transactions: some key-value stores have a loose definition of transaction, with however no guarantee on write operations
Structure of data: the value associated with a key can have arbitrary structure (XML, JSON, text, )
Scaling: key-value systems scale well
Some products apply sharding (i.e., the key value determines the node in the
cluster where data are stored)
Query: can operate only on the key
Key-Value: use cases
Suitable for:
Store session information
User profiles and user preferences Shopping cart data
Not suitable for:
Relationships among data
Multioperation transactions
Query by data
Operations by sets
Document databases (1)
Document databases store documents in the value part of the key- value store
Key-value stores where the value is examinable.
The database stores and retrieves documents
The documents stored are similar to each other but do not have to be exactly the same
Examples: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB
Document databases (2)
There are no empty/null a0ributes: if an a0ribute is not found, it was not set or not relevant to the document
New a0ributes can be created without the need to define them or to change the exis?ng documents
Embedding child documents as sub- objects inside documents provides for easy access and be0er performance
{ firstname: Martin,
likes: [ Biking,
Photography ],
lastcity: Boston,
lastVisited:
firstname: Pramod,
citiesvisited: [ Chicago, London, Pune],
addresses: [
{ state: AK,
city: DILLINGHAM,
type: R
{ state: MH,
city: PUNE,
type: R }
lastcity: Chicago
Document database features (1)
Transactions: transactions operating on a single document are generally considered atomic
Availability: document databases try to improve on availability by replicating data using the master-slave setup
Document database features (2)
Consistency: MongoDB uses replica sets for managing consistency and tuning it
Wait, for each write operation, that it is propagated to a given number of slaves
The higher the number of slaves, the higher
consistency guarantees
Document database features (3)
Query: can query the data inside the document without having to retrieve the whole document by its key
MongoDB has a query language which is expressed via JSON CouchDB supports (materialized and dynamic) views
Scaling: depending on which kind of action is growing
Increase the number of slave copies to support more reads
Shard (i.e., horizontally fragment) and properly allocate data to support more writes
Each shard can be a replica set to improve reads within
CS: assignmentchef QQ: 1823890830 Email: [email protected]
Reviews
There are no reviews yet.