Lecture 13-14 Big Data and CouchDB
Luca Morandini Cloud Architect Melbourne eResearch Group University of Melbourne
Outline of this Lecture
Part 1: Big data challenges and architectures
DBMSs for distributed environments
What distributed DBMSs look like: MongoDB vs CouchDB
Consistency and availability in distributed environments
MapReduce algorithms
Sharding
Part 2: Introduction to CouchDB
Managing documents
HTTP API
Queries (Views, Mango and Full-text Search)
Part 3: Workshop on CouchDB
Setting up a 3-node cluster with Docker
Storing and retrieving data using CouchDB
Part 1: Big data Challenges and Architectures
Big data Is Not Just About Bigness
The four Vs:
Volume: yes, volume (Giga, Tera, Peta, …) is a criterion, but not the only one
Velocity: the frequency at which new data is brought into the system and analytics are performed
Variety: the variability and complexity of the data schema. The more complex the data schema(s) you have, the higher the probability of them changing along the way, adding more complexity.
Veracity: the level of trust in data accuracy (provenance); the more diverse your sources, and the more unstructured they are, the less veracity you have.
Big Data Calls for Ad-hoc Solutions
While Relational DBMSs are extremely good at ensuring consistency, they rely on normalized data models that, in a world of big data (think about Veracity and Variety) can no longer be taken for granted.
Therefore, it makes sense to use DBMSs that are built upon data models that are not relational (relational model: tables, columns and relationships amongst tables -that is, relational algebra).
While there is nothing preventing SQL from being used in distributed environments, alternative query languages have been adopted by distributed DBMSs, hence these are sometimes called NoSQL DBMSs
DBMSs for Distributed Environments
A key-value store is a DBMS that allows the retrieval of a chunk of data given a key: fast, but crude (e.g. Redis, RocksDB, Berkeley DB)
A BigTable DBMS stores data in columns grouped into column families, with rows potentially containing different columns of the same family (e.g. Apache Cassandra, Apache Accumulo)
A Document-oriented DBMS stores data as structured documents, usually expressed as XML or JSON (e.g. Apache CouchDB, MongoDB)
A Tale of Two Clusters
Distributed databases are run over clusters, that is, sets of connected computers
Clusters are needed to:
distribute the computing load over multiple computers, e.g. to improve availability
store multiple copies of data, e.g. to achieve redundancy
Consider two document-oriented DBMSs (CouchDB and MongoDB) and their typical cluster architectures
CouchDB Cluster Architecture
All nodes answer requests (read or write) at the same time
Sharding (splitting of data across nodes) is done on every node
When a node does not contain a document (say, a document of Shard A is requested from Node 2), the node requests it from another node (say, Node 1) and returns it to the client
Nodes can be added/removed easily, and their shards are re-balanced automatically upon addition/deletion of nodes
In this example there are 3 nodes, 4 shards and a replica number of 2
MongoDB Cluster Architecture
Sharding (splitting of data) is done at the replica set level, hence it involves more than one replica set (a shard sits on top of a replica set)
Only the primary node in a replica set answers write requests, but read requests can, depending on the specifics of the configuration, be answered by every node (including secondary nodes) in the set
Updates flow only from the primary to the secondaries
If a primary node fails, or discovers it is connected to a minority of nodes, a secondary of the same replica set is elected as the primary
Arbiters (MongoDB instances without data) can assist in breaking a tie in elections.
Data are balanced across replica sets
Since a quorum has to be reached, it is better to have an odd number of voting members
MongoDB vs CouchDB Clusters
MongoDB clusters are considerably more complex than CouchDB ones
MongoDB clusters are less available, as by default only primary nodes can talk to clients for read operations (and exclusively so for write operations)
MongoDB software routers (mongos) must be embedded in application servers, while any HTTP client can connect to CouchDB
Losing two nodes out of three in the CouchDB architecture shown means losing access to between one quarter and half the data, depending on which nodes fail
Depending on the cluster configuration parameters and the nature (primary or secondary) of the lost nodes, losing two nodes in the MongoDB example may imply losing write access to half the data (even though there are ten nodes in the cluster instead of three), and possibly read access too
These differences are rooted in different approaches to an unsolvable problem, a problem defined by Brewer's CAP Theorem
Consistency, Availability, Partition-Tolerance
Consistency: every client receiving an answer receives the same answer from all nodes in the cluster
Availability: every client receives an answer from any node in the cluster
Partition-tolerance: the cluster keeps on operating when one or more nodes cannot communicate with the rest of the cluster
Brewer's CAP Theorem
Consistency, Availability and Partition-Tolerance: pick any two
[Venn diagram of the three properties: Consistency, Availability, Partition-Tolerance]
Note: the intersection of the three sets is empty
Brewer's CAP Theorem
While the theorem shows all three qualities are symmetrical, Consistency and Availability are at odds only when a Partition happens
Hard network partitions may be rare, but soft ones are not (a slow node may be considered dead even if it is not); ultimately, every partition is detected by a timeout
Partitions can have consequences that impact the cluster as a whole, e.g. a distributed join is only complete when all sub-queries return
Traditional DBMS architectures were not concerned with network partitions, since all data were supposed to be in a small, co-located cluster of servers
The emphasis on numerous commodity servers can result in an increased number of hardware failures
The CAP theorem forces us to consider trade-offs among different options
CAP Theorem and the Classification of Distributed Processing Algorithms
[Venn diagram: Two-phase commit lies at the intersection of Consistency and Availability; Paxos at the intersection of Consistency and Partition-Tolerance; Multi-Version Concurrency Control at the intersection of Availability and Partition-Tolerance]
Consistency and Availability: Two-phase commit
This is the usual algorithm used in relational DBMSs (and, to some extent, in MongoDB); it enforces consistency by:
locking data that are within the transaction scope
performing transactions on write-ahead logs
completing transactions (commit) only when all nodes in the cluster have performed the transaction
aborting transactions (rollback) when a partition is detected
This procedure entails the following:
reduced availability (data lock, stop in case of partition)
enforced consistency (every database is in a consistent state, and all are left in the same state)
Therefore, two-phase commit is a good solution when the cluster is co-located, less so when it is distributed
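The steps above can be sketched as a toy coordinator loop (all class and function names are invented; a real implementation adds timeouts, durable logs and crash recovery):

```python
# Toy sketch of a two-phase commit coordinator (invented names, no networking).
class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.log = []                     # stands in for the write-ahead log

    def prepare(self, txn):
        if not self.healthy:              # e.g. node unreachable (partition)
            return False
        self.log.append(("prepared", txn))
        return True                       # vote to commit

    def commit(self, txn):
        self.log.append(("committed", txn))

    def rollback(self, txn):
        self.log.append(("rolled back", txn))

def two_phase_commit(participants, txn):
    # Phase 1: every participant must vote yes
    if all(p.prepare(txn) for p in participants):
        for p in participants:            # Phase 2: commit everywhere
            p.commit(txn)
        return "committed"
    for p in participants:                # any no-vote (or timeout) aborts all
        p.rollback(txn)
    return "aborted"

nodes = [Participant("n1"), Participant("n2"), Participant("n3", healthy=False)]
print(two_phase_commit(nodes, "t1"))      # prints "aborted": a partitioned node blocks the commit
```

Note how a single unreachable node makes the whole transaction abort: consistency is preserved at the cost of availability.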
Consistency and Partition-Tolerance: Paxos
This family of algorithms is driven by consensus, and is both partition-tolerant and consistent
In Paxos, every node is either a proposer or an acceptor:
a proposer proposes a value (with a timestamp)
an acceptor can accept or refuse it (e.g. if the acceptor has received a more recent value)
When a proposer has received a sufficient number of acceptances (a quorum is reached), a confirmation message is sent to the acceptors with the agreed value
Paxos clusters can recover from partitions and maintain consistency, but the smaller part of a partition (the part that is not in the quorum) will not send responses to clients, hence availability is reduced
Raft is a similar, but simpler algorithm solving the same problem
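A toy, single-round sketch of the acceptance rule described above (names invented; real Paxos has a separate prepare/promise phase and learners):

```python
# Toy single-round sketch of Paxos-style quorum acceptance (no networking).
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest proposal number seen so far
        self.accepted = None

    def accept(self, number, value):
        if number >= self.promised:   # refuse anything older than a promise
            self.promised = number
            self.accepted = (number, value)
            return True
        return False

def propose(acceptors, number, value):
    votes = sum(a.accept(number, value) for a in acceptors)
    quorum = len(acceptors) // 2 + 1
    return votes >= quorum            # value is chosen only with a majority

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, number=1, value="x"))   # True: all 5 accept, quorum reached
print(propose(acceptors, number=0, value="y"))   # False: proposal is too old
```

If a partition leaves fewer than a majority of acceptors reachable, `propose` returns False: the minority side simply cannot make progress, which is exactly the reduced availability noted above.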
Availability and Partition-Tolerance: Multi-Version Concurrency Control (MVCC)
MVCC is a method to ensure availability (every node in a cluster always accepts requests), and some sort of recovery from a partition by reconciling the single databases with revisions (data are not replaced, they are just given a new revision number)
In MVCC, concurrent updates are possible without distributed locks (in optimistic locking only the local copy of the object is locked), since the updates have different revision numbers; the transaction that completes last will get a higher revision number, hence will be considered as the current value.
In case of cluster partition and concurrent requests with the same revision number going to two partitioned nodes, both are accepted, but once the partition is resolved there will be a conflict, a conflict that has to be solved somehow (CouchDB returns a list of all current conflicts, which are then left to the application to solve). Think of it as something similar to a software revision control system such as Git.
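A minimal sketch of the revision mechanism, loosely inspired by CouchDB's _rev numbers (class and method names are invented; CouchDB's real revision strings also embed a content hash):

```python
# Minimal MVCC sketch: old revisions are kept, updates based on a stale
# revision raise a conflict that the application must reconcile.
class Conflict(Exception):
    pass

class MVCCStore:
    def __init__(self):
        self.docs = {}                          # doc id -> list of (rev, value)

    def put(self, doc_id, value, based_on=0):
        revs = self.docs.setdefault(doc_id, [])
        if based_on != len(revs):               # update built on a stale revision
            raise Conflict(f"current rev is {len(revs)}, not {based_on}")
        revs.append((len(revs) + 1, value))     # data are not replaced, just re-revisioned
        return len(revs)                        # the new revision number

    def get(self, doc_id):
        return self.docs[doc_id][-1]            # the highest revision is current

store = MVCCStore()
r1 = store.put("acc1", {"balance": 1000})             # rev 1
store.put("acc1", {"balance": 900}, based_on=r1)      # rev 2 becomes current
try:
    store.put("acc1", {"balance": 800}, based_on=r1)  # concurrent, stale update
except Conflict as e:
    print("conflict:", e)
```

No locks are taken: every node can always accept writes, and conflicts surface later as data to be reconciled.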
Extra: The Peculiar Case of the Blockchain
Blockchains can be described as distributed, unalterable, verifiable databases. So, how do they map into this classification? (To fix ideas, let's focus just on the Bitcoin distributed ledger.)
Bitcoin works on a cluster of peer-to-peer nodes, each containing a copy of the entire database, operated by different (possibly malicious) actors.
Since new nodes can enter the system at any time, and every node has the entire database, availability is not an issue even in case of a partition, but consistency cannot be assured, since you cannot trust a single node.
To achieve consistency, Bitcoin uses a form of MVCC based on proof-of-work (a proxy for the computing power used in a transaction) and on repeated confirmations by a majority of nodes of a history of transactions.
Bitcoin database security is guaranteed by the impossibility of a single actor having enough computing power to alter the history of transactions (with 6 confirmations, an actor that controls 18% of the computing power has just a 1% probability of compromising a legitimate transaction)
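The quoted figure can be sanity-checked with the attacker-success calculation from the Bitcoin whitepaper (the function name is mine; q is the attacker's share of hash power, z the number of confirmations):

```python
import math

# Probability that an attacker with hash-power share q catches up after the
# honest chain has z confirmations (Nakamoto's calculation).
def attacker_success(q, z):
    p = 1.0 - q
    lam = z * (q / p)
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

print(attacker_success(0.18, 6))   # just under 1%, consistent with the slide's claim
```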
Extra: Why Blockchain does not Help Us
Blockchains are very inefficient by design:
proof-of-work wastes computing power and is slow
every node contains a copy of the entire database
Consider the cost of a Bitcoin transaction (about 2 USD, down from 59 USD a year ago) and the time it takes (anything between 10 minutes and a couple of hours)
By contrast, consider the cost of a bank transaction in the SEPA system (the European Union interbank payment system): 30 to 50 cents, and it is done in seconds
The difference is that SEPA (and the distributed databases described here) assumes that no node can be controlled by a malicious actor bent on altering the database
In conclusion: blockchain is a solution to a very narrow problem (securing transactions on a public network with potentially malicious actors), which is great, but it is not our kind of problem
MongoDB vs CouchDB Clusters
While CouchDB uses MVCC, MongoDB uses a mix of two-phase commit (for replicating data from primary to secondary nodes) and a Paxos-like algorithm (to elect a primary node in a replica set)
From the MongoDB 4.4 documentation: […]
The different choices of strategies explain the different cluster architectures of these two DBMSs
Why Document-oriented DBMS for Big data?
While Relational DBMSs are extremely good for ensuring consistency and availability, the normalization that lies at the heart of a relational database model implies fine-grained data, which are less conducive to partition-tolerance than coarse-grained data.
A typical contact database in a relational data model may include a person table, a telephone table, an email table and an address table, all linked to each other.
The same database in a document-oriented database would entail one document type only, with telephone numbers, email addresses, etc., nested as arrays in the same document.
Sharding
Sharding is the partitioning of a database horizontally, i.e. the database rows (or documents) are partitioned into subsets that are stored on different servers. Every subset of rows is called a shard.
Usually the number of shards is larger than the number of replicas, and the number of nodes is larger than the number of replicas (usually set to 3)
The main advantage of a sharded database lies in the improvement of performance through the distribution of computing load across nodes. In addition, it makes it easier to move data files around, e.g. when adding new nodes to the cluster
The number of shards that split a database dictates the (meaningful) number of nodes: the maximum number of nodes is equal to the number of shards (lest a node contain the same shard file twice)
There are different sharding strategies, most notably:
Hash sharding: to distribute rows evenly across the cluster
Range sharding: similar rows (say, tweets coming from the same area) are stored on the same shard
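The two strategies can be sketched as follows (the hash function, shard count and range boundaries are arbitrary choices for illustration):

```python
import hashlib

# Hash sharding: hash the key, take it modulo the number of shards q.
# Spreads rows evenly, but similar keys land on unrelated shards.
def hash_shard(key, q=4):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % q

# Range sharding: shard chosen by ordered key ranges, so similar keys
# (here, keys starting with nearby letters) stay together.
def range_shard(key, boundaries=("g", "n", "t")):
    for shard, upper in enumerate(boundaries):
        if key.lower() < upper:
            return shard
    return len(boundaries)

print(hash_shard("alice"), hash_shard("alicia"))    # may land on different shards
print(range_shard("alice"), range_shard("alicia"))  # 0 0: same range
```

Range sharding makes range queries cheap (one shard answers them) at the risk of hotspots; hash sharding balances load at the cost of scattering related rows.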
Replication and Sharding
Replication is the action of storing the same row (or document) on different nodes to make the database fault-tolerant.
Sharding is the partitioning of data into different buckets
Replication and sharding can be combined with the objective of maximizing availability while maintaining a minimum level of data safety
A bit of nomenclature (CouchDB-specific, but the concepts can be generalized to other systems):
n is the number of replicas (how many times the same data item is repeated across the cluster)
q is the number of shards (how many files a database is split into)
n * q is the total number of shard files distributed in the different nodes of the cluster
What Shards Look Like in CouchDB
Node 1:
|-- 00000000-1fffffff
|   `-- test.1520993373.couch
|-- 20000000-3fffffff
|   `-- test.1520993373.couch
|-- 60000000-7fffffff
|   `-- test.1520993373.couch
|-- 80000000-9fffffff
|   `-- test.1520993373.couch
|-- c0000000-dfffffff
|   `-- test.1520993373.couch
`-- e0000000-ffffffff
    `-- test.1520993373.couch
Node 2:
|-- 20000000-3fffffff
|   `-- test.1520993373.couch
|-- 40000000-5fffffff
|   `-- test.1520993373.couch
|-- 80000000-9fffffff
|   `-- test.1520993373.couch
|-- a0000000-bfffffff
|   `-- test.1520993373.couch
`-- e0000000-ffffffff
    `-- test.1520993373.couch
This is the content of the data/shards directory on each node of a three-node cluster
The test database has q=8 and n=2, hence 16 shard files in total
The *.couch files are the actual files where data are stored
The sub-directories are named after the document _id ranges
Node 3:
|-- 00000000-1fffffff
|   `-- test.1520993373.couch
|-- 40000000-5fffffff
|   `-- test.1520993373.couch
|-- 60000000-7fffffff
|   `-- test.1520993373.couch
|-- a0000000-bfffffff
|   `-- test.1520993373.couch
`-- c0000000-dfffffff
    `-- test.1520993373.couch
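A simplified stand-in for how a document _id is mapped to one of those range directories (CouchDB's actual hash function differs; only the range arithmetic is illustrated):

```python
import zlib

# The 32-bit hash space split into q = 8 equal ranges, matching the
# directory names shown above (00000000-1fffffff, 20000000-3fffffff, ...).
RANGES = [(i * 0x20000000, (i + 1) * 0x20000000 - 1) for i in range(8)]

def shard_range(doc_id):
    h = zlib.crc32(doc_id.encode())       # unsigned 32-bit hash (stand-in)
    for lo, hi in RANGES:
        if lo <= h <= hi:
            return f"{lo:08x}-{hi:08x}"   # the shard directory this _id lands in

print(shard_range("c43bcff2cdbb577d8ab2933cdc0011f8"))
```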
Partitions in CouchDB (not network partitions!)
A partition is a grouping of logically-related rows in the same shard (for instance, all the tweets of the same user)
Partitioning improves performance by restricting queries to a narrow set of documents within a single shard
To be effective, partitions have to be relatively small (certainly smaller than a shard)
A database has to be declared partitioned during its creation
Partitions are a new feature of CouchDB 3.x
MapReduce Algorithms
This family of algorithms, pioneered by Google, is particularly suited to parallel computing of the Single-Instruction, Multiple-Data type (see Flynn's taxonomy).
The first step (Map) distributes data across machines, while the second (Reduce) hierarchically summarizes them until the result is obtained.
In between reduce steps, shuffling of data across nodes may happen.
Apart from parallelism, its advantage lies in moving the processing to where the data are, greatly reducing network traffic.
Example (word count):
function map(document):
    for each word w in document:
        emit(w, 1)

function reduce(word, partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum = sum + pc
    emit(word, sum)
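The same algorithm as runnable Python, with an explicit shuffle phase between map and reduce (function names are mine):

```python
from collections import defaultdict

def map_phase(documents):
    # emit a (word, 1) pair for every word of every document
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # group all partial counts by word (this is the cross-node shuffle)
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # sum the partial counts of each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster each phase runs in parallel on different nodes; here the three functions simply run in sequence on one machine.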
Sorting Cards with a MapReduce-like Algorithm
Map cards by suit
Part 2: Introduction to CouchDB
Why Use CouchDB in This Course?
It is open-source, hence you can peruse the source code and see how things work
It has MapReduce queries, hence you can understand how this programming paradigm works
It is easy to setup a cluster
It has sharding, replication, and partitions (not to be confused with network partitions)
The HTTP API makes it easy to interact with it
CouchDB Main Features
The main features of CouchDB 3.x are:
Document-oriented DBMS, where documents are expressed in JavaScript Object Notation (JSON)
HTTP ReST API (more on ReST in later lectures!)
Web-based admin interface
Web-ready: since it talks HTTP and produces JSON (it can also produce HTML or XML), it can be both the data and logic tier of a three-tier application, hence avoiding the marshalling and unmarshalling of data objects
Support for MapReduce algorithms, including aggregation at different levels
JavaScript as the default data manipulation language
Full-text search
Support of the MongoDB query language (Mango)
Support of replication
Support of partitions
Support of sharding
Support of clusterized databases
Fauxton User Interface
Typing http://localhost:5984/_utils in a browser opens the Fauxton web interface, where one can:
Create/delete databases
Edit documents
Edit design documents
Run views (MapReduce)
Run Mango queries
Run full-text searches
Modify the configuration
Manage users
Set up replications
A CouchDB instance can host many databases; each database can have its own set of functions (grouped into design documents), and can be split into different shards
Adding and deleting a database is done through an HTTP call:
curl -X PUT http://localhost:5984/exampledb
curl -X DELETE http://localhost:5984/exampledb
Listing all databases of an instance is even simpler:
curl -X GET http://localhost:5984/_all_dbs
Every response's body is a JSON object:
["exampledb", "twitter", "instagram"]
In every CouchDB instance there are system databases. These are prefixed by an underscore, such as _users
Insertion and retrieval of documents
To insert a document:
curl -X POST http://localhost:5984/exampledb --header "Content-Type: application/json" --data '{"type": "account", "holder": "Alice", "initialbalance": 1000}'
Response: 201 (202 if fewer than the prescribed number of write operations were successfully performed)
{"ok": true, "id": "c43bcff2cdbb577d8ab2933cdc0011f8", "rev": "1-b8a039a8143b474b3601b389081a9eec"}
To retrieve a document:
curl -X GET http://localhost:5984/exampledb/c43bcff2cdbb577d8ab2933cdc0011f8
Response: 200
{"_id": "c43bcff2cdbb577d8ab2933cdc0011f8", "_rev": "1-b8a039a8143b474b3601b389081a9eec", "type": "account", "holder": "Alice", "initialbalance": 1000}
System Properties of Documents
_id: the ID of a single document, which can be set when the document is loaded; by default it is generated by CouchDB and guaranteed to be unique
_rev: the revision number of a document. It is guaranteed to be increasing per document, i.e. every database instance will pick up the same revision as the current version of the document
Request to name the ID of a new document (note PUT instead of POST):
curl -X PUT http://localhost:5984/exampledb/charlie --header "Content-Type: application/json" --data '{ ... }'
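The same HTTP calls can be issued from any HTTP client. A sketch using only the Python standard library (the document body is an assumed example; the code only builds the request here, since sending it with request.urlopen(req) needs a running CouchDB at localhost:5984):

```python
import json
from urllib import request

BASE = "http://localhost:5984/exampledb"

# Assumed example document; PUT to /exampledb/charlie lets us choose the _id.
doc = {"type": "account", "holder": "Charlie", "initialbalance": 100}
req = request.Request(
    BASE + "/charlie",
    data=json.dumps(doc).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
# PUT http://localhost:5984/exampledb/charlie
```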