
MySQL Group Replication:

'Synchronous',
multi-master,
auto-everything
Ulf Wendel, MySQL/Oracle

The speaker says...


MySQL 5.7 introduces a new kind of replication: MySQL Group Replication. At the time of writing (10/2014) MySQL Group Replication is available as a preview release on labs.mysql.com. In common user terms it features (virtually) synchronous, multi-master, auto-everything replication.

Proper wording...
An eager update everywhere system based
on the database state machine approach
atop a group communication system
offering virtual synchrony and
reliable total ordering messaging.
MySQL Group Replication offers
generalized snapshot isolation.

The speaker says...


And here is a more technical description....

WHAT ?!
Hmm, how does it compare?

The speaker says...


The technical description given for MySQL Group Replication may sound confusing because it mixes elements from distributed systems and database systems theory. Between roughly 1996 and 2006 the two research communities jointly formulated the replication method implemented by MySQL Group Replication.
As a web developer or MySQL DBA you are not expected to know distributed systems theory inside out. Yet to understand the properties of MySQL Group Replication and to get the most out of it, we'll have to touch on some of the concepts.
Let's see first how the new stuff compares to the existing MySQL solutions.

Goals of distributed databases


Availability
  Cluster as a whole unaffected by loss of nodes
Scalability
  Geographic distribution
  Scale size in terms of users and data
  Database specific: read and/or write load
Distribution Transparency
  Access, Location, Migration, Relocation (while in use)
  Replication
  Concurrency, Failure

The speaker says...


MySQL Group Replication is about building a
distributed database. To catalog it and compare it
with the existing MySQL solutions in this area, we
can ask what the goals of distributed databases are.
The goals lead to some criteria that are used to give a first, brief overview.
Goal: a distributed database cluster strives for
maximum availability and scalability while
maintaining distribution transparency.
Criteria: availability, scalability, distribution
transparency.

MySQL clustering cheat sheet

Availability:
  MySQL Replication: Primary = SpoF, no auto failover
  MySQL Cluster: Shared nothing, auto failover
  MySQL Fabric: SpoF monitored, auto failover
Scalability:
  MySQL Replication: Reads
  MySQL Cluster: Partial replication, node limit
  MySQL Fabric: Partial replication, no node limit
Scale on WAN:
  MySQL Replication: Asynchronous
  MySQL Cluster: Synchronous (WAN option)
  MySQL Fabric: Asynchronous (depends)
Distribution Transparency:
  MySQL Replication: R/W splitting
  MySQL Cluster: SQL: yes (low level: no)
  MySQL Fabric: Special clients, no distributed queries

The speaker says...


Already today MySQL has three solutions to build a
distributed MySQL cluster: MySQL Replication, MySQL
Cluster and MySQL Fabric. Each system has different
optimizations, none can achieve all the goals of a
distributed cluster at once. Some goals are
orthogonal.
Take MySQL Cluster. MySQL Cluster is a shared
nothing system. Data storage is redundant, nodes fail
independently. Transparent sharding (partial
replication) ensures read and write scalability until
the maximum number of nodes is reached. Great for
clients: any SQL node runs any SQL, synchronous
updates become visible immediately everywhere.
But, it won't scale on slow WAN connections.

How Group Replication fits in

(The slide also lists the MySQL Replication and MySQL Fabric columns for comparison.)

Availability:
  MySQL Cluster: Shared nothing, auto failover
  Group Replication: Shared nothing, auto failover/join
Scalability:
  MySQL Cluster: Partial replication, node limit
  Group Replication: Full replication, read and some write scalability
Scale on WAN:
  MySQL Cluster: Synchronous (WAN option)
  Group Replication: (Virtually) Synchronous
Distribution Transparency:
  MySQL Cluster: SQL: yes (low level: no)
  Group Replication: All nodes run all SQL

The speaker says...


MySQL Group Replication has many of the desirable properties of MySQL Cluster. It's strong on availability and client-friendly due to the distribution transparency. No complex client or application logic is required to use the cluster. So, how do the two differ?
Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB storage engine. InnoDB is the dominant storage engine for web applications. This makes MySQL Group Replication a very attractive choice for small clusters (3-7 nodes) running Drupal or WordPress in LAN settings! Also, Group Replication is not synchronous in a strict technical sense, but for practical purposes it is.

Group Replication (vs. Cluster)


Availability
  Nodes fail independently
  Cluster continues operation in case of node failures
Scalability
  Geographic distribution: n/a, needs fast messaging
  All nodes accept writes, mild write scalability
  All nodes accept reads, full read scalability
Distribution Transparency
  Full replication: all nodes have all the data
  Fail stop model: developer freed from worrying about consistency

The speaker says...


Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) built in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only to a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes, but each write must be coordinated with every other node. Time to dig deeper >:).

Eager update everywhere... ?!

A developer's categorization by two questions: where are transactions run (Primary Copy vs. Update Everywhere), and when does synchronization happen (Eager vs. Lazy)?

Eager + Primary Copy: (MySQL semi-sync Replication)
Eager + Update Everywhere: MySQL Cluster, MySQL Group Replication, 3rd party: Galera
Lazy + Primary Copy: MySQL Replication/Fabric, 3rd party: Tungsten
Lazy + Update Everywhere: MySQL Cluster Replication

The speaker says...


I've described MySQL Group Replication as an eager update everywhere system. The term comes from a categorization of database replication systems by two questions:
- where can a transaction be run?
- when are transactions synchronized between nodes?
The answers to these questions tell a developer which challenges to expect. They determine which additional tasks an application must handle when it's run on a cluster instead of a single server.

Lazy causes work...


(Diagram: a client sets price = 1.23 on one node; while the change is still being replicated, the other nodes report price = 1.00, price = 1.23 and price = 0.98.)

The speaker says...


When you try to scale an application by running it on a lazy (asynchronous) replication cluster instead of a single server, you will soon have users complaining about outdated and incorrect data. Depending on which node the application connects to after a write, a user may or may not see his own updates. This can happen neither on a single-server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer.
BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.

Primary Copy causes work...


(Diagram: writes go to the primary only; the copies receive the updates from the primary and serve reads.)

The speaker says...


Judging from the developer perspective only, primary copy is an undesirable replication solution. In a
primary copy system only one node accepts writes.
The other nodes copy the updates performed on the
primary. Because of the read-write splitting, the
replication system does not need to coordinate
conflicting operations. Great for the replication
system author, bad for the developer. As a developer
you must ensure that all write operations are
directed to the primary node... Again, have a look at
PECL/mysqlnd_ms.
MySQL Replication follows this approach. Worse,
MySQL Replication is a lazy primary copy system.

Love: Eager Update Everywhere


(Diagram: every node accepts both reads and writes; after a write, price = 1.23 is visible on every node.)

The speaker says...


From a developer perspective an eager update anywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. (Sometimes synchronization happens as part of the commit; then strong consistency can be achieved.) Fault tolerance is better than with Primary Copy. There is no single point of failure, a single primary, that could cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down.

HOW? Distributed + DB?


Database state machine?

The speaker says...


In the mid-1990s two observations made the database and distributed systems theory communities wonder whether they could develop a joint replication approach.
First, Gray et al. (database community) showed that common two-phase locking has an expected deadlock rate that grows with the third power of the number of replicas.
Second, Schiper and Raynal noted that transactions have common properties with group communication principles (distributed systems) such as ordering, agreement/'all-or-nothing' and even durability.

Three building blocks


State machine replication
  trivial to understand
Atomic Broadcast
  database meets distributed systems community
  OMG, how easy state machine replication is to implement!
Deferred Update Database Replication
  database meets distributed systems community
  how we gain high availability and high performance
  what those MySQL Replication team blogs talk about ;-)

The speaker says...


Finally, in 1999 Pedone, Guerraoui and Schiper published the paper The Database State Machine Approach. The paper combines two well-known building blocks for replication with a messaging primitive common in the distributed systems world: atomic broadcast.
MySQL Group Replication is slightly different from this 1999 version, following more closely a later refinement from 2005 plus a bit of additional ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.

State machine replication


(Diagram: the same input, Set A = 1, goes to three replicas; each produces the same output, A = 1.)

The speaker says...


The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they will all produce the same output.
As an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations, so active replication adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (the primary) executes the operations and forwards the results to the others. Passive replication is limited by the primary's availability and possibly by network bandwidth.
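
To make this concrete, here is a minimal Python sketch (illustration only, not MySQL code; the Replica class and the operation format are invented): deterministic replicas fed the same ordered input end up in the same state.

```python
# Minimal sketch (not MySQL code): deterministic state machine replicas.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, op):
        # Deterministic operation, e.g. ("set", "A", 1).
        kind, key, value = op
        if kind == "set":
            self.state[key] = value

ordered_input = [("set", "A", 1), ("set", "B", 1)]

replicas = [Replica() for _ in range(3)]
for op in ordered_input:            # same input, same order, on every replica
    for replica in replicas:
        replica.apply(op)

# All replicas produced the same output/state.
assert all(r.state == {"A": 1, "B": 1} for r in replicas)
```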

Requirement: Agreement
(Diagram: the input Set A = 1 does not reach all replicas; a replica that misses it ends up with A = NULL while another outputs A = 1.)

The speaker says...


Here's more trivia about the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to arrive at the same output. And the precondition for receiving input is that the replica is still alive.
In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.

Requirement: Order
(Diagram: the operations 1) Set A = 1, 2) Set B = 1, 3) Set B = A * 2 are delivered as 1, 2, 3 to one replica, as 1, 3, 2 to another and as 3, 1, 2 to a third; all replicas end with A = 1, but B = 2 on the first and B = 1 on the other two.)

The speaker says...


The second trivial requirement for state machine replication is ordering. To produce the same output any two state machines must execute the very same input including the ordering of input operations.
The academic wording goes: if a replica processes request r1 before r2, then no replica processes request r2 before r1. Note that if operations commute, some reordering may still lead to correct output. The sequence A = 1, B = 1, B = A * 2 and the sequence B = 1, A = 1, B = A * 2 produce the same output.
(Unrelated here: the database scaling talk touches on the fancy commutative replicated data types found in Riak.)
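
A tiny Python sketch of the ordering requirement, reusing the slide's three operations (illustration only): because the operations do not commute, different delivery orders leave B with different values.

```python
# Toy illustration (not MySQL code) of why ordering matters: the same three
# operations applied in different orders diverge because they do not commute.

def run(ops):
    state = {"A": 0, "B": 0}
    for op in ops:
        op(state)
    return state

set_a = lambda s: s.update(A=1)                    # Set A = 1
set_b = lambda s: s.update(B=1)                    # Set B = 1
set_b_from_a = lambda s: s.update(B=s["A"] * 2)    # Set B = A * 2

print(run([set_a, set_b, set_b_from_a]))   # {'A': 1, 'B': 2}
print(run([set_a, set_b_from_a, set_b]))   # {'A': 1, 'B': 1} -> replicas diverge
```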

Atomic Broadcast

Distributed systems messaging abstraction
Meets all replicated state machine requirements
  Agreement
    If a site delivers a message m then every site delivers m
  Order
    No two sites deliver any two messages in different orders
  Termination
    If a site broadcasts message m and does not fail, then every site eventually delivers m
    We need this in asynchronous environments

The speaker says...


State machine replication is the first building block
for understanding the database state machine
approach. The second building block is a messaging
abstraction from the distributed systems world called
atomic broadcast. Atomic broadcast provides all the
properties required for state machine replication:
agreement and ordering. It adds a property needed
for communication in an asynchronous system, such
as a system communicating via network messages:
termination.
All in all, this greatly simplifies state machine
replication and contributes to a simple, layered
design.
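
As a rough illustration of the agreement and order properties, here is a toy total order broadcast built around a fixed sequencer (Python, invented names; a real group communication system such as Corosync uses far more elaborate protocols and also deals with failures and termination).

```python
# Toy fixed-sequencer total order broadcast (illustration only).

class Member:
    def __init__(self, name):
        self.name = name
        self.delivered = []

    def deliver(self, seq, msg):
        self.delivered.append((seq, msg))

class Sequencer:
    """Assigns a global sequence number to every broadcast message."""
    def __init__(self, members):
        self.members = members
        self.next_seq = 1

    def broadcast(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        for member in self.members:     # agreement: every member gets the message
            member.deliver(seq, msg)    # order: all members see the same sequence

members = [Member("r1"), Member("r2"), Member("r3")]
gcs = Sequencer(members)
gcs.broadcast("Set A = 1")
gcs.broadcast("Set B = 1")
assert members[0].delivered == members[1].delivered == members[2].delivered
```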

Delivery, durability, group


(Diagram: a client and a replica group; an outsider, Mr. X, is not a group member and cannot exchange messages with it. Send first, possibly delivered second.)

The speaker says...


The atomic broadcast properties given are literally copied from the original paper describing the database state machine replication approach. There are two things in it not explained yet. First, atomic broadcast defines its properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport, it also covers message loss (MySQL desires uniform agreement here, something stronger than what Corosync provides) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues. Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.

Deferred Update: the best?

(Diagram: generic functional model of replication across a client and a set of replicas, with the phases Client Request, Server Coordination, Execution, Agreement and Client Response.)

The speaker says...


We are almost there. The third building block of database state machine replication is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update. The argument goes that deferred update combines the best of the two most prominent object replication techniques: active and passive replication. Only the combination of the best of the two gives both high availability and high performance.
Translation: MySQL Group Replication can, in theory, have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA...

Active Replication (SM)

(Diagram: the client sends the operation to all replicas, the requests get ordered, every replica executes the operation, and all replicas reply to the client.)

The speaker says...


In an active replication system, a pure state machine replication system, the client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure that all replicas execute requests in the same order and that all replicas decide the same. Recall that we are talking about multi-threaded database servers here.
A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.

Passive Replication

(Diagram: the client sends the operation to the primary, only the primary executes it, the primary forwards the changes to the backups and replies to the client.)

The speaker says...


The alternative is passive replication or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computing the result, the primary forwards the changes to the backups, which apply them.
The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.

Multi-primary (pass.) replication


What we want...
  for performance: more than one primary
  for scalability: no distributed locking
  ... and of course: transactions
Two-staged transaction protocol

(Diagram: the client runs a transaction against one primary during transaction processing; the primaries then jointly terminate the transaction during transaction termination.)

The speaker says...


Multi-primary (passive) replication has all the ingredients desired.
Transaction processing is two-staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally; this stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort. Because updates are not immediately applied, database folks call this deferred update, our last building block.
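
A minimal Python sketch of the two-staged flow (illustration only, not MySQL code; class and method names are invented, and certification is left as a placeholder for the next slides): the transaction executes and buffers its writes on one replica, then its readset/writeset is delivered to every replica, which certifies and applies it.

```python
# Sketch of the two-staged deferred update flow (illustration only).

class Transaction:
    def __init__(self, tx_id):
        self.tx_id = tx_id
        self.readset = set()
        self.writeset = {}

    def read(self, replica, key):
        self.readset.add(key)
        return replica.data.get(key)

    def write(self, key, value):
        self.writeset[key] = value      # deferred: nothing is applied yet

class Replica:
    def __init__(self):
        self.data = {}

    def on_deliver(self, tx):
        # Called on every replica in the same total order (stage 2).
        if self.certify(tx):
            for key, value in tx.writeset.items():
                self.data[key] = value

    def certify(self, tx):
        return True                     # placeholder; conflict checks come next

# Stage 1 on one replica, stage 2 on all of them:
replicas = [Replica(), Replica(), Replica()]
tx = Transaction(1)
tx.read(replicas[0], "price")
tx.write("price", 1.23)
for replica in replicas:                # stands in for the atomic broadcast
    replica.on_deliver(tx)
assert all(r.data["price"] == 1.23 for r in replicas)
```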

Deferred Update DB Replication


Deterministic certification
  Reads execute locally, updates get certified
  Certification ensures transaction serializability
  Replicas decide independently about the certification result

(Diagram: reads and writes go to one primary; at termination the readset, writeset and updates (Rs/Ws/U) are sent to the other primaries.)

The speaker says...


One property of transactions is isolation. Isolation is also known as serializability: the concurrent execution of transactions should be equivalent to a serial execution of the same transactions. In a deferred update system, read transactions are processed and terminated on one replica and serialized locally. Updates must be certified. After transaction processing, the readset, writeset and updates are sent to all other replicas. The servers then decide in a deterministic procedure whether (one-copy) serializability holds, i.e. whether the transaction commits. Because it's a deterministic procedure, the servers can certify transactions independently!

Options for termination


Atomic Broadcast based
  this is what is used, by MySQL, by DBSM
  Optimization: Reordering (atop of Atomic Broadcast)
    in theory it means fewer transaction aborts
  Optimization limit: Generic Broadcast based
    this has issues, which make it nasty
Atomic Commit based
  more transaction aborts than Atomic Broadcast

The speaker says...


There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So it's out, and atomic broadcast remains.
Atomic broadcast can in theory be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.

Generic certification test


Transactions have a state
  Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas
  Readset and writeset are forwarded
On each replica: search for 'conflicting' transactions
  Can be serialized with all previous transactions? Commit!
  Commit? Abort local transactions that overlap with the update

The speaker says...


No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification. The data read and the data written by a transaction is forwarded to all other replicas.
Every replica searches for potentially 'conflicting' transactions; the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with respect to all previous transactions. Any local transaction currently running and conflicting with the update is aborted.
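
A sketch of such a certification test in Python (illustration only; the function name is invented, and the MySQL preview uses the simpler version-based test shown later): the committing transaction's readset is checked against the writesets of concurrent, already committed transactions.

```python
# Sketch of a generic write-read conflict certification test (illustration only).

def certify(committing_readset, concurrent_committed_writesets):
    for writeset in concurrent_committed_writesets:
        if committing_readset & set(writeset):   # write-read conflict
            return False                          # abort
    return True                                   # commit

# The committing transaction read 'price' and 'stock'.
print(certify({"price", "stock"}, [{"price": 1.23}]))  # False -> abort
print(certify({"price", "stock"}, [{"name": "foo"}]))  # True  -> commit
```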

Database State Machine


Deferred Update Database Replication as a state machine
Atomic Broadcast based termination

(Diagram: layered architecture. The MySQL server exposes plugin services and transaction hooks to plugins; the MySQL Group Replication plugin (Capture, Apply, Recover) sits on top of the replication protocol, including the termination protocol/certifier, which in turn sits on top of the Group Communication System.)

The speaker says...


The Database State Machine Approach combines all the bits and pieces. Let's do a bottom-up summary. Atomic broadcast not only frees the database developer from bothering about networking APIs, it also solves the nasty bits of communicating in an asynchronous network. It provides properties that meet the requirements of state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.

The termination algorithm


Updates are sent to all replicas
  Readset and writeset are forwarded
Step 1 - On each replica: certify
  Is there any committed transaction that conflicts?
  (In the original paper: check for write-read conflicts between the committing transaction and committed transactions. Does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.)
Step 2 - On each replica: commitment
  Apply transactions decided to commit
  Handle concurrent local transactions: remote wins

The speaker says...


The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication uses a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins, no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster.
The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any!? Not any... only concurrent ones.

What's concurrent?

Any other transaction that precedes the current one
  Recall: total ordering
  Recall: asynchronous, delay between broadcast and delivery

(Diagram: two replicas broadcast transactions 1 and 2; because of the delay between broadcast and delivery, their place in the total order is only fixed at delivery.)

The speaker says...


The definition of what concurrent means is a bit tricky. It's defined through a negation, and that's confusing at first look, but hopefully becomes clear on the next slide.
Concurrent to a transaction is any other transaction that precedes it. If we know the order of all transactions in the entire cluster, then we can tell which transactions precede one another.
Atomic broadcast ensures total order on delivery. Some implementations decide on ordering when sending, and that number (a logical clock) could be used. Any logical clock works.

Certify against all previous?

Broadcast: Transaction 4 is based on all previous transactions up to 2: Transaction(2)

(Diagram: a replica broadcasts the transaction; in the total order it is delivered as number 4.)

Certification when 4 is delivered:
  Check conflicts with trx > 2 and trx < 4

The speaker says...


The slide has an example of how to find any other transaction that precedes one. When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica.
Eventually the transaction is delivered on all sites. Upon delivery the certification considers all transactions that happened after the logical time of the to-be-certified transaction. All those transactions precede the one to be certified; they executed concurrently at different replicas. We don't have to look further into the past. Further in the past is stuff that's been decided on already.
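
A Python sketch of how this precedence window could be derived from the total order (illustration only; the names are invented): only transactions delivered between the broadcast's snapshot position and the delivery position have to be checked.

```python
# Sketch (not MySQL code): deriving the set of concurrent transactions from the
# total order. The broadcast carries the logical time of the last transaction
# the originating replica had committed.

def concurrent_writesets(committed_log, snapshot_order, delivery_order):
    """committed_log maps total order number -> writeset of that transaction."""
    return [
        writeset
        for order, writeset in sorted(committed_log.items())
        if snapshot_order < order < delivery_order
    ]

committed_log = {1: {"a"}, 2: {"b"}, 3: {"price"}}
# Transaction 4 was executed on top of everything up to order 2 and is
# delivered as number 4 in the total order:
print(concurrent_writesets(committed_log, snapshot_order=2, delivery_order=4))
# [{'price'}] -> only transaction 3 needs to be checked for conflicts
```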

TIME TO BREATHE
MySQL is different anyway...

The speaker says...


Good news! The algorithm used by MySQL Group
Replication is different and simpler. For correctness,
the precedes relation is still relevant. But it comes
for free...

A developer's view on commit

(Diagram: the client runs BEGIN ... COMMIT against one replica, which executes and certifies the transaction and returns the result; the other replicas certify and apply the same transaction after delivery.)

The speaker says...


We are not done with the theory yet, but let's do some slides that take the developer's perspective. Assuming you have to scale a PHP application, assuming a small cluster of a handful of MySQL servers is enough, and assuming these servers are co-located on racks, then MySQL Group Replication is your best possible choice.
Did you get this from the theory? Replication is 'synchronous'. On commit you wait only for the server you are connected to. Once your transaction is broadcast, you are done. You don't wait for the other servers to execute the transaction. With uniform atomic broadcast, once your transaction is broadcast, it cannot get lost. (That's why I torture you with theory.)

MySQL Replication

(Diagram: the client runs BEGIN ... COMMIT against the master, which executes, writes the binary log and returns OK; a slave asynchronously fetches and applies the changes.)

The speaker says...


If your network is slow, or mother earth, the speed of light and network message round-trip time add too much to your transaction execution time, then asynchronous MySQL Replication is a better choice. In MySQL Replication the master (primary) never waits for the network, not even to broadcast updates. Slaves asynchronously pull changes. Despite pushing work onto the developer, this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.

MySQL Semi-sync Replication

(Diagram: the client runs BEGIN ... COMMIT against the master, which executes, writes the binary log and waits for the first slave ACK before returning OK; the slaves fetch and apply the changes.)

The speaker says...


In the times of MySQL 5.0 the MySQL community suggested that, to avoid transaction loss, the master should wait for one slave to acknowledge that it has fetched the update from the master. The fact that it's been fetched does not mean that it's been applied. The update may not be visible to clients yet.
It is a back and forth whether database replication should be asynchronous or not. It depends on your needs.
Back to theory after this break.

Back to theory!
Virtual Synchrony?

Virtual Synchrony

Groups and views
A turbo-charged version of Atomic Broadcast

(Diagram: processes P1-P3 exchange messages M1-M3 in group/view G1 = {P1, P2, P3}; after P4 joins through a view change (VC), message M4 is exchanged in G2 = {P1, P2, P3, P4}.)

The speaker says...


Good news! Virtual Synchrony and Atomic Broadcast are the same. Our Atomic Broadcast definition assumes a static group. Adding group members, removing members or detecting failed ones is not covered.
Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event.
(The term 'virtual': it's not synchronous. There is a delay, and we don't want to wait out short message delays. Yet, the system appears to be synchronous to most real-life observers.)

Virtual Synchrony

View changes act as a message barrier
That's a case causing trouble in Two-Phase Commit

(Diagram: processes P1-P4 exchange messages M5-M8; a view change (VC) switches the group from G2 = {P1, P2, P3, P4} to G3 = {P1, P2, P3} and acts as a barrier between the messages.)

The speaker says...


View changes are message barriers. If the group members suspect a member to have failed, they install a new view.
Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period. False alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that these updates will not be seen by the remaining members. Furthermore, the former member will realize that it was excluded.
Some GCS implementing virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it continues.

Auto-everything: failover

MySQL Group Replication has a pluggable GCS API
  Split brain handling? Depends on GCS and/or GCS config
  Default GCS is Corosync

(Diagram: a group of MySQL servers.)

The speaker says...


Good news! The Virtual Synchrony group membership advantages are fully exposed at the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you on the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon.
MySQL Group Replication works with any Group Communication System that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows the view change notifications of the GCS.

Auto-everything: joining

Elastic cluster grows and shrinks on demand
  State transfer done via an asynchronous replication channel

(Diagram: a joiner catches up by receiving a state transfer from a donor inside the group of MySQL servers.)

The speaker says...


Good news! When adding a server you don't fiddle
with the very details. You start the server, tell it to
join the cluster and wait for it to catch up. The server
picks a donor, begins fetching updates using much of
the existing MySQL Replication code infrastructure
and that's it.

Back to theory!
Generalized Snapshot Isolation

Deferred Update tweak


Transaction read set does not need to be broadcasted
  Readset is hard to extract and can be huge
Weaker serializability level than 1SR
  Sufficient for InnoDB default isolation

(Diagram: reads and writes go to one primary; at termination only V/Ws/U are sent to the other primaries.)

The speaker says...


Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler-to-implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot isolation, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readset of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge.
Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever, this is the base algorithm for MySQL Group Replication's termination.

Snapshot Isolation

Concurrent and write conflict? First committer wins!
Reads use the snapshot from the beginning of the transaction

First committer:
  T1: BEGIN(v1), W(v1, x=1), COMMIT! -> x:v2=1
  T2: BEGIN(v1), W(v1, x=2), ..., COMMIT?
  Concurrent write (version 1), conflict (both change x)

The speaker says...


In Snapshot Isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item, they conflict. In case of conflict, the first committer wins.
MVCC requires that, as part of the update of a data item, its version is incremented. Future transactions will base their snapshot on the new version.
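
A toy Python sketch of snapshot isolation with MVCC (illustration only, not InnoDB internals; all names are invented): each transaction reads from the snapshot taken at BEGIN, and of two concurrent writers of the same item the first committer wins.

```python
# Toy MVCC snapshot isolation sketch (illustration only).

class Store:
    def __init__(self):
        self.version = 1
        self.data = {"x": (1, 0)}            # item -> (version, value)

class Tx:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)     # snapshot taken at BEGIN
        self.writes = {}

    def read(self, key):
        return self.snapshot[key][1]         # reads never see later commits

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.writes:
            if self.store.data[key][0] != self.snapshot[key][0]:
                return False                 # someone committed first -> abort
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = (self.store.version, value)
        return True

store = Store()
t1, t2 = Tx(store), Tx(store)                # both start on version 1
t1.write("x", 1)
t2.write("x", 2)
print(t1.commit())   # True  -> first committer wins
print(t2.commit())   # False -> write conflict, t2 must abort
print(t2.read("x"))  # still the value from t2's snapshot
```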

The actual termination protocol

(Diagram: a replica broadcasts Write(v2, x=1); during certification every replica compares the write's version against the latest version of the object recorded in its certification index and answers OK or abort.)

The speaker says...


Every replica checks the version of a write during certification. It compares the write's data item version number with the latest one it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than the one in the certification index must abort.
(The certification index fills up over time and is truncated periodically by MySQL. MySQL reports its size through Performance Schema tables.)
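
A Python sketch of the version comparison described above (illustration only; the names are invented and the real MySQL implementation differs in detail): the certification index keeps the latest known version per item, and the first committer wins.

```python
# Sketch of version-based certification against a certification index
# (illustration only).

def certify_write(cert_index, item, based_on_version):
    latest = cert_index.get(item, 0)
    if based_on_version >= latest:            # no newer committed version exists
        cert_index[item] = based_on_version + 1
        return True                           # commit
    return False                              # someone else committed first, abort

cert_index = {"x": 2}
print(certify_write(cert_index, "x", based_on_version=2))  # True  -> commit
print(certify_write(cert_index, "x", based_on_version=2))  # False -> abort
```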

Hmm...
Does it work?

It's a preview, there are limits


General
  InnoDB only
  Corosync lacks uniform agreement
  No rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!)
  Isolation level
  Primary Key based
  Foreign Keys and Unique Keys not supported yet
  No concurrent DDL

That's it, folks!


Questions?

The speaker says...


(Oh, a question. Flips slide)

Network messages pffft!

MySQL super hero at Facebook


@markcallaghan Sep 30
For MySQL sync replication, when all commits originate from 1
master is there 1 network round trip or 2?
http://mysqlhighavailability.com/mysql-group-replication-helloworld
@Ulf_Wendel
@markcallaghan AFAIK, on the logical level, there should be
one. Some of your questions might depend on the GCS used.
The GCS is pluggable
@markcallaghan
@Ulf_Wendel @h_ingo Henrik tells me it is "certification based"
so I remain confused

GCS != MySQL Semi-sync


It's many round trips, how many depends on the GCS
  Default GCS is Corosync, Corosync is Totem Ring based
  Corosync uses a privilege-based approach for total ordering
  Many options: fixed sequencer, moving sequencer, ...
Where you run your updates only impacts the collision rate

(Diagram: each MySQL server sits on top of a local Corosync instance.)

The speaker says...


No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, and Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all-or-nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips is that? Welcome to distributed systems...

THE END

Contact: ulf.wendel@oracle.com

The speaker says...


Thank you for your attendance!
Upcoming shows:
Talk&Show! - YourPlace, any time
