A Beginner's Guide to Paxos

Dagang Wei, 2016-07-05


What is distributed consensus?

[Figure: three processes, each holding x=3]

Agreement among a group of processes on a single value.


Why does consensus matter?

We want to avoid a single point of failure by making a reliable logical component out of
unreliable subcomponents. This is one of the fundamental problems in distributed systems:
reliability.

But the hard part is synchronization among the subcomponents. A consensus protocol solves this
problem.

[Figure: three subcomponents, each holding x=3]

What is a consensus protocol?

[Figure: before, processes propose x=2, x=5, and x=3; after the consensus protocol runs, they hold x=..., x=3, and x=3]

Initially, processes may propose different values. A consensus protocol drives them towards an agreement
by defining how processes interact with each other. Processes are not competing but collaborating: they are
happy with any of the proposals. A minority of the processes may fail and restart.

Properties of consensus protocols

Non-triviality: Only proposed values can be chosen.

Safety: At most one value can be chosen.

Liveness: Eventually a value will be chosen if sufficient processes remain non-faulty.


Applications - state machine replication

[Figure: replicas of a state machine, each with its State, Inputs, and Outputs]

Applications - distributed key-value store

[Figure: three replicas, each holding x=.... A client sends "Set x = 123" at 8:00 AM and receives
SUCCESS. A client sends "Get x" at 8:01 AM and receives 123.]

Applications - distributed sequence number generator

[Figure: three replicas, each holding seq=.... One "Generate next seq num" request returns 236,
the next one returns 237.]

Separating the roles of Proposer and Acceptor

[Figure: a Process containing a Proposer and an Acceptor]

- Proposers propose values, acceptors keep track of states
- A process may play one or both roles
- One or multiple proposers propose values, all acceptors accept values
- If a value is accepted by a quorum of acceptors, it's chosen.

Quorum as a logical unit of acceptor for the Choose operation

Acceptors are collected into groups called quorums, e.g., {A, B}, {B, C}, {A, C} are 3 quorums of a cluster of 3
acceptors {A, B, C}.

A proposer arbitrarily chooses a quorum of acceptors to communicate with. Any message sent to an acceptor must
be sent to a quorum of acceptors, and a message received from an acceptor is ignored unless a copy is received
from each acceptor of a quorum. This essentially makes a quorum a logical unit of acceptor for the Choose
operation.

A value is chosen when it is accepted by a quorum.
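
The quorum rule can be sketched in a few lines of Python. This is only an illustration: it assumes that quorums are bare majorities (which matches the {A, B}, {B, C}, {A, C} example), and the helper names `majority_quorums` and `is_chosen` are made up for this sketch.

```python
from itertools import combinations


def majority_quorums(acceptors):
    """Return every subset of acceptors that forms a bare majority."""
    size = len(acceptors) // 2 + 1
    return [frozenset(q) for q in combinations(sorted(acceptors), size)]


def is_chosen(value, accepted, quorums):
    """A value is chosen when every member of some quorum has accepted it."""
    holders = {a for a, v in accepted.items() if v == value}
    return any(q <= holders for q in quorums)


quorums = majority_quorums({"A", "B", "C"})
# Any two majority quorums share at least one acceptor.
all_overlap = all(q1 & q2 for q1 in quorums for q2 in quorums)

accepted = {"A": 3, "B": 3, "C": None}  # hypothetical acceptor states
print(sorted(sorted(q) for q in quorums))  # [['A', 'B'], ['A', 'C'], ['B', 'C']]
print(all_overlap, is_chosen(3, accepted, quorums))  # True True
```

The overlap check is the property the rest of the protocol relies on: whichever quorum a later proposer talks to, it contains at least one acceptor from the quorum that chose the value.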


Simple scenario - no concurrent proposals

[Figure: a single proposer sends "Propose x = 3"; the three acceptors hold x=3, x=3, and x = ...]


Another simple scenario - concurrent proposals of the same value

This is a nice property that we can leverage. If a proposer knows the proposal of another
proposer, they can converge in order to reach consensus faster.

[Figure: two proposers both send "Propose x = 3"; the three acceptors hold x=3, x=3, and x=3]


Complex scenario - concurrent proposals of different values

How do we ensure the safety and liveness properties in the face of concurrent proposers?

[Figure: one proposer sends "Propose x = 3", another sends "Propose x = 4"; the three acceptors
hold x=3, x = ..., and x=4]

Consensus is really about dealing with concurrency!


How do we usually deal with concurrency? Locking!

Acceptors:
- each maintains a lock
- only accepts the proposal from the lock holder
- once a proposal has been accepted, rejects any further proposals

Proposers:
- acquire a quorum of locks before proposing

[Figure: a proposer sends "Propose x = 3"; the three acceptors hold x=3, x=3, and x = ...]
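
The locking rules above could be sketched as follows. This is a toy, single-threaded model: the class `LockingAcceptor` and the function `propose_with_locks` are hypothetical names, the quorum is assumed to be a majority, and there is deliberately no way to release a lock.

```python
class LockingAcceptor:
    """Acceptor that grants one lock and accepts only from the lock holder."""

    def __init__(self):
        self.lock_holder = None
        self.accepted_value = None

    def acquire(self, proposer):
        # Grant the lock only if it is free or already held by this proposer.
        if self.lock_holder in (None, proposer):
            self.lock_holder = proposer
            return True
        return False

    def propose(self, proposer, value):
        # Accept only from the lock holder, and only once.
        if self.lock_holder == proposer and self.accepted_value is None:
            self.accepted_value = value
            return True
        return False


def propose_with_locks(proposer, value, acceptors):
    """Acquire a quorum of locks, then propose to the locked acceptors."""
    quorum_size = len(acceptors) // 2 + 1
    locked = []
    for acceptor in acceptors:
        if len(locked) < quorum_size and acceptor.acquire(proposer):
            locked.append(acceptor)
    if len(locked) < quorum_size:
        return False  # could not lock a quorum
    return all(a.propose(proposer, value) for a in locked)


acceptors = [LockingAcceptor() for _ in range(3)]
first = propose_with_locks("p1", 3, acceptors)
second = propose_with_locks("p2", 4, acceptors)
print(first, second, [a.accepted_value for a in acceptors])  # True False [3, 3, None]
```

Because nothing ever releases a lock in this sketch, a proposer that stops after `acquire` would block everyone else.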

What if the lock holder fails before proposing?

It blocks other proposers from proposing values. This is unacceptable as it breaks the
liveness property.

[Figure: three acceptors, all still at x = ...]


Dealing with lock holder failure

Key observations:
- We must avoid permanent locking (liveness property) and let other proposers continue the
  work.
- If a new proposer sees that a value has been accepted, it must propose the same value (safety
  property), irrespective of its original value. In the example, the red or green proposers must
  continue to propose x = 3. This rule applies irrespective of whether the original proposer is
  still alive or not.

[Figure: three acceptors, shown as x = ..., x = ..., x = ...]

Lock alternative - Optimistic concurrency control
"While running, transactions use data resources without acquiring locks on those resources. Before
committing, each transaction verifies that no other transaction has modified the data it has read. If
the check reveals conflicting modifications, the committing transaction rolls back and can be
restarted." -- Wikipedia "Optimistic concurrency control"

Optimistic concurrency control variant 1 - version number generated by the resource

Read, Update, Write

1. Updater to resource: What is your current version number and value?
2. Resource to updater: Current (version number, value)
3. Updater: Save the current version number as the base version number, compute a new value
   based on the current value.
4. Updater to resource: Update (base version number, new value)
5. Resource: Update the value and increase the version number iff the base version number
   matches the current version number; deny otherwise.
6. Resource to updater: SUCCESS / DENIAL
7. Updater: If denied, restart from step 1.
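
A minimal in-memory sketch of variant 1 might look like this. The names `VersionedResource` and `read_update_write`, and the `max_attempts` limit, are assumptions made for the illustration. The step numbers in the comments refer to the seven steps of this slide.

```python
class VersionedResource:
    """Resource that owns the version number (variant 1)."""

    def __init__(self, value):
        self.version = 0
        self.value = value

    def read(self):
        # Steps 1-2: report the current (version number, value).
        return self.version, self.value

    def update(self, base_version, new_value):
        # Step 5: apply the write iff the base version is still current.
        if base_version != self.version:
            return False  # DENIAL
        self.value = new_value
        self.version += 1
        return True  # SUCCESS


def read_update_write(resource, compute, max_attempts=10):
    """Repeat the read-update-write cycle until the write succeeds."""
    for _ in range(max_attempts):
        base_version, current = resource.read()  # steps 1-3
        if resource.update(base_version, compute(current)):  # steps 4-6
            return True
        # Step 7: denied, so loop and restart from step 1.
    return False


counter = VersionedResource(10)
stale_version, stale_value = counter.read()  # a slow updater reads first
read_update_write(counter, lambda v: v + 1)  # another updater commits in between
denied = not counter.update(stale_version, stale_value + 100)
print(counter.read(), denied)  # (1, 11) True
```

The slow updater's write is denied because its base version (0) no longer matches the resource's version (1), so it would have to start over from step 1.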

Optimistic concurrency control variant 2 - version number generated by the updaters

Read, Update, Write

1. Updater: Generate a version number v larger than previous ones.
2. Updater to resource: Update your version number to v
3. Resource: Update iff v is larger than the current version number.
4. Resource to updater: Current (version number, value)
5. Updater: If v is equal to the current version number, continue to compute a new value based
   on the returned value; otherwise, restart from step 1.
6. Updater to resource: Update (base version number=v, new value)
7. Resource: Update the value iff the base version number matches the current version number;
   deny otherwise.
8. Resource to updater: SUCCESS / DENIAL
9. Updater: If denied, restart from step 1.

Assigning an updater-generated version number to the resource is like putting a "preemptible lock" on it.
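
To make the "preemptible lock" idea concrete, here is a small sketch of variant 2. The class name `PreemptibleResource` and the method names `claim` and `update` are invented for this example. The step numbers in the comments refer to the nine steps of this slide.

```python
class PreemptibleResource:
    """Resource whose version number is assigned by the updaters (variant 2)."""

    def __init__(self, value):
        self.version = 0
        self.value = value

    def claim(self, v):
        # Steps 2-4: adopt v iff it is larger, then report the current state.
        if v > self.version:
            self.version = v
        return self.version, self.value

    def update(self, base_version, new_value):
        # Step 7: write iff the updater's version is still the current one.
        if base_version != self.version:
            return False  # DENIAL
        self.value = new_value
        return True  # SUCCESS


resource = PreemptibleResource("a")
version_1, seen_1 = resource.claim(1)  # updater 1 takes the "lock" with v=1
version_2, seen_2 = resource.claim(2)  # updater 2 preempts it with v=2
first = resource.update(1, seen_1 + "1")   # denied: v=1 was preempted
second = resource.update(2, seen_2 + "2")  # succeeds
print(first, second, resource.value)  # False True a2
```

Unlike a real lock, the claim of updater 1 never blocks updater 2. The price is that updater 1 only finds out at write time that it lost, and it then has to restart with a larger version number.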

Paxos - The core ideas

- Optimistic concurrency control (variant 2). Hold a "preemptible lock" first, try to update,
  restart on denial.

- Quorum as a logical unit of acceptor for the Choose operation. A value is chosen iff it's
  accepted by a quorum. This implies the Choose operation is atomic: it's all or nothing, the
  value is either accepted by a quorum or it isn't.

Paxos - The protocol

Phase 1a: Prepare - A proposer creates a proposal identified with a number N. This ID must be greater
than any previous proposal ID used by this proposer. Then, it sends a Prepare message containing this
proposal to a quorum of acceptors.

Phase 1b: Promise - If N is higher than any previous proposal ID received from any proposer by the
acceptor, then the acceptor must return a promise to ignore all future proposals having an ID less than N.
If the acceptor accepted a proposal at some point in the past, it must include the previous proposal ID
and previous value in the response. If N is not higher, the acceptor instead sends a denial (Nack) response,
which tells the proposer that it can stop its attempt to create consensus with proposal N.

Phase 2a: Accept Request - If a proposer receives promises from a quorum of acceptors, it needs
to assign a value to its proposal. If any of the acceptors had previously accepted a proposal, the proposer
must set the value of its proposal to the value associated with the highest proposal ID reported by the
acceptors. If none of the acceptors had accepted a proposal up to this point, then the proposer may choose
any value for its proposal. The proposer sends an Accept Request message to a quorum of acceptors with the
chosen value for its proposal.

Phase 2b: Accepted - If an acceptor receives an Accept Request message for a proposal N, it must accept
it if and only if it has not already promised a Prepare request having an ID greater than N. In this
case, it should register the corresponding value v and send an Accepted message to the proposer.
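
The acceptor side of phases 1b and 2b can be sketched as a small class. This is an illustration under simplifying assumptions: proposal IDs are plain integers, messages are modeled as method calls that return tuples, and the names `Acceptor`, `on_prepare`, and `on_accept` are not taken from any library.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Acceptor:
    max_proposal_id: int = 0  # highest proposal ID seen so far
    accepted_id: Optional[int] = None
    accepted_value: Any = None

    def on_prepare(self, n):
        """Phase 1b: promise, or nack if n is not higher than what was seen."""
        if n <= self.max_proposal_id:
            return ("nack", n)
        self.max_proposal_id = n
        return ("promise", n, self.accepted_id, self.accepted_value)

    def on_accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered Prepare was promised."""
        if n < self.max_proposal_id:
            return ("nack", n)
        self.max_proposal_id = n
        self.accepted_id = n
        self.accepted_value = value
        return ("accepted", n, value)


a = Acceptor()
print(a.on_prepare(1))         # ('promise', 1, None, None)
print(a.on_accept(1, "x=3"))   # ('accepted', 1, 'x=3')
print(a.on_prepare(2))         # ('promise', 2, 1, 'x=3')
print(a.on_accept(1, "x=4"))   # ('nack', 1)
```

The third call shows the promise carrying the previously accepted proposal back to the new proposer, and the last call shows an Accept Request being denied because its ID was preempted.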

Proposal ID generation

[Figure: a proposal ID made of two parts, Round number and Proposer ID]

To satisfy the "globally unique" and "monotonically increasing" properties, we compose a proposal ID from 2
parts: round number and proposer ID. The major part, the round number, starts from 1 for each proposer. The
minor part is the proposer ID.

For example, in the first round, proposer 3 will use the proposal ID 1.3. If that proposal is accepted by an
acceptor first and proposer 2 observes it, proposer 2 can then generate the proposal ID 2.2 for its own
proposal to take precedence over 1.3.
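
One way to model such IDs is a tuple that compares the round number first and the proposer ID second. The sketch below assumes this representation. `ProposalId` and `next_proposal_id` are hypothetical names.

```python
from typing import NamedTuple


class ProposalId(NamedTuple):
    round_number: int  # major part
    proposer_id: int   # minor part, makes IDs unique across proposers

    def __str__(self):
        return f"{self.round_number}.{self.proposer_id}"


def next_proposal_id(proposer_id, highest_seen=None):
    """Pick a round one above the highest round observed so far."""
    next_round = 1 if highest_seen is None else highest_seen.round_number + 1
    return ProposalId(next_round, proposer_id)


first = next_proposal_id(3)                       # proposer 3, first round
second = next_proposal_id(2, highest_seen=first)  # proposer 2 observed 1.3
print(first, second, second > first)  # 1.3 2.2 True
```

Tuples compare element by element, so 2.2 outranks 1.3, and two proposers in the same round (for example 1.2 and 1.3) can never produce equal IDs.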

Paxos - Message flow

1. Proposer: Generate the next proposal ID N and send Prepare(N).
2. Acceptor: Update max_proposal_id if N is greater than the current one; return the currently
   accepted proposal, if any.
3. Acceptor to proposer: Promise(N, accepted_proposal_id, accepted_value) / NACK
4. Proposer: If a quorum of acceptors promised, send Accept(N, value) with:
   - ID: N
   - value: the value associated with the highest proposal number reported, if any, or its own
     value otherwise
5. Acceptor: Accept iff N >= max_proposal_id; deny otherwise.
6. Acceptor to proposer: Accepted(N, value) / NACK
7. Proposer: If a quorum of acceptors accepted the value, then the value is chosen.
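
Put together, one round of the message flow could be simulated as below. This is a simplified sketch: everything runs in one thread, messages are modeled as direct method calls, the proposer contacts all acceptors instead of picking one quorum, and a quorum is assumed to be a majority. The names `Acceptor` and `run_round` are invented for the example.

```python
class Acceptor:
    def __init__(self):
        self.max_proposal_id = (0, 0)  # (round number, proposer ID)
        self.accepted = None           # (proposal_id, value) or None

    def prepare(self, n):
        # Steps 2-3: promise iff n is greater than anything seen so far.
        if n > self.max_proposal_id:
            self.max_proposal_id = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Steps 5-6: accept iff n >= max_proposal_id.
        if n >= self.max_proposal_id:
            self.max_proposal_id = n
            self.accepted = (n, value)
            return True
        return False


def run_round(n, own_value, acceptors):
    """Run one Prepare/Accept round; return the chosen value or None."""
    quorum_size = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]  # step 1
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < quorum_size:
        return None
    reported = [p for p in promises if p is not None]
    # Step 4: adopt the value with the highest accepted proposal ID, if any.
    value = max(reported, key=lambda p: p[0])[1] if reported else own_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= quorum_size else None  # step 7


acceptors = [Acceptor() for _ in range(3)]
print(run_round((1, 1), 3, acceptors))  # 3
print(run_round((1, 2), 4, acceptors))  # 3
print(run_round((1, 1), 5, acceptors))  # None
```

The second round wanted x=4 but adopts the already accepted x=3, and the third round fails because its ID (1, 1) has been preempted by (1, 2).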

Case 1 - A value has already been chosen (a new proposer will always see it)

Prepare (1.1) Accept (1.1, x=3)

Prepare (1.1) Accept (1.1, x=3) Prepare (1.2) Accept (1.2, x=3)

Prepare (1.2) Accept (1.2, x=3)

When proposer-green prepares, it will always notice that x=3 has been accepted, so it must propose the same value. This
ensures the safety property. Moreover, if a minority of acceptors fail after a value is chosen, a new proposer will still
propose the same value.

Case 2 - No value has been chosen, but the new proposer sees a previously accepted value

Prepare (1.1) Accept (1.1, x=3)

Prepare (1.1) Accept (1.1, x=3) Prepare (1.2) Accept (1.2, x=3)

Prepare (1.2) Accept (1.2, x=3)

When proposer-green prepares, x=3 hasn't been chosen yet, but since proposer-green notices that x=3 has been accepted by
an acceptor, it proposes the same value to converge with the concurrent proposer.

Case 3 - No value has been chosen, and the new proposer doesn't see a previously accepted value

Prepare (1.1) Accept (1.1, x=3)

Prepare (1.1) Prepare (1.2) Accept (1.1, x=3) Accept (1.2, x=4)

Prepare (1.2) Accept (1.2, x=4)

Before proposer-blue sends its accept request for x=3, its proposal ID 1.1 has been preempted by proposer-green's 1.2 on
the acceptor they share, hence that acceptor rejects the accept request.

Livelock problem
1. prepare (1.1) 2. accept (1.1, x=3)

1. prepare (1.1) 2. prepare (1.2) 3. accept (1.1, x=3) 4. prepare (2.1, x=4) 5. accept (1.2, x=3) ......

1. prepare (1.2) 2. accept (1.2, x=4)

When there are competing proposers, Paxos may in theory take an unbounded number of rounds to reach consensus, because the
proposers keep preempting each other. This is called livelock. Safety still holds, but no progress is made while the
contention lasts. One practical solution is to use a randomized delay before restarting.
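
A randomized delay before restarting might be wired in roughly like this. The function `propose_until_chosen`, its parameters, and the 50 ms default are assumptions for the sketch. `try_round` stands for whatever runs one Prepare/Accept round and returns the chosen value or None.

```python
import random
import time


def propose_until_chosen(try_round, max_delay=0.05, max_attempts=20,
                         rng=random, sleep=time.sleep):
    """Retry rounds, sleeping a random time after every failed round."""
    for attempt in range(1, max_attempts + 1):
        result = try_round(attempt)
        if result is not None:
            return result
        # A random pause makes it unlikely that two competing proposers
        # restart at the same moment and preempt each other again.
        sleep(rng.uniform(0, max_delay))
    return None


# Stub round that is preempted twice before it succeeds (illustration only).
outcomes = iter([None, None, "x=3"])
slept = []
result = propose_until_chosen(lambda attempt: next(outcomes), sleep=slept.append)
print(result, len(slept))  # x=3 2
```

Passing `sleep` as a parameter keeps the example fast and testable; real code would let it default to `time.sleep`.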

Proof of safety property

Assume x=3 has been chosen (accepted by a quorum of acceptors). Let's prove that it's impossible for
one of the acceptors to be overwritten by another proposal x=4 later on.

Proof: To overwrite an already chosen value in one of the acceptors, the new proposal must have a
higher proposal ID than the previously accepted one. Let's focus on the Prepare phase of the new
proposal. Because any two quorums share at least one acceptor, the new Prepare reaches at least one
acceptor that accepted x=3. 1) At that acceptor, the new Prepare must come after the previous Accept;
otherwise x=3 could not have been accepted there, as its proposal ID would have been preempted. 2) This
implies the new proposer must have seen that x=3 was accepted. The only possible reason for it to
propose x=4 instead of x=3 is that another acceptor had accepted x=4 with a higher proposal ID than the
proposal ID of x=3. 3) But this implies a quorum of acceptors had promised to the higher proposal ID,
which in turn implies x=3 could not have been chosen. So there's no chance to overwrite an already
chosen value.

Paxos - Revisiting the key points

Q: In phase 2a, if any acceptors had previously accepted any proposal, why
must the proposer set the value of its proposal to the value associated with
the highest proposal ID?

A: Let's say proposal (ID=n, value=v) is the previously accepted proposal
with the highest proposal ID. This implies that a quorum of acceptors has promised
to Prepare(n), which in turn implies that any other proposal (ID=m, value=w)
where m < n has no chance to be accepted. We can safely ignore it and only
care about v.

Paxos - Revisiting the key points

Q: What if an acceptor of a quorum fails after a value has been chosen by
the quorum? Say there are 3 acceptors {A, B, C}, x=3 has been chosen by the
quorum {A, B}, but later A crashes, leaving only B with x=3. Is there any
problem?

A: No problem. When a new proposer selects a quorum ({B, C} in this
example) to prepare, the quorum will always include B, hence the proposer
will always propose x=3, which is the same as the chosen value.

Best practices - Using a leader as the only proposer

In practice, Paxos is usually just used for leader election, and only the leader
proposes values. This simplifies the consensus problem as there's no
concurrency any more. When the leader fails, followers run Paxos again to
elect a new leader. During the election, the system is unavailable (typically for a
few seconds), so this is a tradeoff between availability and throughput.

As an aside, a leader in distributed systems is like a leader in the real world. We elect
someone as the president for the next 4 years, and she/he makes decisions for us without asking
every single time.

Any questions?
If you find any errors in this tutorial or have any questions, you can comment inline or reach out to
me via weidagang@gmail.com.