
Upgrade to MySQL 5.6 without downtime
Meetup LeMug.fr @Dailymotion - Paris - Sept 17, 2015

Olivier Dasini - @freshdaz

Agenda
Me, Myself & I
Technical background
Why upgrade to 5.6?
Performance testing
Preprod upgrade
Production upgrade
Wrap-up

Me, Myself & I


Olivier DASINI - @freshdaz

MySQL Geek & Data enthusiast

Technical writer, blogger and speaker

Insatiable hunger for learning

Co-creator of the French MySQL User Group


Technical background 1/3


MySQL users can be split into 3 types according to the order of magnitude
of their working set:

<= Tens of GBs : 20%
  MySQL usage probably not (so) critical
  Migration (quite) easy, could be manual

<= Tens of TBs : 75%
  MySQL is critical => strong production constraints
  Migration should be carefully planned
  Needs automation, although some parts could be manual

>= Hundreds+ of TBs : 5%
  MySQL highly critical. Think twice (or more) before upgrading.
  Same as above, with automation everywhere

Technical background 2/3


The company :

Software development

Provides a cloud-based customer service platform

~ 1,000 people

~ 60,000 paid customers in 150 countries


Technical background 3/3


MySQL flavour : Percona Server 5.5 on Fusion IO
Data size : ~ 30 TB | Daily growth rate : up to 40 GB
# MySQL group of replicas (1 Master / n Slaves) : ~ 50
# MySQL instances : ~ 200
Mostly OLTP oriented workload - InnoDB tables
Thousands of qps, mostly reads (SELECTs)
Replication lag sensitive
No downtime allowed!!!

Why upgrade to 5.6? 1/3


Tons of cool new stuff :

Security improvements
InnoDB enhancements
Partitioning
Performance Schema
Replication and logging
Optimizer enhancements

Complete list :

http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html

Why upgrade to 5.6? 2/3


Choose what features we'd like to have.
  Team brainstorming...

Define which added features will suit
  Schedule when we'll use them
  Avoid too many changes at one time

Pay attention to deprecated features
  They'll probably be removed in a future version
  Shouldn't be used anymore

Pay extra attention to removed features
  They'll break your server



Why upgrade to 5.6? 3/3


Team brainstorming result :

InnoDB enhancements
  Persistent stats
  Online DDL
  New flushing algo
  New checksum algo

Performance Schema

Replication
  Smaller image for row-based replication
  Crash-safe master : crash-safe binlog
  Crash-safe slave : table logging for master / slave info
  GTID (for automatic switchover/failover) : [Phase 2]
  Parallel replication : [Phase 3]

Optimizer enhancements...

Upgrade Confidence Index : 60%

Performance testing 1/13


5.6 upgrade will be awesome
(at least in theory)
Many articles prove it, yeah!
http://dimitrik.free.fr/blog/archives/2013/02/mysql-performance-mysql-56-vs-mysql-55-vsmariadb-55.html
https://blogs.oracle.com/MySQL/entry/mysql_5_6_is_a

Benchmarks never lie :) but is their truth ours?

In real life, performance will depend on many factors like workload, hardware,
configuration, ...
What about us?

Performance testing 2/13

The plan is to get our own numbers


Compare 5.5 and 5.6 performance in a production context
Unfortunately we have customers !!! :)
So: out of production, but in a context as similar as possible:

Data
Queries
Workload
Hardware
Configuration...

=> Ad-hoc 5.6 upgrade on 1 server



Performance testing 3/13


Build a 5.6 test server from a 5.5 slave.
Choose a "small" cluster (1.5 TB)
Ad-hoc upgrade is quite straightforward:
Clone a 5.5 server -> Upgrade to 5.6 -> Set up replication
Steps
1. Take a binary backup (Xtrabackup) from db5.5 (5.5 instance)
2. Restore the binary backup on the new server (5.6 candidate but still in 5.5)
3. Upgrade the binaries to 5.6 + new configuration (5.6 my.cnf)
4. Run mysql_upgrade
5. Start replication (master is still in 5.5)


Performance testing 4/13


Issue : Fatal replication error 1/2
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log
event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first
event 'db_master_5.5-bin-log.003440' at 974453835, the last event read from
'/var/log/mysql/db_master_5.5-bin-log.003440' at 974453835, the last byte read from
'/var/log/mysql/db_master_5.5-bin-log.003440' at 974453854.'

On the master binary log:


ERROR: Error in Log_event::read_log_event(): 'Event too big', data_len: 1852797793,
event_type: 104
Could not read entry at offset 974453835: Error in log format or read error.
#150318 18:09:39 server id 174326798 end_log_pos 107 Start: binlog v 4, server v 5.5.3231.0-log created 150318 18:09:39

Performance testing 5/13


Issue : Fatal replication error 2/2

We never found an explanation.
  We tried to increase max_allowed_packet dynamically on both the master
  and the 5.6 slave, but it had no effect.
  Only the 5.6 slave was impacted, i.e. no issues on the 5.5 slaves

No fix except ignoring this binlog, i.e. switching to the next one.
  Meaning a risk of losing events
  Also a high risk of inconsistency

So we dropped the data and reloaded a fresh 5.5 dump + mysql_upgrade.

Performance testing 6/13


The goal is to compare performance between 5.5 & 5.6
5.6 status :
  Replicating data like any other 5.5 slave
  Contains production data
  Same hardware characteristics

Ready to start our benchmarks \o/


Performance testing 7/13


Tool
pt-upgrade : https://www.percona.com/doc/percona-toolkit/2.2/pt-upgrade.html
pt-upgrade executes queries in the given MySQL LOGS on each DSN, compares the
results, and reports any significant differences. The tool can also save the
results for later analyses. LOGS can be slow, general, binary, tcpdump and raw.
Best practices

Split your (slow) logs into small chunks : 200 ~ 500 MB of data
  Easier to manage
  Output easier to analyse
Choose your data samples carefully
  Capture queries at different times
  Reduces the risk of missing important queries
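
The chunking advice above could be sketched as follows. This is a minimal illustration, not the tooling used for the talk: it assumes slow log entries begin with a `# Time:` line or, when that line is absent, a `# User@Host:` line, and the function name `split_slow_log` is hypothetical. Chunks are only cut at entry boundaries so that no query is split in two.

```python
from typing import Iterable, Iterator, List


def split_slow_log(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield lists of lines, each about max_bytes long, cut at entry boundaries.

    Assumption: an entry starts with '# Time:' or, if that header is missing,
    with '# User@Host:'.
    """
    chunk: List[str] = []
    size = 0
    prev_was_time = False
    for line in lines:
        is_time = line.startswith("# Time:")
        is_user = line.startswith("# User@Host:")
        # '# User@Host:' right after '# Time:' belongs to the same entry.
        starts_entry = is_time or (is_user and not prev_was_time)
        if starts_entry and chunk and size >= max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line.encode("utf-8"))
        prev_was_time = is_time
    if chunk:
        yield chunk


# Hypothetical sample with three entries.
sample = [
    "# Time: 150318 18:09:39\n",
    "# User@Host: app[app] @  [10.0.0.1]\n",
    "SELECT 1;\n",
    "# User@Host: app[app] @  [10.0.0.1]\n",
    "SELECT 2;\n",
    "# Time: 150318 18:09:40\n",
    "# User@Host: app[app] @  [10.0.0.1]\n",
    "SELECT 3;\n",
]
chunks = list(split_slow_log(sample, max_bytes=1))
```

With a real log, `lines` would be an open file object and `max_bytes` something like 300 * 1024 * 1024, with each chunk written to its own file.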

Performance testing 8/13


Phase 1 - Collect Slow Logs
For each collection :
Connect to a 5.5 slave in production
Set long_query_time to 0
  mysql> SET GLOBAL long_query_time = 0;
Clean the slow log
  $ cp /dev/null /var/log/mysql/slow-log
Wait for X mins or watch the slow log grow to ~300MB (whichever comes 1st)
Set long_query_time back to its default value
  mysql> SET GLOBAL long_query_time = <DEFAULT_VALUE>;
Copy the slow log with a date stamp
  $ cp /var/log/mysql/slow-log ./slow-log-$(date +"%F-%H-%M-%S")
Clean the slow log
  $ cp /dev/null /var/log/mysql/slow-log

Performance testing 9/13


Phase 2 - Benchmarks (cold & warm buffers) and Compare 1/2
1. Ensure both slaves - 5.5 & 5.6 - have no replication lag
2. Stop replication on db_5.5:
a. mysql_5.5> STOP SLAVE;
3. Wait for a few seconds....
4. Stop replication on db_5.6:
a. mysql_5.6> STOP SLAVE;
5. Note down the master log file and position from step 4 above.
6. Both slaves should be in perfect sync.
Bring db_5.5 to the same master log file/position as db_5.6, so that
when pt-upgrade is run, both servers return the same set and the same
number of rows
a. mysql_5.5> START SLAVE SQL_THREAD UNTIL MASTER_LOG_FILE =
'<log_file>', MASTER_LOG_POS = <log_position>;

Performance testing 10/13


Phase 2 - Benchmarks (cold & warm buffers) and Compare 2/2
7. Run pt-upgrade on db_5.5 (reference results)
a. Cold bench (after a mysql restart)
b. Warm bench (after the first run)
8. Run pt-upgrade on db_5.6
a. Cold bench (after a mysql restart)
b. Warm bench (after the first run)
9. Put db_5.5 back into production
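
Steps 4 to 6 of Phase 2 (note db_5.6's position, then let db_5.5 catch up to it) could be scripted roughly as below. This is a sketch under assumptions: the slides do not say which `SHOW SLAVE STATUS` columns were noted down, so `Relay_Master_Log_File` and `Exec_Master_Log_Pos` (the executed position) are assumed here, and the helper names are hypothetical. The status rows are plain dicts, so no database connection is needed to try it.

```python
import re
from typing import Any, Mapping, Tuple

# Conservative whitelist for binlog file names before quoting them into SQL.
_BINLOG_NAME = re.compile(r"^[\w.\-]+$")


def executed_position(status: Mapping[str, Any]) -> Tuple[str, int]:
    """Return (master log file, master log position) executed by a slave.

    Assumption: 'status' is one row of SHOW SLAVE STATUS as a dict.
    """
    return str(status["Relay_Master_Log_File"]), int(status["Exec_Master_Log_Pos"])


def in_sync(status_a: Mapping[str, Any], status_b: Mapping[str, Any]) -> bool:
    """True when both slaves have executed up to the same master position."""
    return executed_position(status_a) == executed_position(status_b)


def start_until_statement(status_56: Mapping[str, Any]) -> str:
    """Build the statement to run on db_5.5 so it stops where db_5.6 stopped."""
    log_file, log_pos = executed_position(status_56)
    if not _BINLOG_NAME.match(log_file):
        raise ValueError(f"unexpected binlog file name: {log_file!r}")
    return (
        "START SLAVE SQL_THREAD UNTIL "
        f"MASTER_LOG_FILE = '{log_file}', MASTER_LOG_POS = {log_pos};"
    )


# Hypothetical status rows: db_5.6 was stopped a little later than db_5.5.
status_55 = {"Relay_Master_Log_File": "bin-log.003441", "Exec_Master_Log_Pos": 1200}
status_56 = {"Relay_Master_Log_File": "bin-log.003441", "Exec_Master_Log_Pos": 1875}
statement = start_until_statement(status_56)
```

Once db_5.5 has reached that position, `in_sync` on two fresh status rows is a cheap check before starting pt-upgrade.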


Performance testing 11/13


Our tests were interesting
Query response time was usually equal or better in 5.6
However we found 1 big query regression

Query time: From (0.09 sec) to (16 min 40.35 sec)

Upgrade Confidence Index : 75%
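
To put the regression in numbers: 16 min 40.35 sec is 1000.35 sec, roughly 11,000 times slower than 0.09 sec. A small filter like the one below could flag such cases from per-query timings. It is only a sketch: the data structure and the name `find_regressions` are hypothetical, and apart from the regressed query the timings are made up for illustration.

```python
from typing import Dict, List, Tuple


def find_regressions(
    timings: Dict[str, Tuple[float, float]], factor: float = 2.0
) -> List[Tuple[str, float]]:
    """Return (query id, slowdown ratio) for queries at least 'factor' times slower.

    'timings' maps a query id to (seconds on 5.5, seconds on 5.6).
    Results are sorted with the worst slowdown first.
    """
    flagged = []
    for query_id, (old, new) in timings.items():
        if old > 0 and new / old >= factor:
            flagged.append((query_id, new / old))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


timings = {
    "q_regressed": (0.09, 16 * 60 + 40.35),  # the numbers from the slide
    "q_same": (0.50, 0.50),                  # hypothetical
    "q_faster": (1.20, 0.80),                # hypothetical
}
regressions = find_regressions(timings)
```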


Performance testing 12/13


Issue : Query regression

Basically, the optimizer was choosing the wrong index.
  Bug filed with MySQL (by Percona)

Possible fixes :

Disable the index extensions algorithm (pre 5.6.9 behavior)
  SET optimizer_switch="use_index_extensions=off";
Use an index hint: IGNORE / FORCE INDEX
  IGNORE INDEX (bad_index) || FORCE INDEX (good_index)
Use the NULL-safe equal operator, i.e. replace "IS NULL" with "<=> NULL"
  column_id <=> NULL
Rewrite the query
  The most sustainable choice
  Many possibilities, worked out with the appropriate dev team

Performance testing 13/13


As soon as the query was fixed and tested we put the 5.6 server in
production.
  5.6 is treated like the other 5.5 slaves
  Monitored closely for weeks

Slow query log analysis shows good numbers
  Fewer slow queries
  Smaller amount of total slow query time
  Lower CPU usage

So far so good

Upgrade Confidence Index : 90%



Preprod upgrade 1/9


Workload different from production : smaller
Data size different from production : tinier
Hardware also different
=> Not relevant for performance tests
But it is very important to :
Test the upgrade process
  Can't do it manually
  Should be transparent for our customers
Know how our internal tools / other apps will behave with 5.6
  Databases are used in so many different ways
  Can't test them all, so if it breaks someone will shout!
Make other MySQL consumers aware of this migration
  We need their feedback
This step is also very important because downgrading an entire cluster (back to
5.5) is a painful operation

Preprod upgrade 2/9


Preprod technical context
Flavour : Percona Server 5.5 on VMs
Data size : ~ GBs
# MySQL group of replicas : 4
# MySQL instances : 12
Mostly OLTP oriented workload - InnoDB tables
Hundreds of qps, mostly reads (SELECTs)
Replication lag sensitive - Preferably no downtime

Preprod upgrade 3/9


Overall process - Upgrade the 1st slave

Take one slave per cluster out of rotation (OOR)
Upgrade the slave [more details later]
Put it back into rotation (as a replica)
Checks / Tests / Monitor
Back up the slave (binary backup w/ Xtrabackup)
  Base backup for the other slaves
  Similar to what we'll use in production (obvious!)



Preprod upgrade 4/9


Overall process - Upgrade the 2nd (other) slave(s)

Take the 5.5 slave out of rotation (OOR)
Drop the data
Upgrade the binaries
Restore the 5.6 binary backup on this slave.
Put it back into rotation
Checks / Tests / Monitor

So far, a downgrade is still quite easy:
  Binary backup from master, restore to slave after binaries downgrade



Preprod upgrade 5/9


Overall process - Upgrade the master
Last step, easy but very sensitive

Switch master failover
  Promote a 5.6 slave to become the new master
  Usually less than 1 second in read-only mode

Then upgrade the old master & restore it from a 5.6 backup

We have our own internal tool for switch master failover
  but 5.6 broke it
  Whole cluster in a read-only state without a master, i.e. no writes allowed
  Fortunately that happened in preprod :)

Preprod upgrade 6/9


Issue : Internal tools broken - Switch master failover
The tool used the deprecated statements SLAVE START and SLAVE STOP
instead of START SLAVE and STOP SLAVE, and they were removed in 5.6.
"In old versions of MySQL (before 4.0.5), this statement was called SLAVE START. This usage
is still accepted in MySQL 5.5 for backward compatibility, but is deprecated and is removed
in MySQL 5.6" : https://dev.mysql.com/doc/refman/5.5/en/start-slave.html
Removed in 5.6: "The SLAVE START and SLAVE STOP statements. Use the START SLAVE and STOP SLAVE statements" :
http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html

Fix: Use the right statements
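
A fix of this kind could look like the sketch below, which rewrites the removed word order into the supported one. It is not the internal tool's code (the slides do not show it); the function name `modernize` is hypothetical, and it only handles a statement that begins with the old syntax.

```python
import re

# Matches the removed word order at the start of a single statement.
_OLD_SYNTAX = re.compile(r"^(\s*)SLAVE\s+(START|STOP)\b", re.IGNORECASE)


def modernize(statement: str) -> str:
    """Rewrite 'SLAVE START' / 'SLAVE STOP' as 'START SLAVE' / 'STOP SLAVE'.

    Statements that already use the supported syntax are returned unchanged.
    """
    return _OLD_SYNTAX.sub(
        lambda m: f"{m.group(1)}{m.group(2).upper()} SLAVE", statement
    )


rewritten = [modernize(s) for s in ("SLAVE START", "slave stop;", "STOP SLAVE;")]
```

The same pattern can be used with `search` instead of `sub` to scan tool sources for remaining uses before the upgrade.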



Preprod upgrade 7/9


Issue : Internal tools broken - Internal usage
Because of the new configuration, new information is logged
in the binlog:
"You can also cause the server to write checksums for the events using CRC32
checksums by setting the binlog_checksum system variable" :
http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html
http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_checksum

These tools parse the binlog

Fix : Development by the relevant team
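
Why a binlog parser breaks can be illustrated with a toy example. The sketch below assumes the v4 event layout (a 19-byte little-endian header carrying the event size) and assumes that, with binlog_checksum=CRC32, each event ends with a 4-byte CRC32 of the bytes before it. The events are synthetic, built by the hypothetical `build_event` helper, and this is not the code of the internal tools.

```python
import struct
import zlib
from typing import NamedTuple

# v4 common header: timestamp, type, server_id, event_size, log_pos, flags
HEADER = struct.Struct("<IBIIIH")  # 19 bytes, little-endian
CHECKSUM_LEN = 4                   # assumed CRC32 trailer length


class Event(NamedTuple):
    type_code: int
    server_id: int
    log_pos: int
    body: bytes


def parse_event(raw: bytes, has_checksum: bool) -> Event:
    """Parse one complete event; verify and strip the CRC32 trailer if present."""
    if len(raw) < HEADER.size:
        raise ValueError("event shorter than the 19-byte header")
    _ts, type_code, server_id, size, log_pos, _flags = HEADER.unpack_from(raw)
    if size != len(raw):
        raise ValueError("event_size does not match the bytes provided")
    end = len(raw)
    if has_checksum:
        end -= CHECKSUM_LEN
        if end < HEADER.size:
            raise ValueError("event too short to hold a checksum")
        (stored,) = struct.unpack_from("<I", raw, end)
        if zlib.crc32(raw[:end]) & 0xFFFFFFFF != stored:
            raise ValueError("CRC32 mismatch")
    return Event(type_code, server_id, log_pos, raw[HEADER.size:end])


def build_event(type_code: int, server_id: int, body: bytes, with_checksum: bool) -> bytes:
    """Build a synthetic event for this demo (not real server output)."""
    size = HEADER.size + len(body) + (CHECKSUM_LEN if with_checksum else 0)
    raw = HEADER.pack(0, type_code, server_id, size, 0, 0) + body
    if with_checksum:
        raw += struct.pack("<I", zlib.crc32(raw) & 0xFFFFFFFF)
    return raw


raw = build_event(2, 1, b"hello", with_checksum=True)
aware = parse_event(raw, has_checksum=True)     # body without the trailer
unaware = parse_event(raw, has_checksum=False)  # trailer leaks into the body
```

A parser written for 5.5 behaves like the `has_checksum=False` call: it treats the 4 trailing bytes as part of the event body.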


Preprod upgrade 8/9


Upgrade workflow 1/2
1. Extract schema and data + Pre-upgrade checks
2. Drop MySQL directories (datadir, logdir)
[ binaries upgraded to 5.6 by OPS + Disk encryption ] : OPS tasks
3. Load schema + Post-upgrade checks
4. Load data + Post-upgrade check2 & Compare differences in "before" & "after" checks

Checks: object count, charset,...
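
The "before" vs "after" comparison could be as simple as a dictionary diff. The sketch below assumes each check run is collected into a dict of named values; the check names and numbers are hypothetical, and `diff_checks` is not the name of an actual script from the talk.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a check present on one side only


def diff_checks(
    before: Dict[str, Any], after: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {check name: (before value, after value)} for every difference."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name, MISSING), after.get(name, MISSING))
        for name in names
        if before.get(name, MISSING) != after.get(name, MISSING)
    }


# Hypothetical check results taken before and after the reload.
before = {"tables": 120, "views": 4, "default_charset": "utf8"}
after = {"tables": 120, "views": 3, "default_charset": "utf8"}
differences = diff_checks(before, after)
```

An empty result means the post-upgrade state matches the pre-upgrade state for every collected check.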



Preprod upgrade 9/9


Upgrade workflow 2/2

Upgrade process was split into a dozen scripts
  These scripts were called by 4 main wrapper scripts for convenience
  2 levels of granularity provide more flexibility
  In case of an issue, DBAs can resume the process "manually" at any step
  An extra step can easily be added (e.g. a schema modification)

Automation is important
  Tasks are pretty straightforward but time consuming
  Lowers the risk of error
  Hundreds of servers

DBAs need to be aware of the status
  Scripts send emails to DBAs when
    A task is completed
    An error occurs

Upgrade Confidence Index : 95%
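
The wrapper idea (small steps, resumable at any step, a notification on completion or error) could be sketched as below. This is an illustration only: the real scripts are not shown in the slides, `run_steps` and the step names are hypothetical, and `notify` stands in for whatever sends the email.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(
    steps: List[Step],
    notify: Callable[[str], None],
    start_at: Optional[str] = None,
) -> bool:
    """Run steps in order, optionally resuming at the step named 'start_at'.

    Calls notify() once: on the first failure, or when every step is done.
    Returns True on success, False if a step raised.
    """
    names = [name for name, _ in steps]
    first = 0 if start_at is None else names.index(start_at)
    for name, action in steps[first:]:
        try:
            action()
        except Exception as exc:
            notify(f"FAILED at step '{name}': {exc}")
            return False
    notify("Task completed")
    return True


# Hypothetical run: the second step fails once, then the DBA resumes from it.
sent: List[str] = []
state = {"fail": True}


def load_schema() -> None:
    if state["fail"]:
        raise RuntimeError("disk full")


steps: List[Step] = [
    ("extract", lambda: None),
    ("load_schema", load_schema),
    ("load_data", lambda: None),
]
first_run = run_steps(steps, sent.append)
state["fail"] = False
second_run = run_steps(steps, sent.append, start_at="load_schema")
```

Keeping each step as a named callable is what makes both resuming and inserting an extra step cheap.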


Prod upgrade
Final step(s), final tests

Preprod is similar but not identical to prod.


To be more comfortable we

Added extra slaves on our smaller clusters


Ran the full process on them

Not possible to test the switch master failover


But we were confident enough to start, so we started

In progress...
Upgrade Confidence Index : 99%


Wrap-up

Identify what's relevant for you in the new release
  Understand the changes : added / removed features
  Don't be an early adopter (if you don't have a proper support team)
  : let others clear the way

Make your own tests
  Performance : related to your workload / data set
  Functional : do your apps depend on a removed/changed feature?

Split the work in lots
  Easier to manage/debug/...

Automation
  Manual things are error prone
  Write it once, use it at will

Communication
  Explain / describe what you are going to do
  Involve consumers, look for their feedback


Questions?
Olivier DASINI
Twitter : @freshdaz
Mail : olivier@dasini.net
Skype : olivier.dasini

Thank you!