
1

That's Billion with a B:


Scaling to the next level at WhatsApp
Rick Reed
WhatsApp
Erlang Factory SF
March 7, 2014

2
About Me
Joined WhatsApp in 2011
Learned Erlang at WhatsApp
Scalability & multimedia team
Small (~10 on Erlang)
Handle development and ops

3
Erlang
Awesome choice for WhatsApp
Scalability
Non-stop operations

4
Numbers
465M monthly users
19B messages in & 40B out per day
600M pics, 200M voice, 100M videos
147M concurrent connections
230K peak logins/sec
342K peak msgs in/sec, 712K out

5
Multimedia Holiday Cheer
146Gb/s out (Christmas Eve)
360M videos downloaded (Christmas Eve)
2B pics downloaded (46k/s) (New Year's Eve)
1 pic downloaded 32M times (New Year's Eve)

6
System Overview
[Diagram: phones, a pool of chat servers, a pool of mms servers, back-end services (Account, Profile, Push, Group, ...), and offline storage]

7
Output scale
Messages, Notifications, & Presence

8
Throughput scale
(4 of 16 partitions)
psh311 | ERL   msg------------------          dist----------------   wan-----------------------
 time  | nodes qlen qmax nzq recv   msgin  msgout kbq wanin wanout nodes kbq
02/25 07:30:01 408 0 0 0 435661 221809 135867 0 0 0 0 0
02/25 08:00:00 408 0 0 0 446658 227349 140331 0 0 0 0 0
02/25 08:30:01 408 0 0 0 454529 231521 143486 0 0 0 0 0
02/25 09:00:01 408 0 0 0 455700 231930 143928 0 0 0 0 0
02/25 09:30:00 408 0 0 0 453770 231048 143467 0 0 0 0 0
mnes----------------------------- io-in---out  sched gc---  mem---
tminq tmoutq tmin  tmout nodes kb/s  kb/s  %util  /sec  tot Mb
  0     0    11371 11371   4   32511 39267  44.0  47860 166808
  0     0    11418 11420   4   33502 39693  45.4  49255 166817
  0     0    11473 11472   4   34171 40460  46.3  50212 166830
  0     0    11469 11468   4   34306 40811  46.5  50374 166847
  0     0    11257 11254   4   34159 40763  46.3  50208 166870
(2 of 16 partitions)
prs101 | ERL msg------------- dist--------- mnes---------------------- sched mem---
 time  | nodes qlen qmax recv msgin msgout tminq tmoutq tmin tmout %util tot Mb
02/24 10:00:00 400 0 0 357383 174975 104489 0 0 76961 76999 27.7 15297
02/24 10:30:00 400 0 0 352178 172389 102970 0 0 75913 75893 27.3 15352
02/24 11:00:01 400 0 0 347643 170111 101688 0 0 74894 74916 27.0 15227
02/24 11:30:01 400 0 0 341300 167085 99822 0 0 73467 73478 26.6 15170

9
Db scale
(1 of 16 partitions)
==========================================================================================
Active Tables        Local Copy Type   Records         Bytes
------------------------------------------------------------------------------------------
mmd_obj251287        disc_copies       165,861,476     32,157,681,888
mmd_reclaim          disc_copies       5,898,714       861,434,424
mmd_ref351287        disc_copies       932,819,505     168,494,166,624
mmd_upload251287     disc_copies       1,874,045       262,430,920
mmd_xcode351287      disc_copies       7,786,188       2,430,697,040
schema               disc_copies       514             568,664
------------------------------------------------------------------------------------------
Total                                  1,114,240,442   204,206,979,560
==========================================================================================

10
Hardware Platform
~550 servers + standby gear
~150 chat servers (~1M phones each)
~250 mms servers
2x2690v2 Ivy Bridge 10-core (40 threads total)
64-512 GB RAM
SSD (except video)
Dual-link GigE x 2 (public & private)
> 11,000 cores

11
Software Platform
FreeBSD 9.2
Erlang R16B01 (+patches)

12
Improving scalability
Decouple
Parallelize
Decouple
Optimize/Patch
Decouple
Monitor/Measure
Decouple

13
Decouple
Attempt to isolate trouble/bottlenecks
Downstream services (esp. non-essential)
Neighboring partitions
Asynchronicity to minimize impact of latency on throughput

14
Decouple
Avoid mnesia txn coupling: async_dirty
Use calls only when returning data, else cast (sketch below)
Make calls w/ timeouts only: no monitors
Non-blocking casts (nosuspend) sometimes
Large distribution buffers
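A minimal sketch (not WhatsApp's code) of these rules, using a hypothetical worker registered as offline_store; gen_server call/cast, the nosuspend send option, and the +zdbbl note are stock Erlang/OTP, everything else is illustrative.

-module(decouple_sketch).
-export([fetch/1, store/2, notify/2]).

%% Call only when the caller needs the reply, and bound it with an explicit
%% timeout rather than a monitor on the server.
fetch(Key) ->
    gen_server:call(offline_store, {fetch, Key}, 5000).

%% Fire-and-forget writes go out as casts so a slow store never stalls the
%% caller.
store(Key, Value) ->
    gen_server:cast(offline_store, {store, Key, Value}).

%% nosuspend keeps the sender from being descheduled when the distribution
%% buffer toward the target node is busy: shed rather than block.
%% ("Large distribution buffers" = raising the emulator's +zdbbl
%% dist_buf_busy_limit at node start.)
notify(Pid, Event) ->
    case erlang:send(Pid, {event, Event}, [nosuspend]) of
        ok        -> ok;
        nosuspend -> dropped
    end.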

15
Parallelize
Work distribution: start with gen_server
Spread work to multiple workers: gen_factory
Spread dispatch to multiple procs: gen_industry
Worker select via key (for db) or FIFO (for i/o) (sketch below)
Partitioned services
Usu. 2-32 partitions
pg2 addressing
Primary/secondary (usu. in pairs)
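gen_factory and gen_industry are WhatsApp-internal, so the sketch below only illustrates key-hashed worker selection over a pg2 group (the grouping module of that era) with erlang:phash2; a strict FIFO dispatcher for i/o work is not shown, group names and message shapes are made up, and error handling for empty groups is omitted.

-module(dispatch_sketch).
-export([cast_by_key/3, cast_any/2]).

%% Key-based selection: the same key always maps to the same worker, which
%% keeps a given db record owned by a single process/node.
cast_by_key(Group, Key, Msg) ->
    Members = lists:sort(pg2:get_members(Group)),
    Worker = lists:nth(1 + erlang:phash2(Key, length(Members)), Members),
    gen_server:cast(Worker, Msg).

%% When ordering by key does not matter, any member will do; pg2 prefers a
%% member on the local node when one exists.
cast_any(Group, Msg) ->
    gen_server:cast(pg2:get_closest_pid(Group), Msg).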

16
Parallelize
mnesia
Mostly async_dirty (sketch below)
Isolate records to 1 node/1 process via hashing
Each frag read/written on only 1 node
Multiple mnesia_tm: parallel replication streams
Multiple mnesia dirs: parallel i/o during dumps
Multiple mnesia "islands" (usu. 2 nodes/isle)
Better schema ops completion
Better load-time coordination
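A minimal sketch of the dirty, fragment-routed access pattern implied above; offline_msg is a hypothetical table, and the island layout and multiple mnesia_tm senders come from their patched OTP rather than anything stock mnesia exposes.

-module(mnesia_sketch).
-export([dirty_write/1, dirty_read/1]).

-record(offline_msg, {key, body}).

%% async_dirty skips transactions entirely; routing through mnesia_frag
%% hashes the key so each record is read/written on exactly one fragment.
dirty_write(Rec = #offline_msg{}) ->
    mnesia:activity(async_dirty, fun mnesia:write/1, [Rec], mnesia_frag).

dirty_read(Key) ->
    mnesia:activity(async_dirty, fun mnesia:read/1, [{offline_msg, Key}], mnesia_frag).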

17
Decouple
Avoid head-of-line blocking
Separate read & write queues
Separate inter-node queues
Avoid blocking when single node has problem
Node-to-node message forwarding (sketch below)
mnesia async_dirty replication
"Queuer" FIFO worker dispatch
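A conceptual sketch (all names hypothetical) of per-node forwarding as a decoupling tool: each remote node gets its own local forwarder process, so a stuck or unreachable node only backs up its own forwarder's queue instead of blocking senders headed elsewhere.

-module(forward_sketch).
-export([forward/2]).

forward(Node, Msg) ->
    forwarder_for(Node) ! {forward, Msg},
    ok.

%% One registered forwarder per destination node, created on demand.
forwarder_for(Node) ->
    Name = list_to_atom("fwd_" ++ atom_to_list(Node)),
    case whereis(Name) of
        undefined ->
            Pid = spawn(fun() -> loop(Node) end),
            try register(Name, Pid), Pid
            catch error:badarg -> exit(Pid, kill), whereis(Name)  %% lost the race
            end;
        Pid -> Pid
    end.

loop(Node) ->
    receive
        {forward, Msg} ->
            %% rcv_proxy is a hypothetical registered receiver on the remote
            %% node; nosuspend keeps a busy dist link from suspending us.
            _ = erlang:send({rcv_proxy, Node}, Msg, [nosuspend]),
            loop(Node)
    end.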

18
Optimize
Offline storage I/O bottleneck
I/O bottleneck writing to mailboxes
Most messages picked up very quickly
Add write-back cache with variable sync delay (sketch below)
Can absorb overloads via sync delay
pop/s  msgs/p  nonz%  ah%   xa%   syna  maxa   rd/s  push/s  wr/s
12694  5.9     24.7   78.3  98.7  21    51182  41    17035   10564
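A toy sketch of the write-back idea, not the production store: pushes land only in memory, pops usually drain a mailbox before it is ever written, and the periodic sync delay is the overload-absorbing knob. The module, message shapes, and flush_to_disk/1 are all made up, and a real pop would also consult the on-disk mailbox.

-module(writeback_sketch).
-behaviour(gen_server).
-export([start_link/1, push/2, pop/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(SyncDelayMs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, SyncDelayMs, []).

push(User, Msg) -> gen_server:cast(?MODULE, {push, User, Msg}).
pop(User)       -> gen_server:call(?MODULE, {pop, User}, 5000).

init(SyncDelayMs) ->
    erlang:send_after(SyncDelayMs, self(), sync),
    {ok, {SyncDelayMs, dict:new()}}.

%% Writes are buffered in memory only.
handle_cast({push, User, Msg}, {Delay, Dirty}) ->
    {noreply, {Delay, dict:append(User, Msg, Dirty)}}.

%% Most messages are popped before the next sync, so they never cost a write.
handle_call({pop, User}, _From, {Delay, Dirty}) ->
    Msgs = case dict:find(User, Dirty) of {ok, L} -> L; error -> [] end,
    {reply, Msgs, {Delay, dict:erase(User, Dirty)}}.

%% The sync delay is the knob: lengthening it absorbs write overload at the
%% cost of more unsynced data held in memory.
handle_info(sync, {Delay, Dirty}) ->
    ok = flush_to_disk(Dirty),              %% placeholder for the real store
    erlang:send_after(Delay, self(), sync),
    {noreply, {Delay, dict:new()}}.

flush_to_disk(_Dirty) -> ok.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.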

19
Optimize
Offline storage (recent improvements)
Fixed head-of-line blocking in async file i/o
(BEAM patch to enable round-robin async i/o)
More efficient handling of large mailboxes
Keep large mailboxes from polluting cache

20
Optimize
Overgrown SSL session cache
Slow connection setup
Lowered cache timeout (example below)
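If the cache in question were OTP's built-in ssl session cache, the timeout knob would be the ssl application's session_lifetime environment (in seconds); the deck does not say which TLS stack this actually was, so treat this purely as an illustration of lowering a session-cache timeout, with 600 an arbitrary example value.

%% e.g. in sys.config: cap cached TLS sessions at 10 minutes.
[{ssl, [{session_lifetime, 600}]}].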

21
Optimize
Slow access to mnesia table with lots of frags
Account table has 512 frags (sketch below)
Sparse mapping over islands/partitions
After adding hosts, throughput went down!
Unusually slow record access
On a hunch, looked at ets:info(stats)
Hash chains >2K (target is 7). Oops.
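For reference, a heavily fragmented disc_copies table of that shape is declared through mnesia's frag_properties; the table name, attributes, and node pool below are hypothetical, and the sparse island mapping itself was WhatsApp-specific.

%% 512 fragments, 2 disc copies each, spread over a given node pool.
create_account_table(NodePool) ->
    mnesia:create_table(account,
        [{attributes, [key, value]},
         {frag_properties, [{n_fragments,   512},
                            {n_disc_copies, 2},
                            {node_pool,     NodePool}]}]).

The chain-length surprise on this slide is what the seeded make_hash change on the next slide addresses.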

22
Optimize
mnesia frags (cont.)
Small percentage of hash buckets being used
ets uses average chain length to trigger split
#define MAX_HASH      0xEFFFFFFFUL
#define INVALID_HASH  0xFFFFFFFFUL
+#define HASH_INITVAL  33554467UL

/* optimised version of make_hash (normal case: atomic key) */
#define MAKE_HASH(term) \
    ((is_atom(term) ? (atom_tab(atom_val(term))->slot.bucket.hvalue) : \
-      make_hash2(term)) % MAX_HASH)
+      make_hash2_init(term, HASH_INITVAL)) % MAX_HASH)

23
Patch
FreeBSD 9.2
No more patches
Config for large network & RAM

24
Patch
Our original BEAM/OTP config/patches
Allocator config (for best superpage fit)
Real-time OS scheduler priority
Optimized timeofday delivery
Increased bif timer hash width
Improved check_io allocation scalability
Optimized prim_inet / inet accepts
Larger dist receive buffer

25
Patch
Our original config/patches (cont.)
Add pg2 denormalized group member lists
Limit runq task stealing
Add send w/ prepend
Add port reuse for prim_file:write_file
Add gc throttling w/ large message queues

26
Patch
New patches (since EFSF 2012 talk)
Add multiple timer wheels
Workaround mnesia_tm selective receive
Add multiple mnesia_tm async_dirty senders
Add mark/set for prim_file commands
Load mnesia tables from nearby node

27
Patch
New patches (since EFSF 2012 talk) (cont.)
Add round-robin scheduling for async file i/o
Seed ets hash to break coincidence w/ phash2
Optimize ets main/name tables for scale
Don't queue mnesia dump if already dumping

28
Decouple
Meta-clustering
Limit size of any single cluster
Allow a cluster to span long distances
wandist: dist-like transport over gen_tcp (sketch below)
Mesh-connected functional groups of servers
Transparent routing layer just above pg2
Local pg2 members published to far-end
All messages are single-hop
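The real wandist is WhatsApp-internal; the sketch below only illustrates the idea of a per-cluster routing process owning one gen_tcp link, with a length-prefixed term_to_binary frame standing in for the real wire format. A far-end receiver (not shown) would decode each frame and hand it to its local pg2 members, so every message crosses exactly one hop.

-module(wandist_sketch).
-export([connect/3, send/3]).

%% Start a named router that owns the TCP link to one far-end cluster.
connect(ClusterName, Host, Port) ->
    Pid = spawn_link(fun() ->
              {ok, Sock} = gen_tcp:connect(Host, Port,
                               [binary, {packet, 4}, {active, false}]),
              loop(Sock)
          end),
    register(ClusterName, Pid),
    ok.

%% Route a message toward a pg2-style group in the far-end cluster.
send(ClusterName, Group, Msg) ->
    ClusterName ! {route, Group, Msg},
    ok.

loop(Sock) ->
    receive
        {route, Group, Msg} ->
            ok = gen_tcp:send(Sock, term_to_binary({deliver, Group, Msg})),
            loop(Sock)
    end.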

29
Meta-clustering
[Diagram: DC1 main cluster, DC2 main cluster, DC1 mms cluster, DC2 mms cluster, global clusters]

30
Topology
[Diagram: DC1 main cluster, DC1 and DC2 mms clusters, and the Acct cluster, spanning DC1 and DC2]

31
Routing
[Diagram: Cluster 1 and Cluster 2, each holding pg2 groups {lastf1}..{lastf4}; a service client and other cluster-local services; wandist layered over pg2]

32
Clearing the minefield
Generally able to detect/defuse scalability mines before they explode
Events which test the system
World events (esp. soccer)
Server failures (usu. RAM)
Network failures
Bad software pushes

33
Clearing the minefield
Not always successful: 2/22 outage
Began with back-end router glitch
Mass node disconnect/reconnect
Resulted in a novel unstable state
Unsuccessful in stabilizing cluster (esp. pg2)
Full stop & restart (first time in years)
Also uncovered an overly-coupled subsystem
Rolling out pg2 patch

34
Challenges
Db scaling, esp. MMS
Load time (~1M objects/sec)
Load failures (unrecoverable backlog)
Bottlenecked on disk write throughput (>700MB/s)
Patched a selective-receive issue, but more to go
Real-time cluster status & control at scale
A bunch of csshX windows no longer enough
Power-of-2 partitioning

35
Questions?
rr@whatsapp.com
@td_rr
GitHub: reedr/otp

36
Monitor/Measure
Per-node system metrics gathering
1-second and 1-minute polling
Pushed to Graphite for plotting (sketch below)
Per-node alerting script
OS limits (CPU, mem, network, disk)
BEAM (running, msgq backlog, sleepy scheds)
App-level metrics
Pushed to Graphite
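A minimal sketch of the Graphite push: the plaintext listener takes "name value unix_timestamp" lines, conventionally on TCP port 2003. The metric name and the one-shot connection here are illustrative only.

-module(graphite_sketch).
-export([push_metric/3]).

%% push_metric("graphite.example.com", "chat.node1.msgq_backlog", 12345)
push_metric(Host, Name, Value) ->
    {Mega, Sec, _} = os:timestamp(),
    Line = io_lib:format("~s ~w ~w~n", [Name, Value, Mega * 1000000 + Sec]),
    {ok, Sock} = gen_tcp:connect(Host, 2003, [binary, {packet, raw}]),
    ok = gen_tcp:send(Sock, Line),
    gen_tcp:close(Sock).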

37
Monitor/Measure
Capacity plan against system limits
CPU util%, Mem util%, Disk full%, Disk busy%
Watch for process message queue backlog (sketch below)
Generally strive to remove all back pressure
Bottlenecks show as backlog
Alert on backlog > threshold (usu. 500k)
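A sketch of the backlog check itself, using standard process introspection; the threshold and how it feeds the alerting script are deployment details.

-module(backlog_sketch).
-export([backlogged/1]).

%% Return [{Pid, QueueLen}] for every local process whose mailbox exceeds
%% Threshold, e.g. backlog_sketch:backlogged(500000).
backlogged(Threshold) ->
    [{Pid, Len} || Pid <- erlang:processes(),
                   {message_queue_len, Len} <-
                       [erlang:process_info(Pid, message_queue_len)],
                   Len > Threshold].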

38-43
Monitor/Measure
[graph slides]

44
Input scaling
Logins
