
Inside the Social Network's (Datacenter) Network
FACEBOOK, INC.

CLOUD COMPUTING - CS 643-103

Overview
INTRODUCTION
Terminology
Datacenter topology
PROVISIONING
TRAFFIC ENGINEERING
SWITCHING
CONCLUSION


Introduction
This paper reports on the traffic observed in Facebook's datacenters.
While Facebook operates a number of traditional datacenter services like
Hadoop, its core Web service and supporting cache infrastructure exhibit a
number of behaviors that contrast with those reported in the literature.


Purpose of the paper


We report on the contrasting locality, stability, and predictability of network
traffic in Facebook's datacenters, and comment on their implications for
network architecture, traffic engineering, and switch design.


Datacenter Networks
1. Datacenters are changing the way we design our networks.
2. This is mainly because of the need to pack a large number of independent nodes into a small space, while traditional networks are designed for loosely coupled systems scattered across the globe.
3. While many aspects of network design depend on physical attributes, in the case of a datacenter we also need to know the amount of demand that will be placed on the network.

1. While we do know the physical aspects of datacenters (i.e., connecting tens of thousands of systems using 10-Gbps Ethernet links), the actual demand placed on them is not known and not disclosed publicly.
2. As a result, many studies are based on lightly validated assumptions, or on analyzing the workloads of a single server.


1. This paper studies sample workloads from Facebook's datacenters. We find that traffic studies in the literature are not entirely representative of Facebook's demands, calling into question the applicability of some of the proposals based upon these prevalent assumptions about datacenter traffic behavior.
2. Lacking concrete data, datacenter designs in the literature are often built for the worst case, namely an all-to-all traffic matrix in which each host communicates with every other host. Such designs provide maximum bisection bandwidth, which is overkill when demand exhibits strong locality.


If demand can be predicted and/or remains stable over reasonable time periods, it may be feasible to provide circuit-like connectivity between portions of the datacenter. Alternatively, network controllers could select among existing paths in an intelligent fashion. Regardless of the technology involved, all of these techniques require traffic to be predictable over non-trivial time scales.

Network Improvement Suggestions
There have been straightforward proposals, such as decreasing buffering or port count or changing the various layers of the switching fabric, as well as more complex suggestions, such as using circuit-based switches instead of packet-based switches.

Many such studies have been conducted by universities and private datacenter operators. Almost all previous studies at large scale (10K hosts or larger) consider Microsoft datacenters. While there are similarities between Microsoft's and Facebook's datacenters, there are also critical distinctions; we describe those differences and explain the reasons behind them.


This is the first study of production traffic in a datacenter with hundreds of thousands of 10-Gbps nodes. Findings include:
- Traffic is neither rack-local nor all-to-all; locality depends on the service but is stable across time periods from seconds to days.
- Many flows are long-lived but not heavy. Load balancing distributes traffic across hosts so effectively that even at sub-second intervals traffic is quite stable; as a result, heavy hitters in one interval rarely remain heavy hitters over time.
- Packets are small and do not exhibit on/off arrival behavior. Servers communicate with hundreds of hosts and racks, but the majority of traffic is often destined to fewer than 10 racks at a time.


Terminology

Machine – Rack – Cluster


Terms used in the paper

FC (Fatcat aggregation switch)
CSW (Cluster Switch)
RSW (Rack Switch, i.e., top-of-rack switch)
CDF (Cumulative Distribution Function)
Clos (topology)
Fbflow
Cache (leader, follower)
MF (Multifeed)
SLB (Software Load Balancer)
Data site
Datacenter
Cluster
Rack
Locality
Heavy hitters


Related Work
Much of the prior literature reports on traffic from Microsoft datacenters.
Datacenters that generate responses to Web requests (e.g., mail, messenger) differ from those that run MapReduce-style computations on top of a distributed block store. Facebook's services fall into the first category.

The major findings of the prior studies are:
First, the majority of traffic is found to be rack-local, a consequence of the application patterns observed (around 86% is reported to be rack-local).
Second, traffic is frequently reported to be bursty and unstable across a variety of timescales. Studies have shown that traffic is unpredictable at timescales of 150 seconds or longer, while it becomes somewhat predictable over intervals of a few seconds; traffic engineering mechanisms therefore have to target that timescale.
Finally, previous studies report a bimodal packet-size distribution: packets are either large (approaching the MTU) or small (around TCP ACK size). Facebook's packets, by contrast, have a modest median size despite the 10-Gbps links, and while hosts maintain orders of magnitude more simultaneous connections, the traffic exchanged between any given pair of hosts is quite low.


Datacenter Topology


Facebook Datacenter Network Topology

Facebook's network consists of multiple datacenter sites and a backbone connecting these sites. Each datacenter site contains one or more datacenter buildings (henceforth simply datacenters), and each datacenter contains multiple clusters. A cluster is the unit of deployment in Facebook's datacenters. Each cluster uses the 3-tier topology depicted in the figure.

Machines are organized into racks and connected via 10-Gbps links to a top-of-rack switch (RSW); the number of machines per rack varies from cluster to cluster. Each RSW is in turn connected to aggregation switches called cluster switches (CSWs).


All racks served by the same set of cluster switches are said to be in the same cluster. Clusters may be homogeneous (e.g., cache clusters) or heterogeneous (e.g., Frontend clusters, which contain a mix of Web servers, load balancers, and cache servers).

CSWs are connected to each other via another layer of aggregation switches called Fatcat switches (FCs).
Finally, CSWs also connect to aggregation switches for intra-site (but inter-datacenter) traffic, and to datacenter routers for inter-site traffic.
The majority of Facebook's current datacenters employ this 4-post Clos design.


One distinctive aspect of Facebook's datacenters is that each machine typically has a single, precise role: Web servers serve Web traffic; MySQL servers store data; cache servers include leaders, which handle cache coherency, and followers, which serve most read requests; Hadoop servers handle offline analysis and data mining; and multifeed servers assemble news feeds. While there are a number of other roles, these are the focus of this study.


What Happens When a Request Is Made

When an HTTP query hits a Facebook datacenter, it arrives at a layer-4 software load balancer.
The query is redirected to a Web server, which is stateless and contains no user data.
Web servers fetch data from the cache tier; in the case of a cache miss, a cache server will then fetch the data from the database tier.
At the same time, the Web server may communicate with one or more backend services (e.g., multifeed) to fetch objects such as news stories and ads.
In contrast, Hadoop clusters perform offline analysis such as data mining.

Data Collection
Due to the scale of Facebook's datacenters, it is impractical to collect full traffic dumps.
This paper describes two ways of collecting data for analysis.


1. Fbflow
Fbflow is a production monitoring system that samples packet headers from Facebook's entire machine fleet. Its architecture comprises agents and taggers; taggers are inserted into iptables rules. The datasets considered in this paper are collected with a 1:30,000 sampling rate.
A user-level Fbflow agent process on each machine listens for the sampled headers emitted by the taggers and parses them, extracting information such as source and destination IP addresses and ports. Each sample is then converted to a JSON object and sent to Scuba, a real-time data analytics system, and simultaneously stored for longer-term analysis.
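As a rough sketch of the kind of record a sampling-based monitor like Fbflow could emit (hypothetical field names and helper, not Facebook's actual implementation), each sampled packet header can be reduced to a small JSON object, with byte counts scaled up by the 1:30,000 sampling rate:

```python
import json
import time

SAMPLING_RATE = 30000  # 1:30,000, the rate used for the datasets in the paper

def make_flow_sample(src_ip, dst_ip, src_port, dst_port, proto,
                     pkt_len, host, rack, cluster):
    """Build one JSON record for a sampled packet header (hypothetical schema)."""
    return json.dumps({
        "ts": time.time(),
        "src_ip": src_ip, "dst_ip": dst_ip,
        "src_port": src_port, "dst_port": dst_port,
        "proto": proto,
        # Each sampled packet stands in for roughly SAMPLING_RATE packets,
        # so downstream aggregation scales observed bytes by that factor.
        "bytes_est": pkt_len * SAMPLING_RATE,
        "host": host, "rack": rack, "cluster": cluster,
    })

# Example: a single sampled 175-byte packet from a made-up cache follower.
print(make_flow_sample("10.0.1.5", "10.0.2.9", 11211, 44822, "tcp",
                       175, "cache-follower-01", "rack-17", "cluster-A"))
```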


2. Port Mirroring
While Fbflow is a powerful network monitoring tool, it is sampling-based.
We therefore deploy monitoring hosts in five different racks across Facebook's datacenter network, locating them in clusters that host distinct services. In particular, we monitor a rack of Web servers, a Hadoop node, a rack of cache followers, and so on. We collect packet traces by turning on port mirroring on the RSW (top-of-rack switch); in all but the Web-server instance, a single host's traffic is mirrored.
For Web servers, traffic is low enough that we are able to mirror traffic from the whole rack of servers to the collection host.
We do not use tcpdump, because it cannot handle more than roughly 1.5 Gbps of traffic.
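For illustration, a mirrored packet trace could be reduced to per-second link utilization along these lines (a minimal sketch assuming the third-party dpkt pcap reader and a 10-Gbps link; not the authors' actual tooling):

```python
from collections import defaultdict

import dpkt  # third-party pcap parser; any equivalent reader would do

LINK_BPS = 10e9  # 10-Gbps access link

def per_second_utilization(pcap_path):
    """Bin captured bytes into 1-second buckets and convert to link utilization."""
    bytes_per_sec = defaultdict(int)
    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            bytes_per_sec[int(ts)] += len(buf)
    return {sec: count * 8 / LINK_BPS for sec, count in sorted(bytes_per_sec.items())}

# Usage with a hypothetical trace file:
# for sec, util in per_second_utilization("webserver_rack.pcap").items():
#     print(sec, f"{util:.2%}")
```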


PROVISIONING


1. Facebook recently transitioned to 10-Gbps Ethernet across all of their hosts.
2. Utilization of the links between hosts and their RSWs is less than 1%.
3. Even the most loaded links are lightly loaded over 1-minute timescales: 99% of all links are typically less than 10% loaded. The most heavily loaded clusters (Hadoop) are roughly 5 times as loaded as lightly loaded ones (Frontend).
4. On links between RSWs and CSWs, median utilization varies between 10-20% across clusters, with the busiest 5% of links seeing 23-46% utilization.


Locality and Stability

The figure shows a Hadoop server within a Hadoop cluster, a Web server in a Frontend cluster, and both a cache follower and a cache leader from within the same cache cluster.
For each server, each second of traffic is represented as a stacked bar, with rack-local traffic in cyan.
Among the four types, Hadoop shows the most diversity across both servers and time: some traces show periods of significant network activity while others do not.
For Hadoop, in one ten-minute trace, 99.8% of traffic is destined to other Hadoop servers and 75.7% of traffic is destined to servers in the same rack.
For Web servers, nearly 70% of traffic stays within the cluster, 80% of which is destined to cache systems; the multifeed systems and the SLBs get roughly 8% each.
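A locality breakdown like the one above amounts to labeling each flow's bytes by how far they travel. A minimal sketch, assuming hypothetical host-to-rack/cluster/datacenter lookup tables:

```python
from collections import Counter

def locality_breakdown(flows, rack_of, cluster_of, dc_of):
    """Split a server's bytes into rack-, cluster-, datacenter-local and inter-DC shares.

    flows: iterable of (src_host, dst_host, byte_count) tuples.
    rack_of / cluster_of / dc_of: host -> rack / cluster / datacenter lookups.
    """
    totals = Counter()
    for src, dst, nbytes in flows:
        if rack_of[src] == rack_of[dst]:
            totals["intra-rack"] += nbytes
        elif cluster_of[src] == cluster_of[dst]:
            totals["intra-cluster"] += nbytes
        elif dc_of[src] == dc_of[dst]:
            totals["intra-datacenter"] += nbytes
        else:
            totals["inter-datacenter"] += nbytes
    grand_total = sum(totals.values()) or 1
    return {kind: share / grand_total for kind, share in totals.items()}

# Toy example: a Hadoop-like server whose traffic is mostly rack-local.
rack = {"h1": "r1", "h2": "r1", "h3": "r2"}
cluster = {"h1": "c1", "h2": "c1", "h3": "c1"}
dc = {"h1": "d1", "h2": "d1", "h3": "d1"}
print(locality_breakdown([("h1", "h2", 900), ("h1", "h3", 100)], rack, cluster, dc))
```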


Cache systems, depending on their type, see markedly different locality. Frontend cache followers primarily send traffic in the form of responses to Web servers and thus generate mostly intra-cluster traffic; due to load balancing, this traffic is spread widely.
During a 2-minute interval, a cache follower communicates with 75% of the hosts in the cluster and 90% of the Web servers.
Cache leaders maintain coherency across clusters and the backing databases, engaging primarily in intra- and inter-datacenter traffic.


Implications
The low utilization levels found at the edge of the network reinforce the common practice of oversubscribing the aggregation and core of the network, although it remains to be seen whether utilization will creep up as the datacenters age.


TRAFFIC ENGINEERING
ANALYZE FACEBOOK'S TRAFFIC AT FINE TIMESCALES, WITH AN EYE TOWARDS UNDERSTANDING HOW APPLICABLE VARIOUS TRAFFIC ENGINEERING AND LOAD BALANCING APPROACHES MIGHT BE UNDER SUCH CONDITIONS.


Flow characteristics
Most flows in Facebook's Hadoop cluster are short, and the traffic demands of Hadoop vary substantially across nodes and time.
Even in the graphed interval, 70% of flows send less than 10 KB and last less than 10 seconds.
The median flow sends less than 1 KB and lasts less than a second. Less than 5% of flows are larger than 1 MB or last longer than 100 seconds; almost none exceed our 10-minute trace.
Facebook's other services look quite different, however, because they use some form of connection pooling (especially the cache followers), aided by load balancing. Regardless of flow size or length, flows tend to be internally bursty.
Considering higher levels of aggregation, by destination host or rack, the distribution of flow sizes for Web servers simply shifts to the right, retaining its original shape (see the sketch below).
For the cache follower, however, the result is a very tight distribution around 1 MB per host. This is a consequence of load balancing user requests across all Web servers, combined with the large number of user requests.
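To make the aggregation step concrete, the following sketch rolls 5-tuple flow sizes up to per-destination-host or per-rack totals before a CDF would be computed (record fields and the toy numbers are assumptions for illustration only):

```python
from collections import defaultdict

def aggregate_flow_sizes(flows, key="dst_host"):
    """Sum flow bytes per destination host or rack.

    flows: iterable of dicts with 'dst_host', 'dst_rack' and 'bytes' keys.
    Returns the sorted aggregate sizes, i.e. the data behind a CDF.
    """
    totals = defaultdict(int)
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return sorted(totals.values())

# With many small pooled flows per destination, per-host (and per-rack) totals
# bunch together even though individual flows are tiny.
flows = [{"dst_host": f"web{i % 3}", "dst_rack": f"rack{i % 2}", "bytes": 300}
         for i in range(30)]
print(aggregate_flow_sizes(flows, key="dst_host"))  # [3000, 3000, 3000]
print(aggregate_flow_sizes(flows, key="dst_rack"))  # [4500, 4500]
```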

Load balancing
We need to determine how variable the traffic is: traffic that is already regular and well balanced gives us little opportunity to optimize.
We examine the distribution of flow rates, aggregated by destination rack, for each second over a two-minute period, and compare each second to the next. Intuitively, the better the load balancing, the closer one second appears to the next (a sketch of this comparison follows below).
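One possible way to make "compare each second to the next" concrete is sketched here: compute per-destination-rack rates for every second and measure how much consecutive seconds differ. The paper compares CDFs; this simpler drift metric and the record format are assumptions.

```python
from collections import defaultdict

def per_second_rack_rates(samples):
    """samples: iterable of (timestamp_sec, dst_rack, byte_count) tuples.
    Returns {second: {rack: bits sent in that second}}."""
    rates = defaultdict(lambda: defaultdict(int))
    for ts, rack, nbytes in samples:
        rates[int(ts)][rack] += nbytes * 8
    return rates

def drift(rates_a, rates_b):
    """Mean relative change in per-rack rate between two consecutive seconds.
    Values near zero mean one second looks almost exactly like the next."""
    racks = set(rates_a) | set(rates_b)
    changes = [abs(rates_a.get(r, 0) - rates_b.get(r, 0))
               / max(rates_a.get(r, 0), rates_b.get(r, 0), 1) for r in racks]
    return sum(changes) / len(changes) if changes else 0.0

rates = per_second_rack_rates([(0, "rackA", 1000), (0, "rackB", 1100),
                               (1, "rackA", 1050), (1, "rackB", 1080)])
print(f"second-to-second drift: {drift(rates[0], rates[1]):.1%}")
```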


Contd.
Hadoop has periods of varying network traffic, and the production cluster is likely to see a myriad of jobs of varying sizes.
It is this variability of traffic that existing network traffic engineering schemes seek to leverage.
Orchestra relies on temporal and per-job variation to provide lower task completion times for high-priority tasks, while Hedera provides non-interfering route placement for high-bandwidth elephant flows that last for several seconds, which are prevalent within Hadoop workloads.
A different story emerges for Frontend traffic, and the cache in particular.
The distributions for each of the 120 seconds are similar, and all are relatively tight, i.e., the CDFs are fairly vertical about the median of approximately 2 Mbps.
However, this analysis does not take per-destination variation into consideration. It is conceivable that there could exist consistently high- or low-rate destinations that could potentially be treated differently by a traffic engineering scheme.
It is observed, though, that per-destination-rack flow sizes are remarkably stable across not only seconds but intervals as long as 10 seconds (not shown).

Heavy Hitters
Heavy hitters are the set of flows (or hosts/racks) responsible for 50% of the observed traffic volume over a fixed time period (see the sketch after this list).
Heavy-hitter persistence is low for individual flows (red): in the median case, no more than roughly 15% of flows persist as heavy hitters regardless of the length of the period, a consequence of the internal burstiness of flows noted earlier.
Host-level aggregation (green) fares little better; with the exception of destination-host-level aggregation for Web servers, no more than 20% of heavy-hitter hosts in a sub-second interval will persist as heavy hitters in the next interval.
It is not until considering rack-level flows (blue) that heavy hitters are particularly stable. In the median case, over 40% of cache heavy hitters persist into the next 100-ms interval, and almost 60% for Web servers.
Even so, heavy-hitter persistence is not particularly favorable for traffic engineering. With a close-to-50% chance of a given heavy hitter continuing into the next time period, predicting a heavy hitter by observation is not much more effective than randomly guessing.
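A minimal sketch of the heavy-hitter definition and the persistence check described above (the aggregation key and the toy byte counts are assumptions):

```python
def heavy_hitters(volume_by_key, fraction=0.5):
    """Smallest set of keys (flows, hosts, or racks) covering `fraction` of total bytes."""
    total = sum(volume_by_key.values())
    covered, hitters = 0, set()
    for key, vol in sorted(volume_by_key.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= fraction * total:
            break
        hitters.add(key)
        covered += vol
    return hitters

def persistence(interval_a, interval_b):
    """Fraction of interval A's heavy hitters that are still heavy hitters in interval B."""
    hh_a, hh_b = heavy_hitters(interval_a), heavy_hitters(interval_b)
    return len(hh_a & hh_b) / len(hh_a) if hh_a else 0.0

# Two consecutive intervals of per-rack byte counts: rackA stays heavy, rackB does not.
t0 = {"rackA": 400, "rackB": 350, "rackC": 250}
t1 = {"rackA": 450, "rackB": 100, "rackC": 450}
print(persistence(t0, t1))  # 0.5
```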


Even if one could perfectly predict the heavy hitters on a second-by-second timescale, it remains to consider how useful that knowledge would be.
First, it establishes an upper bound on the effectiveness of traffic engineering: a significant amount of short-lived heavy-hitter traffic would go unseen and untreated by the TE scheme.
Second, it serves as an indicator of the difficulty of prediction in the first place. If a one-second prediction interval is not sufficient, smaller timescales (consuming more resources) may be needed.
Finally, this metric is an indicator of burstiness, as it indicates the presence of a large number of heavy hitters lasting for a very short time.


The adjacent figure plots a CDF of the fraction of a second's overall heavy hitters that are instantaneously heavy in each 1/10/100-ms interval within that second.
Results are shown for a Web server and a cache follower; cache leaders are similar.
At 5-tuple granularity, predictive power is quite poor, at less than 10-15%.
Rack-level predictions are much more effective, with heavy hitters remaining heavy in the majority of 100-ms intervals in the median case for both services.
Host-level predictions are more useful for Web servers than for cache nodes, but even then only the 100-ms case is particularly effective.

Implications for traffic engineering

Facebook's services make extensive use of connection pooling, so their long-lived flows seem like potential candidates for traffic engineering. However, the same services use application-level load balancing to great effect, leaving limited room for in-network approaches.
Many existing techniques work by identifying heavy hitters and then treating them differently (e.g., provisioning a circuit, moving them to a lightly loaded path, or employing alternate buffering strategies).
Unfortunately, it appears challenging to identify heavy hitters that persist with any frequency in a number of Facebook's clusters.
Previous work has suggested traffic engineering schemes can be effective if 35% of traffic is predictable; only rack-level heavy hitters reach that level of predictability for either Web or cache servers.
This somewhat counter-intuitive situation results from a combination of effective load balancing (which means there is little difference in size between a heavy hitter and the median flow) and the relatively low long-term throughput of most flows, meaning even small variations in demand can change which flows qualify as heavy hitters.

SWITCHING
WE STUDY ASPECTS OF THE TRAFFIC THAT BEAR DIRECTLY ON TOP-OF-RACK SWITCH DESIGN. IN PARTICULAR, WE CONSIDER THE SIZE AND ARRIVAL PROCESSES OF PACKETS, AND THE NUMBER OF CONCURRENT DESTINATIONS FOR ANY PARTICULAR END HOST. IN ADDITION, WE EXAMINE THE IMPACT OF BURSTINESS OVER SHORT TIMESCALES AND ITS IMPACT ON SWITCH BUFFERING.

Per-packet features
The overall packet size is roughly 250 bytes, skewed upward by Hadoop packets, which use the full MTU (1500 bytes here). Other services have a much wider distribution, but their median size is well under 200 bytes.
While utilization is low, the packet rate is high. For example, a cache server at 10% link utilization with a median packet size of roughly 175 bytes generates 85% of the packet rate of a fully utilized link sending MTU-sized packets.
As a result, any per-packet operation (e.g., VLAN encapsulation) may still be stressed in a way that the raw link utilization might not suggest at first glance.
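That packet-rate claim follows from back-of-the-envelope arithmetic, sketched here (ignoring framing overhead):

```python
LINK_BPS = 10e9            # 10-Gbps link
MTU_BYTES = 1500
CACHE_PKT_BYTES = 175      # rough median cache packet size from the trace
UTILIZATION = 0.10

cache_pps = UTILIZATION * LINK_BPS / (CACHE_PKT_BYTES * 8)  # ~714,000 packets/s
full_mtu_pps = LINK_BPS / (MTU_BYTES * 8)                   # ~833,000 packets/s

print(f"cache server at 10% utilization: {cache_pps:,.0f} pkt/s")
print(f"saturated link, MTU packets:     {full_mtu_pps:,.0f} pkt/s")
print(f"ratio: {cache_pps / full_mtu_pps:.0%}")  # ~86%, matching the ~85% figure above
```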


Arrival patterns
In contrast to previously studied datacenters, packet arrivals at Facebook do not exhibit an on/off pattern at the end-host level.
(Fig. 14) While a significant amount of traffic is routed over long-lived pooled connections, as is the case for request-response traffic between Web servers and cache followers, ephemeral flows do exist.
Hadoop nodes and Web servers see an order-of-magnitude increase in flow intensity relative to previous reports, likely due at least in part to the 10x increase in link rate, with median inter-arrival times of approximately 2 ms.
Perhaps due to connection pooling (which decouples the arrival of external user requests from the internal flow-arrival rate), the distributions of inter-arrival times for flows at both types of cache node are similar to each other and longer: cache leaders see a slightly higher arrival rate than followers, with median inter-arrival times of 3 and 8 ms, respectively.

Buffer Utilization
The combination of a lack of on/off traffic, higher flow intensity, and bursty individual flows suggests a potential increase in buffer utilization and overruns.
The first observation is that standing buffer occupancies are non-trivial, and can be quite high in the Web-server case.
Even though link utilization is on the order of 1% most of the time, over two-thirds of the available shared buffer is utilized during each 10-s interval.
Daily variation exists in buffer occupancy, utilization, and drop rate, highlighting the correlation between these metrics over time. Even with the daily traffic pattern, however, the maximum buffer occupancy in the Web-server rack approaches the configured limit for roughly three quarters of the 24-hour period.
While link utilization is roughly correlated with buffer occupancy within the Web-server rack, utilization by itself is not a good predictor of buffer requirements across different applications. In particular, the Cache rack has higher link utilization, but much lower buffer utilization and drop rates.
These buffer utilization levels occur despite relatively small packet sizes. As utilization increases in the future, it might be through an increase in the number of flows, the size of packets, or both; either will impact buffer utilization.


Concurrent flows
Web servers and cache hosts have hundreds to thousands of concurrent connections (at the 5-tuple level), while Hadoop nodes have approximately 25 concurrent connections on average.
If we group connections destined to the same host, the numbers decrease only slightly (by at most a factor of two), and not at all in the case of Hadoop.
Cache followers communicate with 225-300 different racks, while leaders talk to 175-350.
The median number of heavy-hitter racks is 6-8 for Web servers and cache leaders, with an effective maximum of 20-30, while the cache follower has 29 heavy-hitter racks in the median case and up to 50 in the tail.
Due to the differences in locality, Web servers and cache followers have very few rack-level heavy hitters outside of their cluster, while the cache leader displays the opposite pattern.
The relative impermanence of these heavy hitters suggests that, for Frontend clusters at least, hybrid circuit-based approaches may be challenging to employ.


CONCLUSION
FACEBOOK'S DATACENTER NETWORK SUPPORTS A VARIETY OF DISTINCT SERVICES THAT EXHIBIT DIFFERENT TRAFFIC PATTERNS, SEVERAL OF WHICH DEVIATE SUBSTANTIALLY FROM THE SERVICES CONSIDERED IN THE LITERATURE.

The different applications, combined with the scale (hundreds of thousands of nodes) and speed (10-Gbps edge links) of Facebook's datacenter network, result in workloads that contrast in a number of ways with most previously published datasets.
Several features were described that may have implications for topology, traffic engineering, and top-of-rack switch design.
While Fbflow is deployed datacenter-wide, the sheer amount of measurement data it provides presents another challenge, specifically one of data processing and retention, which limits the resolution at which it can operate.
Hence, we conclude that effective network monitoring and analysis will be an ongoing and constantly evolving problem.


Questions??

Thank You