
The Secret To 10 Million Concurrent Connections - The Kernel Is The Problem, Not The Solution

MONDAY, MAY 13, 2013 AT 8:30AM

Now that we have the C10K concurrent connection problem licked, how do we level up and
support 10 million concurrent connections? Impossible, you say? Nope, systems right now
are delivering 10 million concurrent connections using techniques that are as radical as they
may be unfamiliar.

To learn how it's done we turn to Robert Graham, CEO of Errata Security, and his absolutely
fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I've never heard before. He starts
with a little bit of history, relating how Unix wasn't originally designed to be a general server
OS, it was designed to be a control system for a telephone network. It was the telephone
network that actually transported the data, so there was a clean separation between the
control plane and the data plane. The problem is we now use Unix servers as part of the
data plane, which we shouldn't do at all. If we were designing a kernel for handling one
application per server we would design it very differently than for a multi-user kernel.

Which is why he says the key is to understand:

The kernel isn't the solution. The kernel is the problem.

Which means:

Don't let the kernel do all the heavy lifting. Take packet handling, memory
management, and processor scheduling out of the kernel and put them into the application,
where they can be done efficiently. Let Linux handle the control plane and let the
application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock
cycles for packet handling and 1,400 clock cycles for application logic. As a main
memory access costs 300 clock cycles, it's key to design in a way that minimizes code and
cache misses.
With a data plane oriented system you can process 10 million packets per second. With a
control plane oriented system you only get 1 million packets per second.

If this seems extreme, keep in mind the old saying: scalability is specialization. To do
something great you can't outsource performance to the OS. You have to do it yourself.

Now, let's learn how Robert creates a system capable of handling 10 million concurrent
connections...

C10K Problem - So Last Decade


A decade ago engineers tackled the C10K scalability problem that prevented servers from
handling more than 10,000 concurrent connections. The problem was solved by fixing OS
kernels and moving away from threaded servers like Apache to event-driven servers like
Nginx and Node. That process has taken a decade as people moved away from Apache to
scalable servers, and in the last few years we've seen faster adoption of scalable servers.

The Apache Problem


The Apache problem is that the more connections there are, the worse the performance gets.

Key insight: performance and scalability are orthogonal concepts. They don't
mean the same thing. When people talk about scale they are often talking about
performance, but there's a difference between scale and performance, as we'll see with
Apache.

With short-term connections that last a few seconds, say a quick transaction, if you
are executing 1,000 TPS then you'll only have about 1,000 concurrent connections to the
server.

Change the length of the transactions to 10 seconds; now at 1,000 TPS you'll have
10K connections open. Apache's performance drops off a cliff, though, which opens you to
DoS attacks. Just do a lot of downloads and Apache falls over.

If you are handling 5,000 connections per second and you want to handle 10K, what
do you do? Let's say you upgrade hardware and double the processor speed. What
happens? You get double the performance but you don't get double the scale. The scale
may only go to 6K connections per second. The same thing happens if you keep on doubling.
16x the performance is great but you still haven't gotten to 10K connections. Performance is
not the same as scalability.

The problem was Apache would fork a CGI process and then kill it. This didn't scale.

Why? Servers could not handle 10K concurrent connections because of O(n^2)
algorithms used in the kernel.

Two basic problems in the kernel:

- Connection = thread/process. As a packet came in, the kernel walked down all 10K processes to figure out which thread should handle the packet.

- Connections = select/poll (single thread). Same scalability problem: each packet had to walk a list of sockets.

Solution: fix the kernel to make lookups constant time.

- Threads now context switch in constant time regardless of the number of threads.

- A new scalable epoll()/IOCompletionPort provides constant-time socket lookup.

Thread scheduling still didn't scale, so servers scaled using epoll with sockets, which
led to the asynchronous programming model embodied in Node and Nginx. This shifted
software to a different performance graph. Even with a slower server, when you add more
connections the performance doesn't drop off a cliff. At 10K connections a laptop is even
faster than a 16-core server.
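To make that concrete, here is a minimal sketch (mine, not from the talk) of what a single-threaded epoll event loop looks like; the listening-socket setup and the actual request handling are left out, and the names are illustrative:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

/* Single-threaded event loop: the kernel reports ready sockets in O(1)
 * per event instead of the application scanning every socket. */
void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    char buf[4096];

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* New connection: register it for readiness notifications. */
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {
                /* The socket is known to be readable, so read() won't block. */
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) { close(fd); continue; }
                /* ... do the application-level packet scheduling here ... */
            }
        }
    }
}
```

The point is that epoll_wait() returns only the sockets that are ready, so the per-event cost no longer grows with the number of open connections the way a select()/poll() scan does.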

The C10M Problem - The Next Decade


In the very near future servers will need to handle millions of concurrent connections. With
IPv6 the number of potential connections from each server is in the millions, so we need to
go to the next level of scalability.
Examples of applications that will need this sort of scalability: IDS/IPS, because they
connect to a server backbone. Other examples: DNS root server, TOR node, Nmap of the
Internet, video streaming, banking, carrier NAT, VoIP PBX, load balancer, web cache,
firewall, email receive, spam filtering.

Often the products that tackle Internet-scale problems are appliances rather than servers,
because vendors are selling hardware plus software. You buy the device and insert it into your
datacenter. These devices may contain an Intel motherboard or network processors and
specialized chips for encryption, packet inspection, etc.

X86 prices on Newegg as of Feb 2013: $5K buys 40 Gbps, 32 cores, and 256 GB of RAM.
These servers can do more than 10K connections. If they can't, it's because you've made bad
choices with software. It's not the underlying hardware that's the issue. This hardware can
easily scale to 10 million concurrent connections.

What The 10M Concurrent Connection Challenge Means:


1. 10 million concurrent connections

2. 1 million connections/second - a sustained rate at about 10 seconds per connection

3. 10 gigabits/second connection - fast connections to the Internet

4. 10 million packets/second - expect current servers to handle 50K packets per second; this is going to a higher level. Servers used to be able to handle 100K interrupts per second, and every packet caused an interrupt.

5. 10 microsecond latency - scalable servers might handle the scale, but latency would spike

6. 10 microsecond jitter - limit the maximum latency

7. 10 coherent CPU cores - software should scale to larger numbers of cores. Typically
software only scales easily to four cores. Servers can scale to many more cores, so
software needs to be rewritten to support larger core machines.

We've Learned Unix, Not Network Programming


A generation of programmers has learned network programming by reading Unix
Network Programming by W. Richard Stevens. The problem is the book is about Unix,
not just network programming. It tells you to let Unix do all the heavy lifting while you just
write a small little server on top of Unix. But the kernel doesn't scale. The solution is to move
outside the kernel and do all the heavy lifting yourself.

An example of the impact of this is Apache's thread-per-connection model.
What this means is that the thread scheduler determines which read() to call next,
depending on which data arrives. You are using the thread scheduling system as the
packet scheduling system. (I really like this; I had never thought of it that way before.)

What Nginx says is: don't use thread scheduling as the packet scheduler. Do the
packet scheduling yourself. Use select to find the socket; we know it has data so we can
read it immediately and it won't block, and then process the data.

Lesson: Let Unix handle the network stack, but you handle everything from that
point on.

How Do You Write Software That Scales?


How do you change your software to make it scale? A lot of our rules of thumb about
how much hardware can handle are false. We need to know what the performance capabilities
actually are.

To go to the next level the problems we need to solve are:

1. packet scalability

2. multi-core scalability

3. memory scalability

Packet Scaling - Write Your Own Custom Driver To Bypass The Stack
The problem with packets is they go through the Unix kernel. The network stack is
complicated and slow. The path of packets to your application needs to be more direct.
Don't let the OS handle the packets.

The way to do this is to write your own driver. All the driver does is send the packet to
your application instead of through the stack. Drivers you can find: PF_RING, Netmap, Intel
DPDK (Data Plane Development Kit). The Intel one is closed source, but there's a lot of support
around it.

How fast? Intel has a benchmark where they process 80 million packets per second
(200 clock cycles per packet) on a fairly lightweight server. This is through user mode too:
the packet makes its way up to user mode and then back down again to go out. Linux
doesn't do more than a million packets per second when getting UDP packets up to user
mode and out again. Performance is 80-to-1 for a custom driver compared to Linux.
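As a rough illustration (this is not Intel's benchmark code), the receive path with a kernel-bypass framework like DPDK boils down to a busy-polling loop over the NIC's ring. The EAL, port, queue, and mbuf pool setup is omitted here, and handle_packet() is a placeholder for whatever your application does:

```c
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll-mode receive loop: packets move from the NIC ring straight into
 * user space -- no interrupts, no system calls, no kernel network stack.
 * Assumes port 0 / queue 0 and the mbuf pool were set up elsewhere. */
static void rx_loop(void (*handle_packet)(void *data, uint16_t len))
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Busy-poll the hardware ring for up to BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                          bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            handle_packet(rte_pktmbuf_mtod(bufs[i], void *),
                          rte_pktmbuf_data_len(bufs[i]));
            rte_pktmbuf_free(bufs[i]);  /* return the buffer to the pool */
        }
    }
}
```

There are no interrupts and no system calls in that loop; the packet goes from the NIC ring straight to application code.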

For the 10 million packets per second goal, if 200 clock cycles are used in getting the
packet, that leaves 1,400 clock cycles to implement functionality like a DNS server or IDS.

With PF_RING you get raw packets, so you have to do your own TCP stack. People are
doing user-mode stacks. For Intel there is an available TCP stack that offers really scalable
performance.

Multi-Core Scalability
Multi-core scalability is not the same thing as multi-threading scalability. We're all familiar
with the idea that processors are not getting faster, but we are getting more of them.

Most code doesn't scale past 4 cores. As we add more cores it's not just that performance
levels off; we can get slower and slower as we add more cores. That's because software is
written badly. We want software to scale nearly linearly as we add more cores, getting
faster with each core we add.

Multi-threading coding is not multi-core coding.

Multi-threading:

- More than one thread per CPU core

- Locks to coordinate threads (done via system calls)

- Each thread a different task

Multi-core:

- One thread per CPU core

- When two threads/cores access the same data they can't stop and wait for each other

- All threads part of the same task

Our problem is how to spread an application across many cores.

Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is
that most software starts waiting for other threads to give up a lock. So the kernel starts
eating up more performance than you gain from having more CPUs.

What we need is an architecture that is more like a freeway than an intersection
controlled by a stop light. We want no waiting, where everyone continues at their own pace
with as little overhead as possible.

Solutions:

- Keep data structures per core. Then on aggregation read all the counters.

- Atomics. Instructions supported by the CPU that can be called from C. Guaranteed to be atomic, never conflict. Expensive, so you don't want to use them for everything.

- Lock-free data structures. Accessible by threads that never stop and wait for each other. Don't do it yourself; it's very complex to make work across different architectures.

- Threading model. Pipelined vs. worker-thread model. It's not just synchronization that's the problem, but how your threads are architected.

- Processor affinity. Tell the OS to use the first two cores, then set which cores your threads run on. You can also do the same thing with interrupts. So you own these CPUs and Linux doesn't. (A per-core data plus affinity sketch follows this list.)
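Here is a hedged sketch of two of those ideas combined, per-core data structures plus processor affinity, using plain pthreads; the core count, counter layout, and names are illustrative, not from the talk:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NCORES 4

/* One cache-line-sized slot per core: no sharing, no locks, no waiting. */
struct percore { uint64_t packets; char pad[64 - sizeof(uint64_t)]; };
static struct percore counters[NCORES];

static void *worker(void *arg)
{
    long core = (long)arg;

    /* Pin this thread to its own core so the scheduler stays out of the way. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {
        /* Stand-in for pulling and processing this core's packets. */
        counters[core].packets++;   /* plain write: this core owns the slot */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NCORES];
    for (long i = 0; i < NCORES; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (;;) {                      /* aggregate only when totals are wanted */
        uint64_t total = 0;
        for (int i = 0; i < NCORES; i++)
            total += counters[i].packets;
        printf("packets: %llu\n", (unsigned long long)total);
        sleep(1);
    }
}
```

Each core owns its own counter slot, so the hot path never takes a lock or bounces a cache line, and the aggregation loop only pays the cost when totals are actually wanted.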

Memory Scalability
The problem is that if you have 20 GB of RAM and, let's say, use 2K per connection,
then with only a 20 MB L3 cache none of that data will be in cache. It costs 300 clock
cycles to go out to main memory, during which the CPU isn't doing anything.

Think about this with our 1,400 clock cycle budget per packet. Remember the 200
clocks/packet overhead. At 300 cycles per main memory access we can only afford about 4
cache misses per packet, and that's a problem.

Co-locate Data

Don't scribble data all over memory via pointers. Each time you follow a
pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App].
That's four cache misses.

Keep all the data together in one chunk of memory: [TCB | Socket | App].
Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.
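A small sketch of what that co-location looks like in C (the field names are invented for the example): instead of a hash entry pointing to a TCB pointing to a socket pointing to app state, one preallocated array holds a single cache-line-aligned record per connection:

```c
#include <stdint.h>
#include <stdlib.h>

/* Co-located layout: all per-connection state lives in one contiguous,
 * cache-line-aligned record in a preallocated array, so touching a
 * connection costs roughly one cache miss instead of four. */
struct connection {
    /* TCP control block */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    uint8_t  tcp_state;
    /* socket-level state */
    uint32_t rx_bytes, tx_bytes;
    /* application state */
    uint8_t  app_state;
    uint16_t app_cursor;
} __attribute__((aligned(64)));

#define MAX_CONNS (10u * 1000u * 1000u)

static struct connection *conn_table;

/* Prereserve every block up front; no malloc() on the packet path. */
int conn_table_init(void)
{
    conn_table = calloc(MAX_CONNS, sizeof(*conn_table));
    return conn_table ? 0 : -1;
}

/* One array index, one cache line (collision handling omitted for brevity). */
struct connection *conn_lookup(uint32_t hash)
{
    return &conn_table[hash % MAX_CONNS];
}
```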

Paging

The page tables for 32 GB of memory require 64 MB of page tables, which doesn't fit in
cache. So you have two cache misses: one for the page table and one for what it points
to. This is a detail we can't ignore for scalable software.

Solutions: compress data; use cache-efficient structures instead of a binary
search tree that has a lot of memory accesses.

NUMA architectures double the main memory access time. Memory may not
be on the local socket but on another socket.

Memory pools

- Preallocate all memory at once on startup.

- Allocate on a per-object, per-thread, and per-socket basis. (A minimal pool sketch follows below.)
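A minimal sketch of such a pool (illustrative, with no locking because it is meant to be used per thread): all memory is reserved once at startup and objects are recycled through a free list, so the packet path never calls malloc():

```c
#include <stddef.h>
#include <stdlib.h>

/* A fixed-size-object pool: memory is reserved once at startup and objects
 * are recycled through a free list threaded through the objects themselves.
 * One pool per thread means allocation needs no locks at all. */
struct pool {
    void  *free_list;
    size_t obj_size;
};

int pool_init(struct pool *p, size_t obj_size, size_t nobjs)
{
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    char *mem = malloc(obj_size * nobjs);   /* or carve out of a hugepage */
    if (!mem)
        return -1;

    p->obj_size  = obj_size;
    p->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {    /* thread objects onto the list */
        void **obj = (void **)(mem + i * obj_size);
        *obj = p->free_list;
        p->free_list = obj;
    }
    return 0;
}

void *pool_alloc(struct pool *p)            /* O(1), no system calls */
{
    void **obj = p->free_list;
    if (obj)
        p->free_list = *obj;
    return obj;
}

void pool_free(struct pool *p, void *obj)   /* O(1), no system calls */
{
    *(void **)obj = p->free_list;
    p->free_list = obj;
}
```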

Hyper-threading

- Network processors can run up to 4 threads per processor; Intel only has 2.

- This masks the latency, for example from memory accesses, because when one thread waits the other runs at full speed.

Hugepages

Reduces page table size. Reserve memory from the start and then your
application manages the memory.
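On Linux one way to do this is mmap() with MAP_HUGETLB, as in the sketch below. It assumes hugepages have been reserved beforehand (for example via /proc/sys/vm/nr_hugepages) and that the common 2 MB huge page size is in use:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

/* Reserve a hugepage-backed arena at startup; with 2 MB pages the page
 * tables stay small, so page-table walks stop costing extra cache misses.
 * The application then manages this arena itself (e.g. with memory pools). */
void *huge_arena_create(size_t bytes)
{
    size_t len = (bytes + HUGE_2MB - 1) & ~(HUGE_2MB - 1);   /* round up */
    void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* likely: no hugepages reserved */
        return NULL;
    }
    return arena;
}
```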

Summary
NIC

- Problem: going through the kernel doesn't work well.

- Solution: take the adapter away from the OS by using your own driver and manage it yourself.

CPU

- Problem: if you use traditional kernel methods to coordinate your application, it doesn't work well.

- Solution: give Linux the first two CPUs and have your application manage the remaining CPUs. No interrupts will happen on those CPUs unless you allow them.

Memory

- Problem: it takes special care to make memory work well.

- Solution: at system startup, allocate most of the memory in hugepages that you manage.
The control plane is left to Linux; the data plane gets nothing. The data plane runs in
application code. It never interacts with the kernel. There's no thread scheduling, no system
calls, no interrupts, nothing.

Yet what you have is code running on Linux that you can debug normally; it's not some sort
of weird hardware system that you need to custom engineer for. You get the performance of
custom hardware that you would expect for your data plane, but with your familiar
programming and development environment.

Related Articles
Read on Hacker News or Reddit, Hacker News Dos

Is It Time To Get Rid Of The Linux OS Model In The Cloud?

Machine VM + Cloud API - Rewriting The Cloud From Scratch

Exokernel

Blog on C10M

Multi-core scaling: it's not multi-threaded with some good comment action.

Intel DPDK: Data Plane Development Kit

Todd Hoff | 26 Comments | Permalink

in Example, Performance

Reader Comments (26)


This breakdown of bottlenecks is impressive, it is refreshing to read something this correct.

Now I want to play devil's advocate (mostly because I thoroughly agree w/ this guy). The solutions proposed
sound like customized hardware specific solutions, sound like a move back to the old days, when you could
not just put some fairly random hardware together, slap linux on top and go ... that will be the biggest backlash
to this, people fear appliance/vendor/driver lock-in, and the fear is a rational one.
What are the plans to make these very correct architectural practices available to the layman. Some sort of API
is needed, so individual hardware-stacks can code to it and this API must not be a heavy-weight abstraction.
This is a tough challenge.

Best of luck to the C10M movement, it is brilliant, and I would be a happy programmer if I can slap together a
system that does C10M sometime in the next few years

May 13, 2013 | Russell Sullivan

Great article! I've always found scale fascinating, in general. Postgres 9.2 added support (real support) for up
to 64 cores, which really tickled my fancy. It seems the industry switches back and forth between "just throw
more cheap servers at it... they're cheap" and "let's see how high we can stack this, because it'll be tough to
split it up". I prefer the latter, but a combination thereof, as well as the sort of optimizations you speak of (not
simply delegating something like massive-scale networking to the kernel) tends to move us onward... toward
the robocalypse :)

May 13, 2013 | Michael

Nice post. For memory scalability, you could also look at improving memory locality by controlling process
affinity with something like libnuma/numactl.

May 13, 2013 | B. Estrade

Very good summary about this session

May 13, 2013 | Vijayakumar Ramdoss

About DPDK, you can get the source code from http://dpdk.org ; it provides git, mailing list and good support.
It seems it is driven by 6WIND.

May 14, 2013 | Netengi

The universal scalability law (USL) is exhibited around 30:00 mins into the video presentation, despite his
statement that scalability and performance are unrelated. Note the performance maximum induced by the onset
of multicore coherency delays. Quantifying a similar effect for memcache was presented at Velocity 2010.

May 14, 2013 | Neil Gunther

This is very interesting. The title may be better as "The Linux Kernel is the Problem", as this is different for
other kernels. Just as an example, last time I checked, Linux took 29 stack frames to go from syscalls to
start_xmit(). The illumos/Solaris kernel takes 16 for the similar path (syscall to mac_tx()). FreeBSD took 10
(syscall to ether_output()). These can vary; check your kernel version and workload. I've included them as an
example of stack variance. This should also make the Linux stack more expensive -- but I'd need to analyze
that (cycle based) to quantify.
Memory access is indeed the enemy, and you talk about saving cache misses, but a lot of work has been done
in Linux (and other kernels) for both CPU affinity and memory locality. Is it not working, or not working well
enough? Is there a bug that can be fixed? It would be great to analyze this and root cause the issue. On Linux,
run perf and look at the kernel stacks for cache misses. Better, quantify kernel stacks in TCP/IP for memory
access stall cycles -- and see if they add up and explain the problem.

Right now I'm working on a kernel network performance issue (I do kernel engineering). I think I've found a
kernel bug that can improve (reduce) network stack latency by about 2x for the benchmarks I'm running, found
by doing root cause analysis of the issue.

Such wins won't change overall potential win of bypassing the stack altogether. But it would be best to do this
having understood, and root caused, what the kernel's limits were first.

Bypassing the kernel also means you may need to reinvent some of your toolset for perf analysis (I use a lot of
custom tools that work between syscalls and the device, for analysing TCP latency and dropped packets,
beyond what network sniffing can do).

May 14, 2013 | Brendan Gregg

Reminds me of Van Jacobson's talk at LCA2006. Talking about pushing work out to the endpoints. And the
kernel really isn't an endpoint (especially with SMP), the app is.

May 14, 2013 | Nick

Unix wasn't initially used for controlling telephone networks. Mostly a timesharing system for editing
documents (ASCII text with markup). This is from the BSTJ paper version of the CACM paper. The CACM
paper is earlier and slightly different.
http://cm.bell-labs.com/cm/cs/who/dmr/cacm.html.

Since PDP-11 Unix became operational in February, 1971, over 600 installations have been put into service.
Most of them are engaged in applications such as computer science education, the preparation and formatting
of documents and other textual material, the collection and processing of trouble data from various switching
machines within the Bell System, and recording and checking telephone service orders. Our own installation is
used mainly for research in operating systems, languages, computer networks, and other topics in computer
science, and also for document preparation.

May 14, 2013 | Mark

Give a look in EZchip NPS it implements most of the ideas above.

May 14, 2013 | noam camus


This is no big surprise: general purpose software like a kernel is appropriate for "standard" applications. To
reach the highest performance, custom, finely tuned software can always beat it.

May 14, 2013 | Yves Daoust

Doesn't Netmap address the packet I/O efficiency problem just as well ?

May 15, 2013 | Jean-Marc Liotier

Uhhhh, I hate to break it to you guys, but this problem, and its solutions, has long been known to the
exokernel community. While I admit that Linux and BSD "aren't Unix" in the trademark sense of the term,
they're still both predicated on an OS architecture that is no less than 60 years old (most of the stuff we take for
granted came, in some form or another, from Multics). It's time for a major update. :)

May 15, 2013 | Samuel A. Falvo II

@Russell Sullivan
Nothing about this article is for the layman.

May 16, 2013 | Chris

"Unix wasnt originally designed to be a general server OS, it was designed to be a control system for a
telephone network"

Uh, no, this is completely and utterly wrong. And talking about UNIX is irrelevant, since the Linux kernel was
developed independently.

Which is not to say that the technical arguments aren't right, but this sort of absurd nonsense doesn't help
credibility.

May 16, 2013 | mk

One problem is choke-points (whether shared data structures or mutexes, parallelizable or not) that exist in
user- and kernel- space. The exo-kernel and other approaches simply choose different ways of doing the same
thing by shifting the burden around (IP stack packetting functions). Ultimately, the hardware (real or virtual)
presents the most obvious, finite bottleneck on the solution.

At the ultra high end of other approaches that include silicon, http://tabula.com network-centric apps compiled
w/ app-specific OS frameworks coded in a functional style. This is promising not only for niche problems, but
looks like the deepest way of solving many of the traditional bottlenecks of temporal/spatial dataflow
problems. For 99% of solutions, it's probably wiser to start with vertically-scaled LMAX embedded systems
approaches first http://martinfowler.com/articles/lmax.html after having exhausted commercial gear.
Back at commercial scale, for Xen there are some OS-less runtimes:

Erlang - LING VM
Haskell - halvm

May 16, 2013 | Barry Allard

'Locks in Unix are implemented in the kernel' claim is not absolutely right for systems of today. We have light-
weight user-land locks as in http://en.wikipedia.org/wiki/Futex. Although I agree with the potential speed gains
of an application doing the network packet handling, Memory management (including Paging) and CPU
scheduling are hardly in the domain of an application. Those kind of things should be left to kernel developers
who have a better understanding of the underlying hardware...

May 17, 2013 | Malkocoglu

Interesting how often the architectural pattern of "separation of control and data" shows up. I gave a talk just
the other day to a room full of executives in which I pointed out how this pattern occurred (fractally as it turns
out) in the product we were developing. You can see it in the kinds of large scale mass storage systems I
worked at while at NCAR (control over IP links, but data over high speed I/O channels using specialized
hardware), over the kinds of large telecommunications systems I worked on while at Bell Labs (again,
"signaling" over IP links, but all "bearer" traffic over specialized switching fabrics using completely different
transport mechanisms), in VOIP systems (control using SIP and handled by the software, but RTP bridged as
much as possible directly between endpoints with no handling by the software stack), and even in embedded
systems (bearer over TDM busses directly from the A/D chips, but control via control messages over SPI serial
busses). This pattern has been around at least since the 1960s, and maybe in other, non-digital-control contexts,
even earlier. It's really a division of specialized labor idea, which may go back thousands of years.

May 18, 2013 | Chip Overclock

FYI: We've achieved 12 Million Concurrent Connection using a single 1U server running standard Linux (no
source code changes) while publishing data at more than 1Gbps. See more details at:

http://mrotaru.wordpress.com/2013/06/20/12-million-concurrent-connections-with-migratorydata-websocket-server/

But, I agree that Linux kernel should still offer substantial improvements as far as a high number of sockets is
concerned. It's good to see that the new versions of Linux kernel (starting with version 3.7) come with some
important improvements in terms of socket-related memory footprint. However, more optimization is
necessary and possible. As mentioned in my post, the memory footprint for 12 million concurrent connections
is about 36 GB. This could be certainly enhanced by the Linux kernel.
June 20, 2013 | Mihai Rotaru

By reading this article, I concluded that it's a new era for future OS design from scratch.

October 31, 2013 | Sampath Kumar

I tried PF_RING with lwIP but failed. I learned at IDF 2013 in Beijing that Wind River developed its own
userspace stack, which took 20 developers 2 years.

November 25, 2013 | rock3

This is great! I was doing it the wrong way, one thread per connection, for a school project that I intended to
use commercially later on.

June 25, 2014 | C99_Forever

This is great! I have implemented a Java Web server using nio and can achieve 10K+ connection on a 8GB
ram, 5 years old desktop computer - but 10M is insane. On linux the number of open sockets appears to be
limited by the ulimit - and not sure if 10M is even possible without kernel tweaks.. I guess for your approach
this isn't relevant.

July 10, 2014 | javaPhobic

Great post. Intel is hiring a Developer Evangelist for DPDK and Network Developing. If you know someone
interested in this space please pass on
http://jobs.intel.com/job/Santa-Clara-Networking-Developer-Evangelist-Job-CA-95050/77213300/

August 8, 2014 | Bob Duffy

"1400 hundred clock cycles"

What kind of dumb, ambiguous way of expressing a number is this? Are you deliberately trying to confuse
people? You should be using either numerals or words *consistently* throughout the article, let alone the same
figure.
