Now that we have the C10K concurrent connection problem licked, how do we level up and
support 10 million concurrent connections? Impossible, you say? Nope, systems right now
are delivering 10 million concurrent connections using techniques that are as radical as they
may be unfamiliar.
To learn how it's done we turn to Robert Graham, CEO of Errata Security, and his absolutely
fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.
Robert has a brilliant way of framing the problem that I've never heard before. He starts
with a little bit of history, relating how Unix wasn't originally designed to be a general server
OS, it was designed to be a control system for a telephone network. It was the telephone
network that actually transported the data, so there was a clean separation between the
control plane and the data plane. The problem is we now use Unix servers as part of the
data plane, which we shouldn't do at all. If we were designing a kernel to handle one
application per server, we would design it very differently than a multi-user kernel.
Which means:
Don't let the kernel do all the heavy lifting. Take packet handling, memory
management, and processor scheduling out of the kernel and put it into the application,
where it can be done efficiently. Let Linux handle the control plane and let the
application handle the data plane.
The result will be a system that can handle 10 million concurrent connections with 200 clock
cycles for packet handling and 1,400 clock cycles for application logic. As a main
memory access costs 300 clock cycles, it's key to design in a way that minimizes code and
cache misses.
With a data plane oriented system you can process 10 million packets per second. With a
control plane oriented system you only get 1 million packets per second.
If this seems extreme, keep in mind the old saying: scalability is specialization. To do
something great you can't outsource performance to the OS. You have to do it yourself.
Now, let's learn how Robert creates a system capable of handling 10 million concurrent
connections...
With short-term connections that last a few seconds, say a quick transaction, if you
are executing 1,000 TPS then you'll only have about 1,000 concurrent connections to the
server.
Change the length of the transactions to 10 seconds and now at 1,000 TPS you'll have
10K connections open. Apache's performance drops off a cliff, though, which opens you to
DoS attacks. Just do a lot of downloads and Apache falls over.
If you are handling 5,000 connections per second and you want to handle 10K, what
do you do? Let's say you upgrade hardware and double the processor speed. What
happens? You get double the performance but you don't get double the scale. The scale
may only go to 6K connections per second. The same thing happens if you keep doubling.
16x the performance is great but you still haven't got to 10K connections. Performance is
not the same as scalability.
The problem was that Apache would fork a CGI process and then kill it. This didn't scale.
Why? Servers could not handle 10K concurrent connections because of the O(n^2)
algorithms used in the kernel.
Thread scheduling still didn't scale, so servers scaled using epoll with sockets, which
led to the asynchronous programming model embodied in Node and Nginx. This shifted
software to a different performance graph. Even with a slower server, when you add more
connections the performance doesn't drop off a cliff. At 10K connections a laptop is even
faster than a 16-core server.
Often what people build for Internet-scale problems are appliances rather than servers,
because they are selling hardware plus software. You buy the device and insert it into your
datacenter. These devices may contain an Intel motherboard or network processors and
specialized chips for encryption, packet inspection, etc.
X86 prices on Newegg as of Feb 2013: $5K for 40 Gbps, 32 cores, 256 GB RAM.
Such servers can do more than 10K connections. If they can't, it's because you've made bad
choices with software. It's not the underlying hardware that's the issue. This hardware can
easily scale to 10 million concurrent connections.
5. 10 microsecond latency - scalable servers might handle the scale but latency would
spike.
7. 10 coherent CPU cores - software should scale to larger numbers of cores. Typically
software only scales easily to four cores. Servers can scale to many more cores so
software needs to be rewritten to support larger core machines.
This is what Nginx does: don't use thread scheduling as the packet scheduler. Do the
packet scheduling yourself. Use select to find the socket; we know it has data so we can
read it immediately and it won't block, and then process the data.
Lesson: Let Unix handle the network stack, but you handle everything from that
point on.
Three kinds of scalability have to be addressed:
1. packet scalability
2. multi-core scalability
3. memory scalability
The way to do this is to write your own driver. All the driver does is send the packet to
your application instead of through the stack. Drivers you can find: PF_RING, Netmap, Intel
DPDK (Data Plane Development Kit). The Intel one is closed source, but there's a lot of support
around it.
How fast? Intel has a benchmark where they process 80 million packets per second
(200 clock cycles per packet) on a fairly lightweight server. This is through user mode, too.
The packet makes its way up to user mode and then down again to go out. Linux
doesn't do more than a million packets per second when getting UDP packets up to user
mode and out again. That's an 80:1 performance ratio for a custom driver over Linux.
For the 10 million packets per second goal, if 200 clock cycles are used in getting the
packet, that leaves 1,400 clock cycles to implement functionality like a DNS server or IDS.
With PF_RING you get raw packets, so you have to do your own TCP stack. People are
doing user-mode stacks. For Intel there is an available TCP stack that offers really scalable
performance.
Multi-Core Scalability
Multi-core scalability is not the same thing as multi-threading scalability. We're all familiar
with the idea that processors are not getting faster, but we are getting more of them.
Most code doesn't scale past 4 cores. As we add more cores it's not just that performance
levels off; we can get slower and slower as we add more cores. That's because software is
written badly. We want software to scale nearly linearly as we add more cores, to get
faster as we add more cores.
Multi-core:
When two threads/cores access the same data, they have to stop and wait for
each other
Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is
that most software starts waiting for other threads to give up a lock, so the kernel starts
eating up more performance than you gain from having more CPUs.
Solutions:
Keep data structures per core. Then on aggregation read all the counters.
Lock-free data structures. Accessible by threads that never stop and wait for
each other. Don't build them yourself; it's very complex to make them work across different architectures.
Memory Scalability
The problem is, if you have 20 GB of RAM and, let's say, you use 2 KB per connection,
then with only a 20 MB L3 cache, none of that data will be in cache. It costs 300 clock
cycles to go out to main memory, during which the CPU isn't doing anything.
Think about this with our 1,400 clock cycle budget per packet. Remember the 200
clocks/packet overhead. At 300 clocks per miss, we can only afford about 4 cache misses per packet, and that's a problem.
Co-locate Data
Don't scribble data all over memory via pointers. Each time you follow a
pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App].
That's four cache misses.
Keep all the data together in one chunk of memory: [TCB | Socket | App].
Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.
Paging
The page tables for 32 GB of RAM require 64 MB, which doesn't fit in
cache. So you have two cache misses: one for the page table and one for what it points
to. This is a detail we can't ignore when writing scalable software.
NUMA architectures double the main memory access time. Memory may not
be on a local socket but is on another socket.
Memory pools
Hyper-threading
Network processors can run up to 4 threads per processor; Intel CPUs have only 2.
This masks the latency, for example, from memory accesses because when
one thread waits the other goes at full speed.
Hugepages
Reduces page table size. Reserve memory from the start and then your
application manages the memory.
Summary
NIC
Solution: take the adapter away from the OS by using your own driver and
manage them yourself
CPU
Solution: Give Linux the first two CPUs and your application manages the
remaining CPUs. No interrupts will happen on the CPUs that you don't allow.
Memory
Solution: Reserve memory from the start (e.g. hugepages) and have your
application manage it.
Yet what you have is code running on Linux that you can debug normally; it's not some
weird hardware system that you need a custom engineer for. You get the performance of
custom hardware that you would expect for your data plane, but with your familiar
programming and development environment.
Related Articles
Read on Hacker News or Reddit, Hacker News Dos
Exokernel
Blog on C10M
Multi-core scaling: it's not multi-threaded, with some good comment action.
Todd Hoff | 26 Comments
in Example, Performance
Now I want to play devil's advocate (mostly because I thoroughly agree with this guy). The solutions proposed
sound like customized hardware-specific solutions, a move back to the old days, when you could
not just put some fairly random hardware together, slap Linux on top, and go... that will be the biggest backlash
to this: people fear appliance/vendor/driver lock-in, and the fear is a rational one.
What are the plans to make these very correct architectural practices available to the layman? Some sort of API
is needed, so individual hardware stacks can code to it, and this API must not be a heavy-weight abstraction.
This is a tough challenge.
Best of luck to the C10M movement; it is brilliant, and I would be a happy programmer if I can slap together a
system that does C10M sometime in the next few years.
Great article! I've always found scale fascinating, in general. Postgres 9.2 added support (real support) for up
to 64 cores, which really tickled my fancy. It seems the industry switches back and forth between "just throw
more cheap servers at it... they're cheap" and "let's see how high we can stack this, because it'll be tough to
split it up". I prefer the latter, but a combination thereof, as well as the sort of optimizations you speak of (not
simply delegating something like massive-scale networking to the kernel) tends to move us onward... toward
the robocalypse :)
Nice post. For memory scalability, you could also look at improving memory locality by controlling process
affinity with something like libnuma/numactl.
About DPDK, you can get the source code from http://dpdk.org ; it provides git, mailing list and good support.
It seems it is driven by 6WIND.
The universal scalability law (USL) is exhibited around 30:00 mins into the video presentation, despite his
statement that scalability and performance are unrelated. Note the performance maximum induced by the onset
of multicore coherency delays. Quantifying a similar effect for memcache was presented at Velocity 2010.
This is very interesting. The title may be better as "The Linux Kernel is the Problem", as this is different for
other kernels. Just as an example, last time I checked, Linux took 29 stack frames to go from syscalls to
start_xmit(). The illumos/Solaris kernel takes 16 for the similar path (syscall to mac_tx()). FreeBSD took 10
(syscall to ether_output()). These can vary; check your kernel version and workload. I've included them as an
example of stack variance. This should also make the Linux stack more expensive -- but I'd need to analyze
that (cycle based) to quantify.
Memory access is indeed the enemy, and you talk about saving cache misses, but a lot of work has been done
in Linux (and other kernels) for both CPU affinity and memory locality. Is it not working, or not working well
enough? Is there a bug that can be fixed? It would be great to analyze this and root cause the issue. On Linux,
run perf and look at the kernel stacks for cache misses. Better, quantify kernel stacks in TCP/IP for memory
access stall cycles -- and see if they add up and explain the problem.
Right now I'm working on a kernel network performance issue (I do kernel engineering). I think I've found a
kernel bug that can improve (reduce) network stack latency by about 2x for the benchmarks I'm running, found
by doing root cause analysis of the issue.
Such wins won't change overall potential win of bypassing the stack altogether. But it would be best to do this
having understood, and root caused, what the kernel's limits were first.
Bypassing the kernel also means you may need to reinvent some of your toolset for perf analysis (I use a lot of
custom tools that work between syscalls and the device, for analysing TCP latency and dropped packets,
beyond what network sniffing can do).
Reminds me of Van Jacobson's talk at LCA2006, talking about pushing work out to the endpoints. And the
kernel really isn't an endpoint (especially with SMP); the app is.
Unix wasn't initially used for controlling telephone networks. It was mostly a timesharing system for editing
documents (ASCII text with markup). This is from the BSTJ version of the CACM paper. The CACM
paper is earlier and slightly different.
http://cm.bell-labs.com/cm/cs/who/dmr/cacm.html.
Since PDP-11 Unix became operational in February, 1971, over 600 installations have been put into service.
Most of them are engaged in applications such as computer science education, the preparation and formatting
of documents and other textual material, the collection and processing of trouble data from various switching
machines within the Bell System, and recording and checking telephone service orders. Our own installation is
used mainly for research in operating systems, languages, computer networks, and other topics in computer
science, and also for document preparation.
Doesn't Netmap address the packet I/O efficiency problem just as well ?
Uhhhh, I hate to break it to you guys, but this problem, and its solutions, has long been known to the
exokernel community. While I admit that Linux and BSD "aren't Unix" in the trademark sense of the term,
they're still both predicated on an OS architecture that is no less than 60 years old (most of the stuff we take for
granted came, in some form or another, from Multics). It's time for a major update. :)
@Russell Sullivan
Nothing about this article is for the layman.
"Unix wasnt originally designed to be a general server OS, it was designed to be a control system for a
telephone network"
Uh, no, this is completely and utterly wrong. And talking about UNIX is irrelevant, since the Linux kernel was
developed independently.
Which is not to say that the technical arguments aren't right, but this sort of absurd nonsense doesn't help
credibility.
One problem is choke-points (whether shared data structures or mutexes, parallelizable or not) that exist in
user- and kernel- space. The exo-kernel and other approaches simply choose different ways of doing the same
thing by shifting the burden around (IP stack packetting functions). Ultimately, the hardware (real or virtual)
presents the most obvious, finite bottleneck on the solution.
At the ultra high end of other approaches that include silicon, http://tabula.com network-centric apps compiled
w/ app-specific OS frameworks coded in a functional style. This is promising not only for niche problems, but
looks like the deepest way of solving many of the traditional bottlenecks of temporal/spatial dataflow
problems. For 99% of solutions, it's probably wiser to start with vertically-scaled LMAX embedded systems
approaches first http://martinfowler.com/articles/lmax.html after having exhausted commercial gear.
Back at commercial scale, for Xen there are some OS-less runtimes:
Erlang - LING VM
Haskell - halvm
The 'Locks in Unix are implemented in the kernel' claim is not absolutely right for systems of today. We have
lightweight user-land locks, as in http://en.wikipedia.org/wiki/Futex. Although I agree with the potential speed gains
of an application doing the network packet handling, memory management (including paging) and CPU
scheduling are hardly in the domain of an application. Those kinds of things should be left to kernel developers
who have a better understanding of the underlying hardware...
Interesting how often the architectural pattern of "separation of control and data" shows up. I gave a talk just
the other day to a room full of executives in which I pointed out how this pattern occurred (fractally as it turns
out) in the product we were developing. You can see it in the kinds of large scale mass storage systems I
worked at while at NCAR (control over IP links, but data over high speed I/O channels using specialized
hardware), over the kinds of large telecommunications systems I worked on while at Bell Labs (again,
"signaling" over IP links, but all "bearer" traffic over specialized switching fabrics using completely different
transport mechanisms), in VOIP systems (control using SIP and handled by the software, but RTP bridged as
much as possible directly between endpoints with no handling by the software stack), and even in embedded
systems (bearer over TDM busses directly from the A/D chips, but control via control messages over SPI serial
busses). This pattern has been around at least since the 1960s, and maybe in other, non-digital-control contexts,
even earlier. It's really a division of specialized labor idea, which may go back thousands of years.
FYI: We've achieved 12 million concurrent connections using a single 1U server running standard Linux (no
source code changes) while publishing data at more than 1 Gbps. See more details at:
http://mrotaru.wordpress.com/2013/06/20/12-million-concurrent-connections-with-migratorydata-websocket-server/
But I agree that the Linux kernel should still offer substantial improvements as far as a high number of sockets is
concerned. It's good to see that new versions of the Linux kernel (starting with version 3.7) come with some
important improvements in terms of socket-related memory footprint. However, more optimization is
necessary and possible. As mentioned in my post, the memory footprint for 12 million concurrent connections
is about 36 GB. This could certainly be improved by the Linux kernel.
June 20, 2013 | Mihai Rotaru
By reading this article, I concluded that it's a new era for future OS design from scratch.
I tried PF_RING with lwIP but failed. I learned at IDF 2013 in Beijing that Wind River developed its own
user-space stack, which took 20 developers two years.
This is great! I was doing it the wrong way, one thread per connection, for a school project that I intended to
use commercially later on.
This is great! I have implemented a Java Web server using NIO and can achieve 10K+ connections on an 8 GB
RAM, 5-year-old desktop computer, but 10M is insane. On Linux the number of open sockets appears to be
limited by the ulimit, and I'm not sure if 10M is even possible without kernel tweaks. I guess for your approach
this isn't relevant.
Great post. Intel is hiring a Developer Evangelist for DPDK and Network Developing. If you know someone
interested in this space please pass on
http://jobs.intel.com/job/Santa-Clara-Networking-Developer-Evangelist-Job-CA-95050/77213300/
What kind of dumb, ambiguous way of expressing a number is this? Are you deliberately trying to confuse
people? You should be using either numerals or words *consistently* throughout the article, let alone the same
figure.