The C10K problem

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see: at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.06 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.
And the thin client model of computing appears to be coming back in style, this time with the server out on the Internet, serving thousands of clients.
With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.

Contents
The C10K problem
Related Sites
Book to Read First
I/O frameworks
I/O Strategies
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
  The traditional select()
  The traditional poll()
  /dev/poll (Solaris 2.7+)
  kqueue (FreeBSD, NetBSD)
2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
  epoll (Linux 2.6+)
  Polyakov's kevent (Linux 2.6+)
  Drepper's New Network Interface (proposal for Linux 2.6+)
  Realtime Signals (Linux 2.4+)
  Signal-per-fd
  kqueue (FreeBSD, NetBSD)
3. Serve many clients with each thread, and use asynchronous I/O and completion notification
4. Serve one client with each server thread
  LinuxThreads (Linux 2.0+)
  NGPT (Linux 2.4+)
  NPTL (Linux 2.6, Red Hat 9)
  FreeBSD threading support
  NetBSD threading support
  Solaris threading support
  Java threading support in JDK 1.3.x and earlier
  Note: 1:1 threading vs. M:N threading
5. Build the server code into the kernel
6. Bring the TCP stack into userspace
Comments
Limits on open filehandles
Limits on threads
Java issues [Updated 27 May 2001]
Other tips
  Zero-Copy
  The sendfile() system call can implement zero-copy networking.
  Avoid small frames by using writev (or TCP_CORK)
  Some programs can benefit from using non-Posix threads.
  Caching your own data can sometimes be a win.
Other limits
Kernel Issues
Measuring Server Performance
Examples
  Interesting select()-based servers
  Interesting /dev/poll-based servers
  Interesting epoll-based servers
  Interesting kqueue()-based servers
  Interesting realtime signal-based servers
  Interesting thread-based servers
  Interesting in-kernel servers
Other interesting links

Related Sites
See Nick Black's excellent Fast UNIX Servers page for a circa-2009 look at the situation.
In October 2003, Felix von Leitner put together an excellent web page and presentation about network scalability, complete with benchmarks comparing various networking system calls and operating systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, but there are many, many good graphs that will give the OS developers food for thought for some time. (See also the Slashdot comments; it'll be interesting to see whether anyone does followup benchmarks improving on Felix's results.)

Book to Read First
If you haven't read it already, go out and get a copy of Unix Network Programming: Networking Apis: Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem. And while you're at it, go read Jeff Darcy's notes on high-performance server design.
(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)

I/O frameworks
Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.
  ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, its Reactor is an OO way of doing nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O.
  ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated for the STL era.
  libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll. It's level-triggered only, I think, which has both good and bad sides. Niels has a nice graph of time to handle one event as a function of the number of connections. It shows kqueue and sys_epoll as clear winners.
  My own attempts at lightweight frameworks (sadly, not kept up to date):
    Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It's useful for benchmarks that compare the performance of the various APIs. This document links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
    rn is a lightweight C I/O framework that was my second try after Poller. It's LGPL (so it's easier to use in commercial apps) and C (so it's easier to use in non-C++ apps). It was used in some commercial products.
  Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
  Cory Nelson's Scale! library: an async socket, file, and pipe I/O library for Windows

I/O Strategies
Designers of networking software have many options. Here are a few:
Whether and how to issue multiple I/O calls from a single thread
  Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency
  Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. Generally only usable with network I/O, not disk I/O.
  Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O finishes. Good for both network and disk I/O.
How to control the code servicing each client
  one process for each client (classic Unix approach, used since 1980 or so)
  one OS-level thread handles many clients; each client is controlled by:
    a user-level thread (e.g. GNU state threads, classic Java with green threads)
    a state machine (a bit esoteric, but popular in some circles; my favorite)
    a continuation (a bit esoteric, but popular in some circles)
  one OS-level thread for each client (e.g. classic Java with native threads)
  one OS-level thread for each active client (e.g. Tomcat with apache front end; NT completion ports; thread pools)
Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver, kernel module, or VxD)
The following five combinations seem to be popular:
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
3. Serve many clients with each server thread, and use asynchronous I/O
4. Serve one client with each server thread, and use blocking I/O
5. Build the server code into the kernel


1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with that file descriptor since the last time the kernel told you about it. (The name 'level-triggered' comes from computer hardware design; it's the opposite of 'edge-triggered'. Jonathan Lemon introduced the terms in his BSDCON 2000 paper on kqueue().)
Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.
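To make the "only a hint" point concrete, here is a minimal sketch in C. The helper names set_nonblocking() and read_ready() are made up for illustration, and the -2 return code is just a convention chosen for this sketch; only the fcntl() and read() calls are standard.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Read after a readiness notification. Returns the byte count (0 means
 * end of file), -1 on a real error, or -2 when the readiness report
 * turned out to be stale and there is nothing to read yet.
 */
ssize_t read_ready(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;              /* stale hint: go back to waiting */
        return -1;
    }
}
```

Because the descriptor is nonblocking, a stale notification costs one failed read() instead of stalling every other client served by the thread.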
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's 1999 Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.
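The mincore() check might look roughly like this. It is a sketch for Linux with glibc (the vector is unsigned char there; BSD-derived systems declare it as char), and range_in_core() is a hypothetical helper name, not something taken from Flash.

```c
#define _DEFAULT_SOURCE             /* exposes mincore() on glibc */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if every page of the mapping [addr, addr+len) is resident,
 * 0 if at least one page would need disk I/O, -1 on error.
 * addr must be page-aligned, as mincore() requires.
 */
int range_in_core(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages, i;
    unsigned char *vec;
    int resident = -1;

    if (page <= 0 || len == 0)
        return -1;
    npages = (len + (size_t)page - 1) / (size_t)page;
    vec = malloc(npages);
    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) == 0) {
        resident = 1;
        for (i = 0; i < npages; i++) {
            if (!(vec[i] & 1)) {    /* low bit set means page is in core */
                resident = 0;
                break;
            }
        }
    }
    free(vec);
    return resident;
}
```

A server would call this on the mapped region it is about to send; on 0 it hands the region to a worker that touches the pages, and keeps servicing the network in the meantime.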
But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk access; they improved performance by introducing a modified sendfile() that returns something like EWOULDBLOCK when the disk page it's fetching is not yet in core. (Not sure how you tell the user the page is now resident... seems to me what's really needed here is aio_sendfile().) The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
The traditional select()
Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)
See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.
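As a small illustration of the FD_SETSIZE limit, the sketch below refuses descriptors that do not fit in an fd_set rather than passing them to FD_SET(). The function name select_readable() and its return codes are invented for this example.

```c
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable using select().
 * Returns 1 if readable, 0 on timeout, -1 on error, and -2 if fd
 * cannot be represented in an fd_set at all.
 */
int select_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -2;                  /* FD_SET() here would be undefined */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

The guard matters in a C10K server: the ten-thousandth connection will typically get a descriptor number well above the usual FD_SETSIZE of 1024.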
The traditional poll()
There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow beyond a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.
Some OS's (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which was implemented and benchmarked by Niels Provos for Linux in 1999.
See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.
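The scanning cost is easy to see in code. In this sketch (poll_readable() is a hypothetical helper, not part of Poller) the whole array is handed to the kernel and then walked again in user space on every call, no matter how few descriptors are actually ready.

```c
#include <poll.h>
#include <stdlib.h>

/*
 * Poll nfds descriptors for readability and copy the ready ones into
 * ready[] (capacity nfds). Returns how many were ready, 0 on timeout,
 * or -1 on error.
 */
int poll_readable(const int *fds, size_t nfds, int timeout_ms, int *ready)
{
    struct pollfd *pfds = calloc(nfds ? nfds : 1, sizeof *pfds);
    size_t i;
    int n, count = 0;

    if (pfds == NULL)
        return -1;
    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }
    n = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (n > 0) {
        /* poll() returns only a count, so every entry is inspected. */
        for (i = 0; i < nfds; i++)
            if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))
                ready[count++] = pfds[i].fd;
    } else {
        count = n;                  /* 0 on timeout, -1 on error */
    }
    free(pfds);
    return count;
}
```

A real server would keep the pollfd array around between calls instead of rebuilding it, but the kernel still has to look at every entry each time.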
/dev/poll
This is the recommended poll replacement for Solaris.
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Various implementations of /dev/poll were tried on Linux, but none of them perform as well as epoll, and were never really completed. /dev/poll use on Linux is not recommended.
See Poller_devpoll (cc, h benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution: the example is for Linux /dev/poll, might not work right on Solaris.)
kqueue()
This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).
See below. kqueue() can specify either edge triggering or level triggering.

2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that type for that file descriptor until you do something that causes the file descriptor to no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).
When you use readiness change notification, you must be prepared for spurious events, since one common implementation is to signal readiness whenever any packets are received, regardless of whether the file descriptor was already ready.
This is the opposite of "level-triggered" readiness notification. It's a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it's worth trying.
[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.
There are several APIs which let the application retrieve 'file descriptor became ready' notifications:
kqueue()
This is the recommended edge-triggered poll replacement for FreeBSD (and, soon, NetBSD).
FreeBSD 4.3 and later, and NetBSD-current as of Oct 2002, support a generalized alternative to poll() called kqueue()/kevent(); it supports both edge-triggering and level-triggering. (See also Jonathan Lemon's page and his BSDCon 2000 paper on kqueue().)
Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.
See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably with many other readiness notification schemes.
Examples and libraries using kqueue():
  PyKQueue: a Python binding for kqueue()
  Ronald F. Guilmette's example echo server; see also his 28 Sept 2000 post on freebsd.questions.
epoll
This is the recommended edge-triggered poll replacement for the 2.6 Linux kernel.
On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals; his patch provides what he now calls /dev/epoll www.xmailserver.org/linux-patches/nio-improve.html. This is just like the realtime signal readiness notification, but it coalesces redundant events, and has a more efficient scheme for bulk event retrieval.
Epoll was merged into the 2.5 kernel tree as of 2.5.46 after its interface was changed from a special file in /dev to a system call, sys_epoll. A patch for the older version of epoll is available for the 2.4 kernel.
There was a lengthy debate about unifying epoll, aio, and other event sources on the linux-kernel mailing list around Halloween 2002. It may yet happen, but Davide is concentrating on firming up epoll in general first.
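For a feel of what edge-triggered use looks like with the epoll system calls as they ship in current Linux kernels, here is a small sketch. watch_edge_triggered() and drain() are names made up for this example; the important part is that after each event you keep reading until EAGAIN, because no further notification will arrive for data already sitting in the buffer.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Make fd nonblocking and register it for edge-triggered read events. */
int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * After an edge-triggered event, consume everything that is available.
 * Returns total bytes read, or -1 on error; *eof is set at end of file.
 */
long drain(int fd, int *eof)
{
    char buf[4096];
    long total = 0;

    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;             /* a real server would parse buf here */
            continue;
        }
        if (n == 0) {
            *eof = 1;
            return total;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;           /* drained: wait for the next edge */
        return -1;
    }
}
```

The event loop itself is then just epoll_wait() followed by drain() on each descriptor it reports.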
Polyakov's kevent (Linux 2.6+)
News flash: On 9 Feb 2006, and again on 9 July 2006, Evgeniy Polyakov posted patches which seem to unify epoll and aio; his goal is to support network AIO. See:
  the LWN article about kevent
  his July announcement
  his kevent page
  his naio page
  some recent discussion
Drepper's New Network Interface (proposal for Linux 2.6+)
At OLS 2006, Ulrich Drepper proposed a new high-speed asynchronous networking API. See:
  his paper, "The Need for Asynchronous, Zero-Copy Network I/O"
  his slides
  LWN article from July 22
Realtime Signals
This is the recommended edge-triggered poll replacement for the 2.4 Linux kernel.
The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here's how to turn this behavior on:
    /* Assumes: sigset_t sigset; int flags; and _GNU_SOURCE defined for F_SETSIG. */
    /* Mask off SIGIO and the signal you want to use. */
    sigemptyset(&sigset);
    sigaddset(&sigset, signum);
    sigaddset(&sigset, SIGIO);
    sigprocmask(SIG_BLOCK, &sigset, NULL);
    /* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
    fcntl(fd, F_SETOWN, (int) getpid());
    fcntl(fd, F_SETSIG, signum);
    flags = fcntl(fd, F_GETFL);
    flags |= O_NONBLOCK|O_ASYNC;
    fcntl(fd, F_SETFL, flags);

This sends that signal when a normal I/O function like read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(), you loop calling sigwaitinfo().
If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll() loop.
See Poller_sigio (cc, h) for an example of how to use rtsignals interchangeably with many other readiness notification schemes.
See Zach Brown's phhttpd for example code that uses this feature directly. (Or don't; phhttpd is a bit hard to figure out...)
[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Signal-per-fd
Chandra and Mosberger proposed a modification to the realtime signal approach called "signal-per-fd" which reduces or eliminates realtime signal queue overflow by coalescing redundant events. It doesn't outperform epoll, though. Their paper (www.hpl.hp.com/techreports/2000/HPL-2000-174.html) compares performance of this scheme with select() and /dev/poll.
Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at www.luban.org/GPL/gpl.html. (Note: as of Sept 2001, there may still be stability problems with this patch under heavy load. dkftpbench at about 4500 users may be able to trigger an oops.)
See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with many other readiness notification schemes.

3. Serve many clients with each server thread, and use asynchronous I/O
This has not yet become popular in Unix, probably because few operating systems support asynchronous I/O, also possibly because it (like nonblocking I/O) requires rethinking your application. Under standard Unix, asynchronous I/O is provided by the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2.
AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. (It can also be used with level-triggered completion notification by calling aio_suspend(), though I suspect few people do this.)
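As a minimal sketch of the aio_ interface, the function below starts a read and then waits for it with aio_suspend(), i.e. the less common variant without a completion signal. aio_read_and_wait() is a made-up name, and on older glibc versions the program may need to be linked with -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/*
 * Start an asynchronous read of up to len bytes from offset 0 of fd,
 * then wait for completion. Returns bytes read, or -1 on error.
 */
ssize_t aio_read_and_wait(int fd, void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal: we wait */

    if (aio_read(&cb) == -1)
        return -1;

    /* A real server would go handle other clients at this point. */

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL); /* may return early; just re-check */

    if (aio_error(&cb) != 0)
        return -1;
    return aio_return(&cb);         /* fetch the result exactly once */
}
```

To get the queued-signal style described above instead, set sigev_notify to SIGEV_SIGNAL and fill in sigev_signo and sigev_value before calling aio_read().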
glibc 2.1 and later provide a generic implementation written for standards compliance rather than performance.
Ben LaHaise's implementation for Linux AIO was merged into the main Linux kernel as of 2.5.32. It doesn't use kernel threads, and has a very efficient underlying api, but (as of 2.6.0-test2) doesn't yet support sockets. (There is also an AIO patch for the 2.4 kernels, but the 2.5/2.6 implementation is somewhat different.) More info:
  The page "Kernel Asynchronous I/O (AIO) Support for Linux" which tries to tie together all info about the 2.6 kernel's implementation of AIO (posted 16 Sept 2003)
  Round 3: aio vs /dev/epoll by Benjamin C.R. LaHaise (presented at 2002 OLS)
  Asynchronous I/O Support in Linux 2.5, by Bhattacharya, Pratt, Pulavarty, and Morgan, IBM; presented at OLS '2003
  Design Notes on Asynchronous I/O (aio) for Linux by Suparna Bhattacharya; compares Ben's AIO with SGI's KAIO and a few other AIO projects
  Linux AIO home page; Ben's preliminary patches, mailing list, etc.
  linux-aio mailing list archives
  libaio-oracle; library implementing standard Posix AIO on top of libaio. First mentioned by Joel Becker on 18 Apr 2003.
Suparna also suggests having a look at the DAFS API's approach to AIO.
Red Hat AS and Suse SLES both provide a high-performance implementation on the 2.4 kernel; it is related to, but not completely identical to, the 2.6 kernel implementation.
In February 2006, a new attempt is being made to provide network AIO; see the note above about Evgeniy Polyakov's kevent-based AIO.
In 1999, SGI implemented high-speed AIO for Linux. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads. It is still useful for people who can't wait for Ben's AIO to support sockets.
The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.
A tutorial for the earlier, nonstandard, aio implementation on Solaris is online at Sunsite. It's probably worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read", etc.
Note that AIO doesn't provide a way to open files without blocking for disk I/O; if you care about the sleep caused by opening a disk file, Linus suggests you should simply do the open() in a different thread rather than wishing for an aio_open() system call.
Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O Completion Port". Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark Russinovich at sysinternals.com, Jeffrey Richter's book "Programming Server-Side Applications for Microsoft Windows 2000" (Amazon, MSPress), U.S. patent #06223207, or MSDN.

4. Serve one client with each server thread
... and let read() and write() block. Has the disadvantage of using a whole stack for each client, which costs memory. Many OS's also have trouble handling more than a few hundred threads. If each thread gets a 2MB stack (not an uncommon default value), you run out of *virtual memory* at (2^30 / 2^21) = 512 threads on a 32 bit machine with 1GB user-accessible VM (like, say, Linux as normally shipped on x86). You can work around this by giving each thread a smaller stack, but since most thread libraries don't allow growing thread stacks once created, doing this means designing your program to minimize stack use. You can also work around this by moving to a 64 bit processor.
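The arithmetic and the smaller-stack workaround can be sketched like this. The names max_threads(), client_thread(), and spawn_small_stack() are invented for illustration, and the 256KB figure in the usage note is only an example value, not a recommendation.

```c
#include <pthread.h>
#include <stddef.h>

/* How many stacks of stack_bytes fit into vm_bytes of address space. */
size_t max_threads(size_t vm_bytes, size_t stack_bytes)
{
    return stack_bytes ? vm_bytes / stack_bytes : 0;
}

/* Placeholder for the per-client work a real server would do. */
static void *client_thread(void *arg)
{
    return arg;
}

/*
 * Start one thread with an explicit stack size.
 * Returns 0 on success or an errno-style value on failure.
 */
int spawn_small_stack(size_t stack_bytes, pthread_t *out)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(out, &attr, client_thread, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```

With 2MB stacks, max_threads((size_t)1 << 30, (size_t)1 << 21) gives the 512 from the text; dropping to 256KB stacks raises the ceiling to 4096, at the price of having to keep deep recursion and large local buffers out of the per-client code.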
The thread support in Linux, FreeBSD, and Solaris is improving, and 64 bit processors are just around the corner even for mainstream users. Perhaps in the not-too-distant future, those who prefer using one thread per client will be able to use that paradigm even for 10000 clients. Nevertheless, at the current time, if you actually want to support that many clients, you're probably better off using some other paradigm.
For an unabashedly pro-thread viewpoint, see Why Events Are A Bad Idea (for High-concurrency Servers) by von Behren, Condit, and Brewer, UCB, presented at HotOS IX. Anyone from the anti-thread camp care to point out a paper that rebuts this one? :-)
LinuxThreads
LinuxThreads is the name for the standard Linux thread library. It is integrated into glibc since glibc2.0, and is mostly Posix-compliant, but with less than stellar performance and signal support.
NGPT: Next Generation Posix Threads for Linux
NGPT is a project started by IBM to bring good Posix-compliant thread support to Linux. It's at stable version 2.2 now, and works well... but the NGPT team has announced that they are putting the NGPT codebase into support-only mode because they feel it's "the best way to support the community for the long term". The NGPT team will continue working to improve Linux thread support, but now focused on improving NPTL. (Kudos to the NGPT team for their good work and the graceful way they conceded to NPTL.)
NPTL: Native Posix Thread Library for Linux
NPTL is a project by Ulrich Drepper (the benevolent dict^H^H^H^Hmaintainer of glibc) and Ingo Molnar to bring world-class Posix threading support to Linux.
As of 5 October 2003, NPTL is now merged into the glibc cvs tree as an add-on directory (just like linuxthreads), so it will almost certainly be released along with the next release of glibc.
The first major distribution to include an early snapshot of NPTL was Red Hat 9. (This was a bit inconvenient for some users, but somebody had to break the ice...)
NPTL links:
  Mailing list for NPTL discussion
  NPTL source code
  Initial announcement for NPTL
  Original whitepaper describing the goals for NPTL
  Revised whitepaper describing the final design of NPTL
  Ingo Molnar's first benchmark showing it could handle 10^6 threads
  Ulrich's benchmark comparing performance of LinuxThreads, NPTL, and IBM's NGPT. It seems to show NPTL is much faster than NGPT.
Here's my try at describing the history of NPTL (see also Jerry Cooperstein's article):
In March 2002, Bill Abt of the NGPT team, the glibc maintainer Ulrich Drepper, and others met to figure out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex performance; Rusty Russell et al subsequently implemented fast userspace mutexes (futexes), which are now used by both NGPT and NPTL. Most of the attendees figured NGPT should be merged into glibc.
Ulrich Drepper, though, didn't like NGPT, and figured he could do better. (For those who have ever tried to contribute a patch to glibc, this may not come as a big surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others contributed glibc and kernel changes that make up something called the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT, and takes advantage of a few new ones. Ingo Molnar described the kernel enhancements as follows:
    While NPTL uses the three kernel features introduced by NGPT: getpid() returns PID, CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new kernel features, developed as part of this project.
    Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up and extended, such as thread group handling (CLONE_THREAD). [the CLONE_THREAD changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]
    The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf ...
    A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.
    There was also work put into extending the PID space, eg. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.
    In essence the new features are a no-compromises approach to 1:1 threading: the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.
One big difference between the two is that NPTL is a 1:1 threading model, whereas NGPT is an M:N threading model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL is indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code to verify the result.)
FreeBSD threading support
FreeBSD supports both LinuxThreads and a userspace threading library. Also, an M:N implementation called KSE was introduced in FreeBSD 5.0. For one overview, see www.unobvious.com/bsd/freebsd-threads.html.
On 25 Mar 2003, Jeff Roberson posted on freebsd-arch:
    ... Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated with KSE and libpthread development Mini and I have developed a 1:1 threading implementation. This code works in parallel with KSE and does not break it in any way. It actually helps bring M:N threading closer by testing out shared bits. ...
And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:
    I know this has been discussed in the past, but I figured with 7.x trundling forward, it was time to think about it again. In benchmarks for many common applications and scenarios, libthr demonstrates significantly better performance over libpthread... libthr is also implemented across a larger number of our platforms, and is already libpthread on several. The first recommendation we make to MySQL and other heavy thread users is "Switch to libthr", which is suggestive, also! ... So the strawman proposal is: make libthr the default threading library on 7.x.
NetBSD threading support
According to a note from Noriyuki Soda:
    Kernel supported M:N thread library based on the Scheduler Activations model is merged into NetBSD-current on Jan 18 2003.
For details, see An Implementation of Scheduler Activations on the NetBSD Operating System by Nathan J. Williams, Wasabi Systems, Inc., presented at FREENIX '02.
Solaris threading support
The thread support in Solaris is evolving... from Solaris 2 to Solaris 8, the default threading library used an M:N model, but Solaris 9 defaults to 1:1 model thread support. See Sun's multithreaded programming guide and Sun's note about Java and Solaris threading.
Java threading support in JDK 1.3.x and earlier
As is well known, Java up to JDK1.3.x did not support any method of handling network connections other than one thread per client. Volanomark is a good microbenchmark which measures throughput in messages per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3 implementations from various vendors are in fact able to handle ten thousand simultaneous connections, albeit with significant performance degradation. See Table 4 for an idea of which JVMs can handle 10000 connections, and how performance suffers as the number of connections increases.
Note: 1:1 threading vs. M:N threading
There is a choice when implementing a threading library: you can either put all the threading support in the kernel (this is called the 1:1 threading model), or you can move a fair bit of it into userspace (this is called the M:N threading model). At one point, M:N was thought to be higher performance, but it's so complex that it's hard to get right, and most people are moving away from it.
  Why Ingo Molnar prefers 1:1 over M:N
  Sun is moving to 1:1 threads
  NGPT is an M:N threading library for Linux.
  Although Ulrich Drepper planned to use M:N threads in the new glibc threading library, he has since switched to the 1:1 threading model.
  MacOSX appears to use 1:1 threading.
  FreeBSD and NetBSD appear to still believe in M:N threading... The lone holdouts? Looks like FreeBSD 7.0 might switch to 1:1 threading (see above), so perhaps M:N threading's believers have finally been proven wrong everywhere.

5. Build the server code into the kernel
Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be that, instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides sufficient power to user programs, as the X15 server runs about as fast as Tux, but doesn't use any kernel modifications.

Bring the TCP stack into userspace
See for instance the netmap packet I/O framework, and the Sandstorm proof-of-concept web server based on it.

Comments
Richard Gooch has written a paper discussing I/O options.
In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers. Their data is worth a look.
In 2003, Tim Brecht posted source code for userver, a small web server put together from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), epoll(), or sigio.
Back in March 1999, Dean Gaudet posted:
    I keep getting asked "why don't you guys use a select/event based model like Zeus? It's clearly the fastest." ...
His reasons boiled down to "it's really hard, and the payoff isn't clear". Within a few months, though, it became clear that people were willing to work on it.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even if he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).
See also OS/400's support for I/O completion ports.
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous
Connections" (and the second week of the thread). Highlights:
Ed Hall posted a few notes on his experiences; he's achieved >1000 connects/second on a UP
P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU) each managing a
large number of clients using "an event-based model".
Mike Jagdis posted an analysis of poll/select overhead, and said "The current select/poll
implementation can be improved significantly, especially in the blocking case, but the overhead
will still increase with the number of descriptors because select/poll does not, and cannot,
remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions
are welcome..."
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select(): "How about a 'device like' API
where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs
representing them when you read it? ..."
Rogier Wolff suggested using "the API that the digital guys suggested",
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just
file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives
should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and
sigwaitinfo() was a superset of the API proposed in
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal
blocked at all times if you're interested in performance; instead of the signal being delivered
asynchronously, the process grabs the next one from the queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and
concluded they're pretty similar.
Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted:
"With a signal dequeuing mechanism, your application is going to get signals destined for various
library components if libraries are using the same mechanism," but the library can set up its own
signal handler, so this shouldn't affect the program (much).
Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on
the Sun calendar server. Others chimed in with estimates of how much RAM that would require on
Linux, and what bottlenecks would be hit.
Interesting reading!
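Stephen Tweedie's point in the thread above (keep the signal blocked and pull queued signals off synchronously) can be sketched in a few lines of Python. This is only an illustration of the dequeuing half of the model: the helper name dequeue_blocked_signal is made up, and pthread_kill() stands in for the kernel queuing a readiness signal, which in a real server would be arranged with F_SETSIG on each descriptor.

```python
import signal
import threading


def dequeue_blocked_signal(signo=signal.SIGRTMIN, timeout=1.0):
    """Block signo, queue one instance for this thread, then dequeue it synchronously.

    Hypothetical helper for illustration only. Returns the dequeued signal
    number, or None if nothing arrived before the timeout.
    """
    # Keep the signal blocked so it is queued instead of interrupting us.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signo})
    try:
        # Stand-in for the kernel queuing a readiness signal for a descriptor.
        signal.pthread_kill(threading.get_ident(), signo)
        # Grab the next queued signal, the way sigwaitinfo()/sigtimedwait() would in C.
        info = signal.sigtimedwait({signo}, timeout)
        return None if info is None else info.si_signo
    finally:
        # Restore the mask that was in effect before the call.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

In a real event loop the call to sigtimedwait() would sit at the top of the loop, and the returned siginfo would identify which descriptor became ready.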

Limits on open file handles
Any Unix: the limits set by ulimit or setrlimit.
Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions
periodically).
FreeBSD:
Edit /boot/loader.conf, add the line
    set kern.maxfiles=XXXX

where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous
reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and
says
    "FWIW: You can't actually tune the maximum number of connections in FreeBSD
    trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
    The reason for this is that the zalloci() calls for initializing the sockets and tcpcb
    structures zones occurs very early in system startup, in order that the zone be both type
    stable and that it be swappable.
    You will also need to set the number of mbufs much higher, since you will (on an
    unmodified kernel) chew up one mbuf per connection for tcptempl structures, which
    are used to implement keepalive."
Another reader says
    "As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have
    to worry about one mbuf being chewed up per connection."
See also:
the FreeBSD handbook
SYSCTL TUNING, LOADER TUNABLES, and KERNEL CONFIG TUNING in 'man tuning'
The Effects of Tuning a FreeBSD 4.3 Box for High Performance, Daemon News, Aug 2001
postfix.org tuning notes, covering FreeBSD 4.2 and 4.4
the Measurement Factory's notes, circa FreeBSD 4.3
OpenBSD: A reader says
    "In OpenBSD, an additional tweak is required to increase the number of open
    filehandles available per process: the openfiles-cur parameter in /etc/login.conf needs
    to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf
    but it has no effect. This matters because as shipped, the login.conf limits are a quite
    low 64 for nonprivileged processes, 128 for privileged."
Linux: See Bodo Bauer's /proc documentation. On 2.4 kernels:
    echo 32768 > /proc/sys/fs/file-max

increases the system limit on open files, and
    ulimit -n 32768

increases the current process' limit.
On 2.2.x kernels,
    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max

increases the system limit on open files, and
    ulimit -n 32768

increases the current process' limit.
I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file
descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000
file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using
initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with
the above changes.
See also Oskar's 1998 post, which talks about the per-process and system-wide limits on file
descriptors in the 2.0.36 kernel.
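The per-process half of the limits above (what ulimit -n changes) can also be raised from inside the server at startup via setrlimit. A minimal sketch in Python, assuming a Unix system where the resource module is available; the helper name raise_nofile_limit is made up for illustration, and an unprivileged process can only go as high as the hard limit, so the system-wide settings above still apply.

```python
import resource


def raise_nofile_limit(wanted):
    """Raise this process's soft RLIMIT_NOFILE toward `wanted`, capped by the hard limit.

    Hypothetical helper. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may not exceed the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # already at least as high as we can usefully ask for
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling raise_nofile_limit(32768) early in main() mirrors the "ulimit -n 32768" step without depending on how the process was launched.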

Limits on threads
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid
running out of virtual memory. You can set this at runtime with pthread_attr_init() and
pthread_attr_setstacksize() if you're using pthreads.
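As an illustration of shrinking per-thread stacks, here is a small Python sketch; threading.stack_size() is the analogous knob there. The 256 KB figure and the helper name run_small_stack_threads are arbitrary choices for the example, not recommendations.

```python
import threading


def run_small_stack_threads(count, stack_bytes=256 * 1024):
    """Start `count` threads with a reduced stack size and return how many ran.

    Hypothetical helper; 256 KB is an arbitrary illustrative stack size.
    """
    # The new size applies to threads created after this call.
    previous = threading.stack_size(stack_bytes)
    finished = []
    lock = threading.Lock()

    def worker(i):
        with lock:
            finished.append(i)

    try:
        threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Restore whatever stack size was configured before.
        threading.stack_size(previous)
    return len(finished)
```

With a 1 MB default stack, 2000 threads reserve about 2 GB of address space; at 256 KB the same threads reserve about 500 MB.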
Solaris: it supports as many threads as will fit in memory, I hear.
Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above
32000 or so threads. (You'll need to use very small stack threads to get anywhere near that number
of threads, though, unless you're on a 64 bit processor.) See the NPTL mailing list, e.g. the thread
with subject "Cannot create more than 32K threads?", for more info.
Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my
Red Hat 8 system. You can increase this as usual by echoing new values into that file, e.g.
"echo 4000 > /proc/sys/kernel/threads-max"
Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what
the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this
limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel.
Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but
is somewhat dated.
Java: See Volano's detailed benchmark info, plus their info on how to tune various systems to
handle lots of threads.

Java issues
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model.
There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and
some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API.
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that
they have advantages over blocking sockets in servers handling many (up to 10000) connections. His
class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance
with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt
Welsh on events vs. worker threads.
Before NIO, there were several proposals for improving Java's networking APIs:
Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory
management changes to allow the use of asynchronous I/O with Java.
Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes
memory management changes to allow the use of asynchronous I/O with Java.
JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who
says Sun doesn't listen?).

Other tips
Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that reduces
these copies to the bare physical minimum is called "zero-copy".
Thomas Ogrisegg's zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20.
Claims it's faster than sendfile().
IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
Back in 1999, Alan Cox noted that zero-copy is sometimes not worth the trouble. (He did like
sendfile(), though.)
Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and
says he'll make it available to userspace soon.
Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea
seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the
amount of data transferred is at least a page, *and* you don't immediately reuse the buffer,
memory management tricks will be used to avoid copies. But see followups to this message
on linux-kernel for people's misgivings about the speed of those memory management tricks.
According to a note from Noriyuki Soda:
    Sending side zero-copy is supported since NetBSD-1.6 release by specifying
    "SOSEND_LOAN" kernel option. This option is now default on NetBSD-current
    (you can disable this feature by specifying "SOSEND_NO_LOAN" in the kernel
    option on NetBSD_current). With this feature, zero-copy is automatically
    enabled, if data more than 4096 bytes are specified as data to be sent.
The sendfile() system call can implement zero-copy networking.
The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a
file. This lets the OS do it as efficiently as possible. It can be used equally well in servers
using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the
moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See
also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor
has it, ftp.cdrom.com benefitted noticeably from sendfile().
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25
2001.
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of
POLLOUT makes a big difference.
Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man
page is here. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most
useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a
nonblocking socket.
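To make the sendfile() idea concrete, here is a minimal sketch using Python's os.sendfile() wrapper on Linux with a blocking socket. The helper names send_file_over_socket and demo_roundtrip are invented for the example, and the demo only handles payloads small enough to fit in the socket buffer, since nothing reads while the sender is writing.

```python
import os
import socket
import tempfile


def send_file_over_socket(sock, path):
    """Send the whole file at `path` down `sock` with sendfile(); return bytes sent."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves data from the file to the socket; no user-space buffer.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break  # file shrank underneath us
            sent += n
    return sent


def demo_roundtrip(payload):
    """Write a small payload to a temp file, sendfile() it over a socketpair, return what arrives."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        a, b = socket.socketpair()
        with a, b:
            send_file_over_socket(a, tmp.name)
            a.shutdown(socket.SHUT_WR)  # signal end of data to the reader
            chunks = []
            while True:
                chunk = b.recv(65536)
                if not chunk:
                    break
                chunks.append(chunk)
    return b"".join(chunks)
```

With a nonblocking socket the loop would instead have to stop on EAGAIN, remember the offset, and resume when the socket is reported writable again.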
Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames,
which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some
reason. Unsetting the option flushes the buffer. Better to use writev(), though...
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about
TCP_CORK and a possible alternative MSG_MORE.
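Both approaches can be sketched briefly in Python on Linux. The function names send_gathered and send_corked are made up for the example; send_corked assumes a connected TCP socket, since TCP_CORK is a TCP-level option.

```python
import os
import socket


def send_gathered(sock, pieces):
    """Send several small buffers with a single writev() call; return bytes written."""
    # One system call, so the kernel sees header and body together.
    return os.writev(sock.fileno(), list(pieces))


def send_corked(tcp_sock, pieces):
    """Alternative for a Linux TCP socket: cork, do the small writes, then uncork."""
    tcp_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        for piece in pieces:
            tcp_sock.sendall(piece)
    finally:
        # Unsetting the option flushes any partial frame still buffered.
        tcp_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
```

For example, send_gathered(conn, [header_bytes, body_bytes]) replaces two write() calls. On a nonblocking socket writev() may write only part of the data, so the return value has to be checked.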
Behave sensibly on overload.
[Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is
overloaded improved the shape of the performance curve, and reduced the overall error rate. They
used a smoothed version of "number of clients with I/O ready" as a measure of overload. This
technique should be easily applicable to servers written with select, poll, or any system call that
returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait4()).
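One way to picture the overload measure described above is the sketch below. The text does not say which smoothing the authors used, so the exponentially weighted moving average here is an assumption, as are the class name OverloadGauge, the threshold, and the alpha value.

```python
class OverloadGauge:
    """Track a smoothed count of ready clients and decide when to shed new connections.

    Hypothetical sketch: the smoothing method (EWMA) and the parameters are
    assumptions, not taken from the cited paper.
    """

    def __init__(self, threshold, alpha=0.25):
        self.threshold = threshold  # cutoff to be tuned per server
        self.alpha = alpha          # weight given to the newest sample
        self.smoothed = 0.0

    def update(self, ready_count):
        """Fold in the number of ready descriptors returned by the latest poll call."""
        self.smoothed += self.alpha * (ready_count - self.smoothed)
        return self.smoothed

    def should_drop(self):
        """True while the smoothed load is above the threshold."""
        return self.smoothed > self.threshold
```

An event loop would call update() with the return count of each select() or poll() call, and close newly accepted connections right away while should_drop() is true.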
Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating
systems) lets you create a thread that has its own current working directory, for instance, which can
be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of
native threads rather than pthreads.
Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd,
May 9th, states:
    "I've compared the raw performance of a select-based server with a multiple-process
    server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a
    marginal difference in performance stemming from the software architecture. The big
    performance win for select-based servers stems from doing application-level caching.
    While multiple-process servers can do it at a higher cost, it's harder to get the same
    benefits on real workloads (vs microbenchmarks). I'll be presenting those
    measurements as part of a paper that'll appear at the next Usenix conference. If you've
    got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"
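As a rough picture of what application-level caching means in a single-process server, here is a small sketch. It is not the scheme from the paper quoted above; the class name FileCache is made up, and the cache is unbounded, which a real server would have to fix with some eviction policy.

```python
import os


class FileCache:
    """Tiny in-process cache of file contents, revalidated by mtime and size.

    Hypothetical sketch: no eviction, so memory use grows with the number of files.
    """

    def __init__(self):
        self._entries = {}  # path -> ((mtime_ns, size), data)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        """Return the file's bytes, reading from disk only if it changed or is new."""
        st = os.stat(path)
        key = (st.st_mtime_ns, st.st_size)
        entry = self._entries.get(path)
        if entry is not None and entry[0] == key:
            self.hits += 1
            return entry[1]
        with open(path, "rb") as f:
            data = f.read()
        self._entries[path] = (key, data)
        self.misses += 1
        return data
```

Because every connection in a select-based server runs in the same process, one such cache serves all clients; a multiple-process server needs shared memory or one copy per process to get the same effect.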

Other limits
Old system libraries might use 16 bit variables to hold file handles, which causes trouble above
32767 handles. glibc2.1 should be ok.
Many systems use 16 bit variables to hold process or thread id's. It would be interesting to port the
Volano scalability benchmark to C, and see what the upper limit on number of threads is for the
various operating systems.
Too much thread-local memory is preallocated by some operating systems; if each thread gets
1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
Look at the performance comparison graph at the bottom of
http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble
above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux
2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open,
you would expect exactly this behavior. There may be other causes.

Kernel Issues
For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel
Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of
http and smb clients, in which they failed to see good results from Linux. See also my article on
Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting
poll patch, and some work on the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time
over 'slow' connections by 80%.

Measuring Server Performance
Two tests in particular are simple, interesting, and hard:
1. raw connections per second (how many 512 byte files per second can you serve?)
2. total transfer rate on large files with many slow clients (how many 28.8k modem clients can
simultaneously download from your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers. See
http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's
worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples
Nginx is a web server that uses whatever high-efficiency network event mechanism is available on the
target OS. It's getting popular; there are even two books about it.

Interesting select()-based servers
thttpd: Very simple. Uses a single process. It has good performance, but doesn't scale with the
number of CPU's. Can also use kqueue.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite: web server using IO-Lite.
Flash: An efficient and portable Web server; uses select(), mmap(), mincore()
The Flash web server as of 2003: uses select(), modified sendfile(), async open()
xitami: uses select() to implement its own thread abstraction for portability to systems without
threads.
Medusa: a server-writing toolkit in Python that tries to deliver very high performance.
userver: a small http server that can use select, poll, epoll, or sigio

Interesting /dev/poll-based servers
N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc.
USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to
support /dev/poll. Performance is compared with phhttpd.

Interesting epoll-based servers
ribs2
cmogstored: uses epoll/kqueue for most networking, threads for disk and accept4

Interesting kqueue()-based servers
thttpd (as of version 2.21?)
Adrian Chadd says "I'm doing a lot of work to make squid actually LIKE a kqueue IO system"; it's
an official Squid subproject; see http://squid.sourceforge.net/projects.html#commloops. (This is
apparently newer than Benno's patch.)

Interesting realtime signal-based servers
Chromium's X15. This uses the 2.4 kernel's SIGIO feature together with sendfile() and
TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a
community source (not open source) license. See the original announcement by Fabio Riccardi.
Zach Brown's phhttpd: "a quick web server that was written to showcase the sigio/siginfo event
model. consider this code highly experimental and yourself highly mental if you try and use it in a
production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed
patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for
some notes.

Interesting thread-based servers
Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Interesting in-kernel servers
khttpd
"TUX" (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.

Other interesting links
Jeff Darcy's notes on high-performance server design
Ericsson's ARIES project: benchmark results for Apache 1 vs. Apache 2 vs. Tomcat on 1 to 12
processors
Prof. Peter Ladkin's Web Server Performance page.
Novell's FastCache: claims 10000 hits per second. Quite the pretty performance graph.
Rik van Riel's Linux Performance Tuning site

Translations
Belorussian translation provided by Patric Conrad at Ucallweconn

Changelog
2011/07/21
Added nginx.org
$Log: c10k.html,v $
Revision 1.212 2006/09/02 14:52:13 dank
added asio
Revision 1.211 2006/07/27 10:28:58 dank
Link to Cal Henderson's book.
Revision 1.210 2006/07/27 10:18:58 dank
Listify polyakov links, add Drepper's new proposal, note that FreeBSD 7 might move to 1:1
Revision 1.209 2006/07/13 15:07:03 dank
link to Scale! library, updated Polyakov links
Revision 1.208 2006/07/13 14:50:29 dank
Link to Polyakov's patches
Revision 1.207 2003/11/03 08:09:39 dank
Link to Linus's message deprecating the idea of aio_open
Revision 1.206 2003/11/03 07:44:34 dank
link to userver
Revision 1.205 2003/11/03 06:55:26 dank
Link to Vivek Pei's new Flash paper, mention great specweb99 score

Copyright 1999-2014 Dan Kegel
dank@kegel.com
Last updated: 5 February 2014
[Return to www.kegel.com]