Toru Nishimura
Sanctum Networks, Pvt. Ltd.
nisimura@sanctumnetworks.com

Abstract

A many-core processor is an attractive platform for running a general purpose OS like NetBSD. We, a team at Sanctum Networks, ported NetBSD 6.0 to a 64bit VLIW many-core processor named TILE-Gx. In this paper we introduce the distinctive features of TILE-Gx in some depth, as the product remains little known in our engineering community. The NetBSD port went more smoothly than anticipated. We realized that NetBSD 6.0 is a mature SMP OS which provides a streamlined kernel structure and offers a rich set of kernel APIs specifically designed for large-degree SMP configurations beyond 32 processors. As a part of the conclusion we mention some TILE-Gx NetBSD application areas which we are willing to build.

1. Project outlook and time line

This porting project was initiated in mid June 2012 in Tokyo. The goal was to achieve full SMP capabilities on the 36 core TILE-Gx processor and to understand the suitability of the TILE architecture for various application development.

- Our porting target is a Tilera Empower 1U computer with a 36 core TILE-Gx processor.
- The development environment with a cross compiler was set up in early July 2012.
- Two geographically separated Git repositories are kept in push-pull synchronization. The Japan side hosts are in Sakura VPS and Amazon EC2.
- Kernel image linking was completed in late July 2012. It contained a lot of stub code guaranteed not to work.
- The single core kernel succeeded in running the ramdisk sysinst program on 2012-10-30.
- Since then, kernel stability, the GbE driver and SMP have been pursued.
- As of March 2013 the porting project remains active. A number of functionalities, in particular ones for TILE-Gx unique features, are under development.

2. TILE-Gx features

The TILE-Gx design was invented by an MIT professor, Dr. Anant Agarwal. It's the latest incarnation of long-running research since the 1980s. TILE-Gx is the third generation product.
There are two predecessors, TILE64 and TILEPro, which are based on the same TILE architecture approach. Following the two 32bit designs, the third generation model is a 64bit processor.

The TILE architecture emphasizes scalability and low power with a unique on-chip inter-connection technology. The TILE-Gx product family comes in 9 to 100 core configurations. The cores are laid out in a tiles-on-wall fashion. Each core runs at a 1.2GHz clock. Under a work load with modest network activity, the 36 core TILE-Gx is said to consume 25-30W in total.

TILE-Gx is a 64bit processor. It has 64bit integer and 64bit floating point operations, 64bit registers and a 64bit address space with 64bit pointers. These simple characteristics have been quite familiar to any plain old UNIX programmer since the MIPS R4000 was introduced in the early 1990s.

2.1. Instruction set feature

The TILE instruction set architecture is 3 way VLIW. Two or three instructions are packed in a single 64bit word, which is called an "instruction bundle." TILE can run up to three instructions simultaneously.

TILE-Gx has a register file of sixty four 64bit registers. Table 2-1 shows the register definition. 58 of them, including two hardwired zero registers, are general purpose. The file contains a thread pointer register, tp, to facilitate thread programming. The table also shows the ABI, which defines the common register usage that programs ought to follow.

register  mnemonic   type                           usage
0 - 9     r0 - r9    saved by caller                arguments / return values
10 - 29   r10 - r29  saved by caller                "temporary"
30 - 51   r30 - r51  saved by callee                "safe across call"
52        r52        saved by callee                frame pointer
53        tp         dedicated                      thread pointer
54        sp         dedicated                      stack pointer
55        lr         saved by callee                return address
56        sn         always zero
57        idn0       on-chip network communication  I/O Dynamic Network 0
58        idn1       on-chip network communication  I/O Dynamic Network 1
59        udn0       on-chip network communication  User Dynamic Network 0
60        udn1       on-chip network communication  User Dynamic Network 1
61        udn2       on-chip network communication  User Dynamic Network 2
62        udn3       on-chip network communication  User Dynamic Network 3
63        zero       always zero

Table 2-1 register assignment for TILE-Gx ABI

Note that TILE passes a rather large number of function arguments, 10, in registers. Return values are placed in the same set of registers as the arguments. This means the content of the r0 register is destroyed quite often when a function exits. It is worth remembering this as one of the TILE debugging tips.

Six registers are reserved for inter-core communication via the on-chip networks named UDN and IDN. These registers work as FIFO ports, much like FSL "fast-simplex-link" in the MicroBlaze processor. Reading from an empty register or writing to an occupied register may stall the processor until the condition is met.

Registers are a shared resource among the concurrent VLIW execution flows. Each subroutine has a single entry point and a single exit. No parallel subroutines are in action. A register conflict in parallel execution is considered a program error. It leaves undefined or unexpected values in the register file. Avoiding register conflicts is the programmer's responsibility.

TILE instructions have two different formats: X-format for 2 instructions in a bundle and Y-format for 3 instructions in a bundle. Basic arithmetic is done in 3 operand form. As the register file holds 64 registers, a 3-operand instruction requires 18bit = 6 x 3 for register designation plus some more for the opcode. The 3-in-1 64bit bundle is a rather tight encoding, but it contributes much to instruction density. 2 register arithmetic instructions take a signed 8bit immediate or a signed 16bit immediate value. The latter belongs to the 2-in-1 bundle X-format as it needs a longer encoding. An unoccupied instruction slot in a bundle is filled with a NOP, which works as a flowing bubble in the execution pipeline.

The most notable difference from conventional CISC / RISC instruction sets is the lack of a "register indirect with offset" addressing mode. TILE has no "LW R3, 0x178(R2)" style memory access.
This means that a local variable on the stack and/or a (C language) struct member must be accessed through an explicit pointer held in a temporary register which refers to the target address. It's a stark contrast to processors where the "register indirect with offset" addressing mode performs a load / store with "base register + immediate value offset," which is very handy for local variables and struct members. This is another tip to remember for TILE programming.

2.2. Other useful instruction

The TILE instruction set has an orthogonal set of atomic math / lock instructions.

fetchadd  Atomic addition
fetchand  Atomic logical AND
fetchor   Atomic logical OR
cmpexch   compare-and-swap

Table 2-2 atomic instructions

All of them have two variations, for 8byte operation and 4byte operation. These instructions are well suited to implementing the NetBSD atomic_ops(3) routines. Those routines are a well defined set of MP-safe atomic operations, widely used in SMP NetBSD kernel constructs and/or parallel programming libraries like pthread(3).

The cmpexch instruction is for CAS "compare-and-swap" or TAS "test-and-set" operation. It works like the Intel cmpxchg instruction. Note that it's not based on the LL/SC synchronization model found in MIPS, Alpha, PowerPC and ARM64. TILE cmpexch works with an accompanying "CmpValue" SPR register. Locking primitives can be implemented with it in the usual manner.

TILE-Gx has a rich set of DSP and SIMD instructions. It also has some fancy instructions: a set of bit field operations, CLZ (count-leading-zeros), CRC32 polynomial math for hashing / checksum and so on.

TILE-Gx floating point math does not have a dedicated register set. FP instructions use GP registers for source / destination operands.

2.3. Address space

TILE-Gx has 42 effective address bits out of the 64bit virtual address (VA) pointer. The virtual address space is separated into an upper 2TB space and a lower 2TB space. There is a large void in between. VA[63:41] is either all-0 or all-1, that is, sign extended from the VA<41> value.

TILE-Gx has no address segments like MIPS KSEG0 / KSEG1 / XKPHYS to distinguish cache nature. Software is in charge of address space layout and cache nature control with the help of smart TLB usage.

2.4. Layered protection

The TILE architecture provides a four level protection scheme. Levels range from 0, least protected, to 3, most protected. This allows building layered protection domains which run protected programs in each level.

PL0  User applications
PL1  Guest OS
PL2  Tilera Hypervisor
PL3  "virtual machine monitor" (who knows what it really is.)

Table 2-3 TILE protection level

A program running at a lower protection level is inhibited from touching resources of a higher level. Each core runs at one of the four protection levels. The current protection level of an individual core is called CPL. Control transfer is done by executing a dedicated instruction: swint0, swint1, swint2 or swint3. NetBSD uses the swint1 instruction for system calls issued by application programs.

There is a large set of SPR registers. The mfspr / mtspr instructions operate on them. The SPR number is encoded in 14bit. Most SPR registers have their own MPL "minimal protection level" value to arbitrate which level (0 to 3) of program can access them. MPL is the basis of layered protection for the TILE runtime environment.

2.5. TLB and TSB

The TLB plays a central role in the TILE architecture. In the TILE architecture the TLB does not just make VM virtual memory possible but also realizes chip-wide global cache coherency.

A TILE TLB entry is designed to be multi-core aware. A TLB entry optionally holds the location of a core in the chip (in X-Y coordinates) to identify how the VA->PA mapping it describes is tied to a specific core.

Like many RISC processors, the TILE TLB is software managed. TILE-Gx has independent TLB stores for instruction and data: a 16 entry iTLB and a 32 entry dTLB. Note that the TLB is a shared resource among programs which run in different protection domains. The 42bit VA space is also shared among them.
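Since every protection domain shares the one 42bit VA space, it helps to be able to check which half of it an address falls in. The sign-extension rule of section 2.3 can be sketched as follows. This is a minimal illustration only; the names `VA_BITS`, `HALF_SIZE`, `is_canonical` and `sign_extend` are hypothetical and do not come from the NetBSD/tile source.

```python
# Minimal model of the TILE-Gx virtual address rule (hypothetical names).
VA_BITS = 42                      # effective address bits, VA<41> is the top one
MASK64 = (1 << 64) - 1
HALF_SIZE = 1 << (VA_BITS - 1)    # size of the lower (and upper) space: 2TB


def is_canonical(va):
    """True when VA[63:41] is all-0 or all-1, i.e. sign extended from VA<41>."""
    upper = (va & MASK64) >> (VA_BITS - 1)          # the 23 bits VA[63:41]
    return upper == 0 or upper == (1 << (64 - VA_BITS + 1)) - 1


def sign_extend(va42):
    """Build the 64bit pointer for a 42bit effective address."""
    va42 &= (1 << VA_BITS) - 1
    if va42 >> (VA_BITS - 1):                       # VA<41> set: upper 2TB space
        return va42 | (MASK64 ^ ((1 << VA_BITS) - 1))
    return va42                                     # lower 2TB space
```

Any pointer for which `is_canonical()` is false lies in the void between the two 2TB spaces.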
Tilera Hypervisor reserves some of the TLB entries for its own use. The remaining ones are free for guest OS and application programs to use.

The TLB management strategy is modeled after the SPARC processor. TILE uses the "TSB" and "TTE" nomenclature for the very same purposes.

TSB "translation store buffer" is a software extension of the TLB. The TSB holds a super set of the TLB in main memory. It works as a staging area from which TLB entries are injected into the processor's iTLB or dTLB. HV is in charge of TLB miss handling. It always consults the TSB content when doing so. Guest OS can only operate on the TSB store. As the TLB is a highly sensitive resource shared among various programs, guest OS can not access the TLB. The TSB is normally reserved inside the protected guest OS memory area.

The TILE TSB is a unified one which holds both iTLB entries and dTLB entries. The approach is different from SPARC64, which has an iTSB and a dTSB in parallel.

TTE "translation table entry" is the software defined intermediate format of a TLB entry.

On a TLB miss HV takes control to run the TLB refill operation. It first searches the TSB store for the offending TLB entry. If the target entry is found, HV injects it into either the iTLB or the dTLB and completes the refill operation. If HV finds that the TSB has no such entry, it posts a request for guest OS to come in and resolve this "TSB miss" condition.

Guest OS, in turn, responds to the TLB miss exception by identifying it as a genuine access error or a recoverable fault condition. The rest of the operation is identical to popular software managed TLB processors. If guest OS finds the exception is a true TLB refill case, it adds the offending TLB entry to the TSB store and returns. HV will take care of the refilling. If guest OS finds the exception is an access error or a protection violation, it handles the case in its own way.

TILE has ASID "address space identifier." ASID is there to improve the TLB hit ratio, that is, to give better VA->PA translation efficiency. It works in the same way as on other ASID processors.
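The refill path described above, with each entry tagged by an ASID, can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names (`GuestOS`, `Hypervisor`, `handle_tsb_miss`, `refill`) are hypothetical, the real hypervisor interface is not shown in this paper, the 64KB page size is the one NetBSD/tile uses, and the oldest-first TLB eviction is an arbitrary choice made for the sketch.

```python
# Toy model of the HV / guest OS split on a TLB miss (hypothetical names).
PAGE_SHIFT = 16                       # 64KB pages


class AccessFault(Exception):
    """Genuine access error: the guest OS has no mapping for the address."""


class GuestOS:
    """Owns the page table and the TSB; it never touches the TLB itself."""

    def __init__(self):
        self.page_table = {}          # (asid, vpn) -> pfn, guest VM state
        self.tsb = {}                 # (asid, vpn) -> pfn, staging area for HV

    def handle_tsb_miss(self, asid, vpn):
        key = (asid, vpn)
        if key not in self.page_table:
            raise AccessFault(hex(vpn << PAGE_SHIFT))
        self.tsb[key] = self.page_table[key]   # add the entry and return to HV


class Hypervisor:
    """Handles TLB misses by consulting the TSB, asking the guest if needed."""

    def __init__(self, guest, tlb_entries=32):
        self.guest = guest
        self.tlb = {}                 # (asid, vpn) -> pfn, insertion ordered
        self.tlb_entries = tlb_entries

    def refill(self, key):
        if key not in self.guest.tsb:              # "TSB miss"
            self.guest.handle_tsb_miss(*key)
        if len(self.tlb) >= self.tlb_entries:      # evict the oldest entry
            self.tlb.pop(next(iter(self.tlb)))
        self.tlb[key] = self.guest.tsb[key]        # inject into the TLB

    def translate(self, asid, va):
        """Stands in for the hardware lookup plus the miss handler."""
        key = (asid, va >> PAGE_SHIFT)
        if key not in self.tlb:
            self.refill(key)
        return (self.tlb[key] << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))
```

The ASID is part of the lookup key and not part of the address, so the same VA under two ASIDs selects two independent entries.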
The TILE ASID is 8bit, allowing 256 individual address spaces to be distinguished in TLB lookup operation. Some literature incorrectly states that ASID is an extension of VA, as if it realized a concatenated 8 + 42 bit address. ASID is there to virtualize the TLB, that is, to make imaginary multiple TLB stores which are numbered and selected by ASID. ASID demands a smarter VM to operate. This topic is discussed in a later section.

As other processors do, the TILE processor handles many kinds of interrupts / exceptions. Devices asynchronously post a variety of requests, and different types of exceptions happen while a processor is in action. TILE uses IPI "inter-processor interrupt" not only for pure inter-processor messaging, but also for I/O device interrupt notification. As the TILE integrated on-chip devices are located apart from the cores and notification comes across the on-chip network, it is reasonable to use IPI to fold the many kinds into a single form.

2.6. iMesh on-chip inter-connect

The processing cores are laid out in a tiles-covering-wall fashion with a mesh shaped inter-connect coupling them to each other. The inter-connect has an X-Y / street-avenue like layout. At each crossing is an independent switch processor which ties a computing node to the entire switch network. Tilera names it iMesh technology.

The switch processor is a 16bit RISC which runs a low latency and high bandwidth switching function through a limited number of signal connections. Besides the 4 paths for the N, E, S and W directions to neighboring switches, one switch data-path is coupled with the processor's L2 cache. A data stream travels through L2 first, then through either the L1 iCache or dCache to reach a processing core.

The inter-connect offers UDN "User Dynamic Network" and IDN "I/O Dynamic Network" for general purpose on-chip streaming and messaging communication. A total of 6 registers of the TILE-Gx processor are assigned to them for ease of programming.
It should be remembered that iMesh neither implements nor enforces any kind of "smart network topology." A number of massively-parallel multi-processor super computers have been built over time. All of them more or less pursued a smarter topology for the processor inter-connect to maintain low latency and high bandwidth. Notable examples are Cray T3D and SiCortex SC5832. T3D had a 3-dimensional "torus" graph topology to couple Alpha processors tightly to each other. SC5832 had a "Kautz graph" topology to inter-connect 6-core MIPS64 processors, with the help of built-in DMA engines to talk with L2 caches and I/O devices.

In the TILE architecture the on-chip inter-connect is software defined. The switch processors can be programmed to adapt the network topology to varying demands. In this way, the TILE architecture can maintain flexibility and scalability in parallel. It's unlikely that "topology optimized" super computers can achieve both natures in balance.

The iMesh API is provided for finer control over the on-chip network. Cores can be partitioned into groups which work in parallel as if they were islands. This feature is implemented by the switch network programmability. With the help of "topology-aware" and "cache attribute aware" TLB entries, iMesh plays a central role in cache coherency.

2.7. Cache design and feature

Each core has a 32KB iCache, a 32KB dCache and a 256KB i/d combined L2 cache. Both L1 caches have VIPT "Virtual Index and Physical Tag" nature.

L1 iCache  32KB, 2 way associative, 64B line size.
L1 dCache  32KB, 2 way associative, 64B line size, write-through.
L2 cache   256KB, i/d combined, 8 way associative, 64B line size, write-back.

Table 2-4 cache characteristics

Some TILE processor literature mentions a "coherent L3 cache." That is somewhat imprecise. The L3 functionality is achieved by a group of L2 caches. The scheme is called "cache homing." Let us start the explanation.

The TILE L2 cache is inclusive of L1. L1 holds a subset of L2 contents at any moment.
An L2 miss happens when the offending cache line data is not found in L2. The core asks the "neighboring cores," which are grouped by HV for a single OS instance, about the missing cache line data. If it is found there, the cache line data is transferred to the requester's L2 cache. Foreign L2 caches work as an extension of the local L2 cache. In other words, a group of cores share their L2 cache contents with each other. This scheme is named "cache homing" and Tilera calls the group of L2 caches a "coherent L3" cache. The 36 core Gx processor has a "9MB coherent L3" = 36x 256KB L2. Cache lines can be populated sparsely among different L2 caches to improve cache efficiency.

L3 cache homing is one of the page attributes. It's controllable on a per-page basis.

3. TILE-Gx on-chip devices

Integrated multiple DDR memory controllers

There are 2 controllers in the 36 / 16 core models and 1 in the 9 core model. TILEPro, the predecessor of TILE-Gx, had four DDR2 memory controllers on chip in its 64 core model. With the dual controller configuration, memory can be driven in an interleaved fashion.

mPIPE packet classifier

It's a programmable intelligent packet engine. It offers a "frame parse" function to run "sieve-to-forward" classification on the incoming Ethernet frame stream at line speed.

mPIPE is tightly integrated with the GbE / 10G Ethernet network interfaces. 4x 10G ports are available in the 36 core model. Each port can be reprogrammed to host 4x GbE network interfaces. GbE-only ports are also available in the 16 / 9 core models.

mPIPE has local buffer memory to handle incoming and outgoing Ethernet frames. mPIPE can perform load balancing to distribute ingress frames to cores.

A core binds the mPIPE device register set to a particular virtual address with a designated dTLB entry for control. mPIPE in turn holds an I/O TLB entry to access data which resides in the target address space (that of the accelerated application or guest OS), so that it can understand the VA->PA translation for frame data and accompanying descriptors.

mPIPE has its own 32bit instruction set. A special GCC toolchain is provided to program it.

- Two or three operand instructions.
- 32x 32bit register file; 22 of them are general purpose.
- Private SPR registers, accessed with mfspr and mtspr.

MiCA crypto and compression engine

It's a standalone computing processor populated inside TILE-Gx. Multiple MiCA processors are on a single Gx. MiCA can copy data while an encrypting or compressing operation is in action. It's a streaming operation.

A core binds the MiCA device register set to a particular virtual address with a designated dTLB entry for control. MiCA in turn holds an I/O TLB entry to access data which resides in the target address space, so that it can understand the VA->PA translation for crypto / compression data.

Conventional I/O devices

There are some conventional I/O devices like PCIe, USB2.0 and I2C/SPI in our porting target computer. The PCIe controller works in either root-complex (host) or end-point (device) mode. USB2.0 is used for multiple purposes. It works as a virtual console during development and debugging. It can also inject a binary image into the Gx processor to run. The binary image consists of boot programs, the HV image and the guest OS in a predefined format.

4. Tilera Hypervisor

HV utilizes the TILE protection level feature. Guest OS has heavily limited access to SPR registers. Only a handful of SPR registers are allowed to be used by guest OS. HV is placed in a 16MB area in the upper 2TB space with hardwired TLB entries.

HV has great control over the entire TILE processor complex. HV forms cores into groups, which are managed in an M x N rectangle shape, to make up an OS instance. HV assigns I/O devices to particular instances with I/O TLB entries. Tilera calls the scheme the MMIO "memory mapped IO" scheme while SPARC names it "IOMMU."

HV allows several guest OSes to run simultaneously. Device and core grouping is defined by an HV configuration at machine startup. Because of this, HV is not yet as flexible as what Xen can do these days.

Two serial ports are provided in the Gx processor.
HV can dynamically bind one of the serial ports to a running OS instance as its console.

BME (bare metal environment)

BME is an API to build a "light weight monitor" with which designated core(s) run a special purpose "driver" for data-plane processing. In general any BME program needs an accompanying fully-featured OS, like Linux, as a control plane to manage the whole software complex. The iMesh messaging facility API is used by the control OS to communicate with BME programs which run on separate core(s).

Several code examples are provided by Tilera:

- One TILE core runs an "encryption server" on BME while Linux acts as a "client" which receives the results from BME. In this example data transfer is done in a shared page with the help of UDN messaging between the two.
- A number of Linux processes get private cores to run on and communicate with each other through UDN messaging and shared pages.

5. NetBSD/tile

This port is based on the NetBSD 6.0 STABLE code set. It has a 64bit SMP kernel and a 64bit userland. The kernel runs as a guest OS in conjunction with Tilera Hypervisor.

NetBSD/tile uses GCC 4.6.3 ported by Tilera. We have been using it as it is. As GCC 4.5 is still in use in the NetBSD 6.0 code set, we integrated GCC 4.6.3 to start with.

A 64bit pmap was implemented from scratch. It's modeled after the Alpha pmap. Although TILE-Gx offers 13 different page sizes, HV employs a much humbler page size selection. We chose the 64KB page for NetBSD/tile as it parallels the Tilera Linux VM implementation. The virtual address partitioning is "10 + 8 + 8 + 16."

NetBSD/tile utilizes the SMP ready NetBSD kernel internals as much as possible. NetBSD 5 introduced much more sophisticated kernel constructs and API sets which are effective and useful for a scalable SMP OS. Since then, gradual streamlining has been done for the fore-running SMP NetBSD ports. Now NetBSD is a mature platform from which to jump start a fresh SMP port. The following is the typical set of useful SMP APIs:

- atomic_ops(3)
- kcpuset(9)
- xcall(9)

The first group must be implemented in an early kernel porting stage.
In most cases they have to be written in assembler code to best suit the nature of the particular processor. The latter two are pure software constructs written in plain C code.

The parallel programming model is NetBSD pthread. NetBSD pthread is well organized to adapt to various processors with minimum effort. We did not make any particular modification for TILE-Gx support. It works just like any other pthread implementation, such as the one in Tilera Linux.

A very limited number of assembler files have been written so far. There is only one for the kernel: "locore.S". The file contains 4 well defined major routines:

- CPU startup for the primary core and secondary cores.
- Exception entry / dispatch / return.
- CPU context switch.
- Fast software interrupt dispatch / return.

Other assembler files are for libraries and a few application programs like rtld(1). The following is the list of major TILE-Gx dependencies of concern.

- src/common/lib/libc/arch/
- src/lib/libc/arch/
- src/libexec/rtld/arch/

5.1. Key design decisions

In this section we describe concisely some of the design decisions needed to realize a port.

- struct trapframe, struct switchframe and struct pcb.
- USPACE to hold the kernel stack and struct pcb.
- pmap(9) to interface the processor with NetBSD VM.
- Exception and interrupt handling which complies with the target processor's design intent.
- IPI "inter-processor interrupt," which is essential to make SMP possible.

struct trapframe is a snapshot image of a runtime context. One trapframe is always created at the high end address of USPACE. The actual kernel stack starts just below it and grows downward. The reserved trapframe area is for the user process context. Whenever a user process gets interrupted by an exception or device notification, the trapframe records the user context to resume later. This area is also used for system calls.

While in kernel mode, the kernel gets interrupted for the same reasons as a user mode process does. On that occasion, a trapframe is created and pushed on the kernel stack.

The TILE architecture has a register file of 64 registers.
8 of them are not part of the process context and can be excluded. We chose a 64x 64bit = 512B size for struct trapframe anyway. In the vacant fields we place some extra context for the process to retain: the exception return address, the status register value at the time the exception happened, the offending exception type and the value of a certain SPR, namely "CmpValue," for the cmpexch instruction.

struct switchframe is for the CPU context switch. NetBSD defines two context switch routines. cpu_switchto(9) and lwp_return(9) are the routines which perform the context switch. The TILE architecture has a large set of callee-saved registers. Our switchframe is 25x 8B = 200B in size.

struct pcb is one of the longest survivors among UNIX kernel primitives. It has become smaller than it used to be since the way of running a context switch got smarter. Our struct pcb is just large enough to hold struct switchframe and a bit extra.

The USPACE size is 64KB, aligned with the NetBSD/tile page size.

5.2. ASID management

ASID management is modeled after the way used for NetBSD/alpha and NetBSD/mips. In this section we explain it in more detail.

The kernel has a variable for the "ASID generation number" to make sure a unique ASID is assigned to each process running on the processor. It's the central idea. Our ASID management algorithm works in this way.

pmap_activate(9), one of the NetBSD kernel APIs, switches the processor's current ASID value whenever a new process is ready to take control. Switching the current ASID is a light weight operation for the OS as it eliminates the necessity of a TLB flush at every context switch.

ASID-less processors need to perform a whole scale TLB invalidation which discards all entries at every context switch. As the TLB works as a cache for VM address translation, a TLB flush severely hurts the TLB hit ratio and spoils VM performance.

ASID-aware processors just need to switch the current ASID value. Changing the processor's current ASID can be considered as switching between imaginary TLB stores, one of which exists for each ASID value.
Every newly born process has no ASID assigned. pmap_activate() chooses a new one which has never been allocated so far and assigns it to the process. pmap_activate() also records the current ASID generation number in the process's pmap store.

ASID is a small number which counts only up to 255. If pmap_activate() finds the 8bit range exhausted, it bumps the ASID generation number in the kernel variable by 1 and chooses a new ASID wrapped to the least available number (normally 1, as ASID 0 is reserved for the NetBSD kernel pmap). On this occasion, the kernel makes a full scale TLB invalidation to discard all TLB entries.

Whenever pmap_activate() is about to switch the current ASID, it checks whether the ASID generation number in the kernel variable matches the process's generation number recorded at ASID creation. If they differ, the process's ASID is no longer valid. pmap_activate() then selects and assigns a fresh ASID for the process to run with, recording the current ASID generation number too. At any given moment every running process has its own unique ASID.

The generation number scheme greatly reduces the necessity of full scale TLB invalidation. A TLB flush only happens when the ASID range runs out and the ASID generation number is to be bumped.

5.3. TLB shootdown

TLB shootdown is an essential operation in any SMP kernel. Like the processor cache, the TLB is a resource local to a processor core. The way to invalidate the local cache or the local TLB is provided by a certain mechanism. In general, invalidating a remote TLB is as hard to achieve as invalidating a remote cache.

In an SMP system, a TLB invalidate operation must be propagated to the multiple cores which have been running a particular process. The process's pmap must maintain a "processor set" to track which cores have run it.
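The ASID generation scheme of section 5.2, on which the shootdown technique builds, can be sketched as a small model. This is a simplified illustration and not the NetBSD/tile code: the `Pmap` and `Cpu` classes and their field names are hypothetical, a single processor is modeled, and the TLB flush is represented by a counter only.

```python
# Toy model of ASID allocation with a generation number (hypothetical names).
ASID_KERNEL = 0       # reserved for the kernel pmap
ASID_MAX = 255        # 8bit ASID


class Pmap:
    def __init__(self):
        self.asid = None          # a newly born process has no ASID
        self.asid_gen = None      # generation recorded at ASID assignment


class Cpu:
    def __init__(self):
        self.asid_gen = 1         # the kernel's ASID generation number
        self.next_asid = 1        # least ASID not yet handed out
        self.current_asid = ASID_KERNEL
        self.tlb_flushes = 0      # counts full scale TLB invalidations

    def _assign_fresh_asid(self, pmap):
        if self.next_asid > ASID_MAX:     # the 8bit range is exhausted
            self.asid_gen += 1            # bump the generation by 1
            self.next_asid = 1            # wrap to the least available ASID
            self.tlb_flushes += 1         # discard all TLB entries
        pmap.asid = self.next_asid
        pmap.asid_gen = self.asid_gen
        self.next_asid += 1

    def pmap_activate(self, pmap):
        # A missing ASID or a stale generation means the ASID is not valid.
        if pmap.asid is None or pmap.asid_gen != self.asid_gen:
            self._assign_fresh_asid(pmap)
        self.current_asid = pmap.asid     # no TLB flush needed here
        return pmap.asid
```

In this model 255 processes can be activated in any order without a single flush; the 256th bumps the generation, and processes from the old generation pick up fresh ASIDs lazily the next time they are activated.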
Here is the explanation of remote TLB shootdown by ASID bump.

When pmap(9) detects the necessity to invalidate one or more TLB entries of a particular process, the kernel needs to run the invalidate operation for both:

- the "local" core which happens to run the kernel on behalf of pmap() at that very moment, and
- all of the "remote" cores which the process's pmap is aware of.

The latter operation is named "TLB shootdown." It's implemented with IPI, which triggers a remote core action by an inter-core message. The TLB shootdown logic can be built with the help of the xcall(9) "cross call" kernel API.

Smart ASID management can achieve remote TLB invalidation at a small cost.

- Mark the ASID in the offending process's pmap store "unassigned."
- Broadcast an xcall(9) message to remote cores, triggering IPI.

When one of the cores is about to run the process at the next scheduling, pmap_activate() will choose and assign a fresh ASID to the offending process. The stale TLB entries with the abandoned ASID are thereby invalidated at once.

5.4. Useful SMP facilities in NetBSD 6

The SMP NetBSD kernel provides ways to manage CPUs at a finer grain. There is a less known set of useful commands. Let us mention them in brief.

- cpuctl(8) ... try "/usr/sbin/cpuctl list" on your modern Intel computers. It shows the list of CPU states, which tells online / offline.
- psrset(8) ... try "/usr/sbin/psrset -p" on your modern Intel computers. It can create an arbitrary number of "processor sets" which can be bound to any process. CPU affinity is made possible by processor set binding. It would be possible to bind a processor set to a kthread (kernel thread) which runs a specific kernel subsystem like GbE and/or disk drivers.
- schedctl(8) ... try "/usr/sbin/schedctl -p 1" on your modern Intel computers. It assigns one of the predefined scheduling policies to a process. It replaces the nice(1) and renice(8) priority control commands.

Three different scheduling policies are provided by NetBSD so far.

- Time-sharing, which follows the traditional UNIX semantics used for a long time.
- First-in, First-out
- Round-robin

5.5. Future development

This project is active. Here we try to summarize missing functionalities and future development in arbitrary order.

- TILE-Gx native FP instructions are to be used soon. Currently the entire NetBSD, including userland, is built with the "-DSOFTFLOAT" compile option.
- Drivers for some conventional PCIe devices like SATA and/or 100M Ethernet NIC. Currently the whole system code image is injected with the USB debugging facility to run an NFS diskless configuration.
- iMesh communication API for NetBSD. It remains under research. For now there is no provision to utilize iMesh programming.
- MiCA integration with a proper API. The NetBSD kernel has the pcu(9) "per-CPU-unit" framework. It's for the encapsulation of a CPU's hardware context to save / restore. It handsomely covers the cases beyond the general purpose registers. The typical usage of pcu() is to manipulate the FPU register set. We're considering whether pcu() can integrate multiple MiCA units into the NetBSD kernel in a sane manner.
- NetBSD/xen allows dynamic attach / detach maneuvers while the kernel is up and running. It allows cores and block devices to attach / detach dynamically. We assume it'd be somewhat difficult to implement similar functionality in TILE; however, it'd be worth pursuing a way to make it possible. We're aware that Tilera HV has no provision to start up / tear down a "targeted" core while up and running. HV source code is disclosed as a part of the Tilera MDE development package. It's said that HV can be extended for a customer's own needs.
- Transition from GCC4 to LLVM is recognized as mandatory, as it would exploit the potential of the TILE VLIW nature.

6. NetBSD TILE-Gx applications

We focus on compute-intensive markets. We're considering engaging in SDN, VLDB search engines and desktop HPC.

SDN (Software Defined Network)

It's the third wave of virtualization technology: server virtualization, storage virtualization and then network virtualization.
Industry trends predict that routers and firewalls will vanish soon as they morph into big smart switches.

The Bangalore team is now exploring super fast frame forwarding algorithms. They are generalized as a "search-and-lookup" computational complexity reduction problem. Problem statements are now being defined. The implementation of the algorithms must be robust enough to handle the incoming frame stream as fast as it arrives, at wire-speed rate. It must also be robust against combinatorial explosion of matching rules.

VLDB search engine

These days Very Large scale DBs are directly connected to the Internet. They work in a real time manner. The typical case is an SNS like Facebook. "mem-caching" is now a common tactic to implement a super fast search engine. We recognize that many-core processors and GPGPU are now gathering industry attention as they would be a good vehicle, in the engineering sense, for VLDB search engines.

Desktop HPC (High Performance Computing)

It's a kind of everlasting human desire to own a super computer at hand. TILE-Gx can be a handy basis for a many-core 64bit general purpose computer. It's said that the next generation of Gx can be extended by wiring multiple processors together with the Interlaken inter-connect. Today a pair of Tesla GPGPU 16x lane PCIe cards can achieve Tflops grade computing power. Then, how about making the twenty first century incarnation of the desktop personal computer, let's say, one whose looks are just like the SGI Indigo or NeXTcube?

7. Conclusion

Porting NetBSD 6.0 to TILE-Gx was found easier than anticipated since NetBSD 6.0 provides SMP ready kernel constructs and API sets to use. The number of lines written in assembler was very small as the essential parts of the porting burden are well defined. The VLIW nature of the processor was not found to be a hurdle.

Acknowledgment

Sanctum Networks wishes to express its gratitude to all members involved in this project, especially the members from Japan who contributed critically in the early stages.