Abc2013 K1 Paper

64bit SMP NetBSD OS Porting
for TILE-Gx VLIW Many-Core Processor

Toru Nishimura
Sanctum Networks, Pvt. Ltd.
nisimura@sanctumnetworks.com
Abstract
A many-core processor is an attractive platform for running a general purpose OS like NetBSD. We, a team at Sanctum
Networks, ported NetBSD 6.0 to a 64bit VLIW many-core processor named TILE-Gx. In this paper we introduce the
distinctive features of TILE-Gx in some depth, as the product remains little known in our engineering community.
The NetBSD port went better and more smoothly than anticipated. We realized that NetBSD 6.0 is a mature SMP OS which
provides a streamlined kernel structure and offers a rich set of kernel APIs specifically designed for large scale SMP
configurations beyond 32 processors. As part of the conclusion we mention some TILE-Gx NetBSD application
areas which we intend to build.
1. Project outlook and time line
This porting project was initiated in mid June 2012 in
Tokyo. The goal was to achieve full SMP capabilities on
the 36 core TILE-Gx processor and to understand the
suitability of the TILE architecture for various kinds of
application development.
- Our porting target is a Tilera Empower 1U computer
  with a 36 core TILE-Gx processor.
- The development environment with a cross compiler was
  set up in early July 2012.
- Two geographically separated GIT repositories are kept
  in push-pull synchronization.
- The Japan side hosts are on Sakura VPS and Amazon
  EC2.
- Kernel image linking was completed in late July 2012. It
  contained a lot of stub code guaranteed not to work.
- The single core kernel succeeded in running the ramdisk
  sysinst program on 2012-10-30.
- Since then, kernel stability, the GbE driver and SMP have
  been pursued.
- As of March 2013 the porting project is still active. A
  number of functionalities, in particular ones for
  TILE-Gx unique features, are under development.
2. TILE-Gx features
The TILE-Gx design was invented by an MIT professor, Dr.
Anant Agarwal. It's the latest incarnation of long running
research since the 1980s. TILE-Gx is the third generation
product. There are two predecessors, TILE64 and TILEPro,
which are based on the same TILE architecture approach.
Following the two 32bit designs, the third generation
model is a 64bit processor.
The TILE architecture emphasizes scalability and low
power with a unique on-chip inter-connection technology.
The TILE-Gx product family has 9 to 100 core configurations.
The cores are laid out in tiles-on-wall fashion. A core
runs at a 1.2GHz clock. Under a work load with modest
network activity the 36 core TILE-Gx is said to consume
25-30W of power in total.
TILE-Gx is a 64bit processor. It has 64bit integer and
64bit floating point operations, 64bit registers and a 64bit
address space addressed by 64bit pointers. These simple
characteristics have been quite familiar to any plain old
UNIX programmer since the time when the MIPS R4000 was
introduced in the early 1990s.
2.1. Instruction set feature
The TILE instruction set architecture is 3 way VLIW. Two or
three instructions are packed in a single 64bit word which is
called an "instruction bundle." TILE can run up to
three instructions simultaneously.
TILE-Gx has a register file of sixty four 64bit registers.
Table 2-1 shows the register definition. 58 of them, including
two hardwired zero registers, are general purpose. The set
contains the thread pointer register tp to facilitate thread
programming. The table also shows the ABI which defines the
common register usage programs ought to follow.
register   mnemonic    type                            usage
0 - 9      r0 - r9     saved by caller                 arguments/return values
10 - 29    r10 - r29   saved by caller                 "temporary"
30 - 51    r30 - r51   saved by callee                 "safe across call"
52         r52         saved by callee                 frame pointer
53         tp          dedicated                       thread pointer
54         sp          dedicated                       stack pointer
55         lr          saved by callee                 return address
56         sn          always zero
57         idn0        on-chip network communication   I/O Dynamic Network 0
58         idn1        on-chip network communication   I/O Dynamic Network 1
59         udn0        on-chip network communication   User Dynamic Network 0
60         udn1        on-chip network communication   User Dynamic Network 1
61         udn2        on-chip network communication   User Dynamic Network 2
62         udn3        on-chip network communication   User Dynamic Network 3
63         zero        always zero
Table 2-1 register assignment for TILE-Gx ABI
Note that TILE passes a rather large number of arguments,
10, in registers for a function call. The return values
are placed in the same set of registers as the arguments. This
means the content of r0 will be destroyed quite often
when a function exits. It's worth remembering as one of the
TILE debugging tips.
Six registers are reserved for inter-core communication
via the on-chip networks named UDN and IDN. These
registers work as FIFO ports much like FSL
"fast-simplex-link" in the MicroBlaze processor. Reading from an
empty register or writing to an occupied register may cause
the processor to stall until the condition is met.
Registers are a shared resource among the VLIW concurrent
execution flows. Each subroutine has a single entry point
and a single exit. No parallel subroutines are in action.
A register conflict in parallel execution is considered a
program error. It brings undefined / unexpected values into
the register file. Register conflict avoidance is the
programmer's responsibility.
TILE instructions have two different formats: X-format for
2 instructions in a bundle and Y-format for 3 instructions in
a bundle. Basic arithmetic is done in 3 operand form.
As the register file holds 64 registers, a 3-operand instruction
requires 18bit = 6 x 3 for register designation plus some
more for the opcode. The 3-in-1 64bit bundle is a rather
tight encoding; however, it contributes much to instruction
density. 2 register arithmetic instructions take a signed 8bit
immediate or a signed 16bit immediate value. The latter
belongs to the 2-in-1 bundle X-format as it needs a
longer encoding. An unoccupied instruction slot in a bundle
is filled with a NOP which works as a flowing bubble in the
execution pipeline.
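The encoding budget described above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and the even split of the bundle between slots is an assumption made for the estimate, not the real TILE-Gx slot layout, which this paper does not give.

```python
REGISTER_COUNT = 64   # registers in the TILE-Gx register file
BUNDLE_BITS = 64      # one instruction bundle is one 64bit word


def register_field_bits(register_count=REGISTER_COUNT):
    """Bits needed to name one register out of register_count."""
    return (register_count - 1).bit_length()


def operand_bits(operands=3):
    """Bits spent on register designation by one instruction."""
    return operands * register_field_bits()


def spare_bits_per_slot(slots):
    """Bits left for the opcode if the bundle were split evenly
    (an assumption) and every slot named three registers."""
    return BUNDLE_BITS // slots - operand_bits(3)


def fits_signed(value, bits):
    """True if value is representable as a signed immediate of that width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
```

Under these assumptions a 3-in-1 bundle leaves far fewer spare bits per slot than a 2-in-1 bundle, which fits the observation that instructions with a 16bit immediate go to the X-format.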
The most notable difference from conventional CISC /
RISC instruction sets is the lack of a "register indirect with
offset" addressing mode. TILE has no "LW R3,
0x178(R2)" style memory access. This means that a local
variable on the stack and/or a (C language) struct member must
be accessed through an explicit pointer in a temporary
register which holds the target address. It's a stark contrast to
processors where the "register indirect with offset" addressing
mode performs a load / store operation with "base
register + immediate value offset", which is very handy for
local variables and struct members. This is another tip to
remember for TILE programming.
2.2. Other useful instruction
The TILE instruction set has an orthogonal set of atomic math
/ lock instructions.
fetchadd   atomic addition
fetchand   atomic logical AND
fetchor    atomic logical OR
cmpexch    compare-and-swap
Table 2-2 atomic instructions
All of them have two variations, for 8byte operation and
4byte operation. These instructions are very
useful to implement the NetBSD atomic_ops(3) routines,
a well defined set of MP-safe atomic operations
widely used in SMP NetBSD kernel constructs and/or
parallel programming libraries like pthread(3).
The cmpexch instruction is for CAS "compare-and-swap" or
TAS "test-and-set" operations. It works like the Intel
cmpxchg instruction. Note that it's not based on the LL/SC
synchronization model found in MIPS, Alpha, PowerPC and
ARM64. TILE cmpexch works with the accompanying
"CmpValue" SPR register. Locking primitives can be
implemented with it in the usual manner.
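To make the role of the CmpValue register concrete, here is a small Python model of how a try-lock could be layered on cmpexch. It is a sequential sketch, not TILE-Gx code: the class and method names are hypothetical, the real instruction is atomic in hardware while this model is not, and the convention that 0 means "free" and 1 means "held" is an assumption.

```python
class Word:
    """One memory word shared between cores (model only)."""

    def __init__(self, value=0):
        self.value = value


class Core:
    def __init__(self):
        self.cmp_value = 0   # models the CmpValue SPR

    def set_cmp_value(self, value):
        """Stands in for an mtspr to the CmpValue SPR."""
        self.cmp_value = value

    def cmpexch(self, word, new):
        """Store new only if the word equals CmpValue; return the old value."""
        old = word.value
        if old == self.cmp_value:
            word.value = new
        return old


def try_lock(core, lock):
    """Take the lock if it is free (0); report success."""
    core.set_cmp_value(0)
    return core.cmpexch(lock, 1) == 0


def unlock(lock):
    lock.value = 0
```

A spinning lock would simply retry try_lock() until it succeeds; memory ordering concerns are outside the scope of this model.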
TILE-Gx has a rich set of DSP and SIMD instructions. It
also has some fancy instructions: a set of bit field
operations, CLZ (count-leading-zeros), CRC32
polynomial math for hashing / checksum and so on.
TILE-Gx floating point math does not have a dedicated
register set. FP instructions use GP registers for source /
destination operands.
2.3. Address space
TILE-Gx has 42 effective address bits out of the 64bit VA
virtual address pointer. The virtual address space is separated
into an upper 2TB space and a lower 2TB space. There is a large
void in between. VA[63:41] is either all-0 or all-1, that
is, sign extended from the VA[41] value.
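The sign extension rule can be expressed as a short validity check. This Python sketch follows the numbers in the text (42 effective bits, VA[41] as the sign bit); the function names are hypothetical.

```python
VA_BITS = 42                     # effective address bits
LOW_MASK = (1 << VA_BITS) - 1    # selects VA[41:0]
MASK64 = (1 << 64) - 1


def sign_extend_va(va):
    """Rebuild a 64bit pointer from VA[41:0] by copying VA[41] upward."""
    low = va & LOW_MASK
    if low >> (VA_BITS - 1):     # VA[41] is set: upper 2TB space
        low |= MASK64 & ~LOW_MASK
    return low


def is_valid_va(va):
    """True if VA[63:41] is all-0 or all-1."""
    return 0 <= va <= MASK64 and sign_extend_va(va) == va


def half(va):
    """Tell which 2TB space a valid address belongs to."""
    if not is_valid_va(va):
        raise ValueError("address falls in the void")
    return "upper" if va >> 63 else "lower"
```

Every address between the top of the lower space and the bottom of the upper space fails is_valid_va(), which is the "large void" mentioned above.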
TILE-Gx has no address segments like MIPS KSEG0 / KSEG1 /
XKPHYS to distinguish cache nature.
Software is in charge of address layout and cache
nature control with the help of smart TLB usage.
2.4. Layered protection
The TILE architecture provides a four level protection scheme.
Levels range from 0, least privileged, to 3, most
privileged. It allows building layered protection domains
which run protected programs in each level.
PL0   User applications
PL1   Guest OS
PL2   Tilera Hypervisor
PL3   "virtual machine monitor" (who knows what
      it really is.)
Table 2-3 TILE protection level
A program running at a lower protection level is inhibited from
touching resources of a higher level. Each core runs at one
of the four protection levels. The current protection level of
an individual core is called CPL. Control transfer is done by
executing a dedicated instruction: swint0, swint1, swint2 or
swint3. NetBSD uses the swint1 instruction for system calls
issued by application programs.
There is a large set of SPR registers. The mfspr / mtspr
instructions operate on them. The SPR number is encoded in
14bit. Most SPR registers have their own MPL
"minimal protection level" value to arbitrate which level
(0 to 3) of program can access them. MPL is the basis of
layered protection for the TILE runtime environment.
2.5. TLB and TSB
The TLB plays a central role in the TILE architecture. In TILE
the TLB does not just make VM virtual memory
possible but also realizes chip-wide global cache
coherency. The TILE TLB entry is designed to be multi-core
aware. A TLB entry optionally holds the location of a core in
the chip (in X-Y coordinates) to identify how the
VA-PA mapping of the entry is tied to a specific core.
Like many modern processors, the TILE TLB is
software managed. TILE-Gx has independent TLB stores
for instruction and data: a 16 entry iTLB and a 32 entry
dTLB. Note that the TLB is a shared resource among
programs which run in different protection domains. The
42bit VA space is also shared among them. Tilera
Hypervisor reserves some TLB entries for its own use.
The remainder is free for the guest OS and application
programs to use.
The TLB management strategy is modeled after the SPARC
processor. TILE uses the "TSB" and "TTE" nomenclature
for the very same purposes.
TSB "translation storage buffer" is a software extension of
the TLB. The TSB holds a super set of the TLB in main memory. It
works as a staging area to inject TLB entries into the
processor's iTLB or dTLB. HV is in charge of TLB miss
handling. It always consults the TSB content in action.
The guest OS can only operate on the TSB store. As the TLB is a
highly sensitive resource shared among various
programs, the guest OS can not access the TLB directly. The TSB
is normally reserved inside the protected guest OS memory
area. The TILE TSB is a unified one which holds both iTLB
entries and dTLB entries. The approach is different from SPARC64
which has iTSB and dTSB in parallel. TTE "translation
table entry" is the software defined intermediate format
of a TLB entry.
On a TLB miss HV takes control to run the TLB refill
operation. It first searches the TSB store for the offending
entry. If the target entry is found, HV injects it into
either the iTLB or the dTLB and completes the refill operation.
If HV finds the TSB has no such entry, it posts a request
for the guest OS to come in and solve this "TSB miss"
condition. The guest OS, in turn, responds to the TLB miss
exception, identifying it as a genuine access error or a
recoverable fault condition. The rest of the operation is
identical to popular software managed TLB processors.
If the guest OS finds the exception is a true TLB refill case,
it adds the offending entry to the TSB store and returns.
HV will take care of the refilling. If the guest OS finds the
exception is an access error or protection violation, it handles
the case in its own way.
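The division of labour between HV and the guest OS in the refill path can be sketched as a toy model. The following Python code is an illustration under stated assumptions: the class and method names are hypothetical, page numbers stand in for full TTEs, a single TLB replaces the separate iTLB and dTLB, and the "evict the oldest entry" replacement policy is invented for the model.

```python
class GuestOS:
    def __init__(self, mappings):
        self.mappings = dict(mappings)   # vpn -> ppn known to the guest VM
        self.tsb = {}                    # guest-managed TSB store

    def handle_tsb_miss(self, vpn):
        """Fill the TSB if the fault is recoverable; report the outcome."""
        if vpn not in self.mappings:
            return False                 # genuine access error
        self.tsb[vpn] = self.mappings[vpn]
        return True


class Hypervisor:
    def __init__(self, guest, tlb_entries=32):
        self.guest = guest
        self.tlb_entries = tlb_entries
        self.tlb = {}                    # only HV touches the TLB

    def translate(self, vpn):
        if vpn not in self.tlb:
            self.tlb_miss(vpn)
        return self.tlb[vpn]

    def tlb_miss(self, vpn):
        # Consult the TSB first; involve the guest OS only on a TSB miss.
        if vpn not in self.guest.tsb:
            if not self.guest.handle_tsb_miss(vpn):
                raise LookupError("access error at vpn %d" % vpn)
        if len(self.tlb) >= self.tlb_entries:
            del self.tlb[next(iter(self.tlb))]   # evict the oldest entry
        self.tlb[vpn] = self.guest.tsb[vpn]      # inject from the TSB
```

Note that once an entry sits in the TSB, later TLB misses on it are served by HV alone, without entering the guest OS.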
TILE has ASID "address space identifier." ASID is to
improve the TLB hit ratio, that is, better VA to PA translation
efficiency. It's as normal as, and identical to, other ASID
processors. The TILE ASID is 8bit, offering 256 individual
address spaces to be distinguished in the TLB lookup
operation. Some literature incorrectly mentions that ASID
is an extension of VA, as if it realized a concatenated
8 + 42 bit address. ASID is to virtualize the TLB, or to make
imaginary multiple TLB stores which are numbered and
selected by ASID. ASID demands a smarter VM to
operate. This topic will be discussed in a later section.
As other processors do, the TILE processor handles many
kinds of interrupts / exceptions. Devices asynchronously
post a variety of requests, and different types of exceptions
happen while a processor is in action. TILE uses IPI
"inter-processor interrupt" not only for pure inter-processor
messaging, but also for I/O device interrupt
notification. As the TILE integrated on-chip devices are
located apart from the cores and notification comes across the
on-chip network, it is reasonable to use IPI, unifying
the many kinds into a single form.
2.6. iMesh on-chip inter-connect
Processing cores are laid out in a tiles-covering-wall fashion
with a mesh shaped inter-connect to couple them with each other.
The inter-connect has an X-Y / street-avenue like layout. At
each crossing is an independent switch processor which ties a
computing node to the entire switch network. Tilera
names it iMesh technology.
The switch processor is a 16bit RISC which runs a low latency
and high bandwidth switching function through a limited
number of signal connections. Besides the 4 paths for the N,
E, S and W directions to neighboring switches, one
switch data-path is coupled with the processor's L2 cache.
A data stream travels through L2 first, then either the L1
iCache or dCache, to reach a processing core.
The inter-connect offers UDN "User Dynamic Network"
and IDN "I/O Dynamic Network" for general purpose
on-chip streaming and messaging communication. A total of
6 registers of the TILE-Gx processor are assigned to them
for ease of programming.
It should be noted that iMesh neither implements nor
enforces any kind of "smart network topology." A
number of massively-parallel multi-processor super
computers have been built from time to time. All of them more
or less pursued a smarter topology for the processor
inter-connect to maintain low latency and high bandwidth.
Notable examples are Cray T3D and SiCortex SC5832.
T3D had a 3-dimensional "torus" graph topology to make
Alpha processors tightly coupled with each other. SC5832 had
a "Kautz graph" topology to inter-connect 6-core MIPS64
processors with the help of a built-in DMA engine which talks
with L2 caches and I/O devices.
In the TILE architecture the on-chip inter-connect is software
defined. The switch processors can be programmed to adapt the
network topology to varying demands. In this way, the TILE
architecture can maintain flexibility and
scalability at the same time. It's unlikely that "topology
optimized" super computers can achieve both in balance.
The iMesh API is provided for finer control over the
on-chip network. Cores can be partitioned into groups which
work in parallel as if they were islands. This feature is
implemented by the switch network programmability.
With the help of "topology-aware" and "cache attribute
aware" TLB entries, iMesh plays a central role in cache
coherency.
2.7. Cache design and feature
Each core has a 32KB iCache, a 32KB dCache and a 256KB i/d
combined L2 cache. Both L1 caches are VIPT
"Virtual Index and Physical Tag."
L1 iCache   32KB, 2 way associative, 64B line size.
L1 dCache   32KB, 2 way associative, 64B line size,
            write-through.
L2 cache    256KB, i/d combined, 8 way associative,
            64B line size, write-back.
Table 2-4 cache characteristics
Some TILE processor literature mentions a "coherent L3
cache." It's somewhat imprecise. The L3 functionality is
achieved by a group of L2 caches. The scheme is called
"cache homing." It works as follows.
The TILE L1 cache is inclusive to L2. L1 holds a subset of the
L2 contents at any moment. An L2 miss happens when the
requested cache line is not found in L2. The core then asks
the "neighboring cores", which are grouped by HV for a single
OS instance, for the missing cache line. If
found there, the cache line is transferred to the requester's
L2 cache.
Foreign L2 caches work as an extension of the local L2
cache. In other words, a group of cores share their L2
cache contents with each other. This scheme is named "cache
homing" and Tilera calls the group of L2 caches a
"coherent L3" cache. The 36 core Gx processor has a "9MB
coherent L3" = 36x 256KB L2.
Cache lines can be populated sparsely among different L2
caches to improve cache efficiency. L3 cache homing is one of
the page attributes. It's controllable on a per-page basis.
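The cache figures quoted in this section can be cross-checked with simple arithmetic. The Python sketch below derives the number of sets of each cache from its size, associativity and line size, and the aggregate "coherent L3" capacity; the function names are hypothetical and the code uses only the numbers given in the text.

```python
KB = 1024


def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set associative cache."""
    return size_bytes // (ways * line_bytes)


def coherent_l3_bytes(cores, l2_bytes=256 * KB):
    """Aggregate capacity of the L2 caches grouped by cache homing."""
    return cores * l2_bytes
```

For example, coherent_l3_bytes(36) evaluates to 9 x 1024 KB, matching the "9MB coherent L3" figure for the 36 core model.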
3. TILE-Gx on-chip devices
integrated multiple DDR memory controller
There are 2 controllers in the 36 / 16 core models and 1 in the
9 core model. The 64 core model of TILEPro, the predecessor of
TILE-Gx, had four DDR2 memory controllers on chip. With a dual
controller configuration, memory can be driven in
interleaved fashion.
mPIPE packet classifier
It's a programmable intelligent packet engine. It offers a
"frame parse" function to run "sieve-to-forward"
classification on the incoming Ethernet frame stream at line
speed. mPIPE is tightly integrated with the GbE / 10G
Ethernet network interfaces.
- 4x 10G ports are available in the 36 core model.
- Each port can be reprogrammed to host 4x GbE
  network interfaces.
- GbE-only ports are also available in the 16 / 9 core
  models.
mPIPE has local buffer memory to handle incoming and
outgoing Ethernet frames. mPIPE can perform load
balancing to distribute ingress frames to cores.
A core binds the mPIPE device register set to a particular
virtual address with a designated dTLB entry for control.
mPIPE in turn holds an I/O TLB entry to access data
which resides in the target (accelerating application or guest
OS) address space so that it can perform the VA to PA
translation for frame data and accompanying descriptors.
mPIPE has its own 32bit instruction set. A special GCC
toolchain is provided to program it.
- Two or three operand instructions.
- 32x 32bit register file: 22 of them are general
  purpose.
- Private SPR registers accessed with mfspr and mtspr.
MiCA crypto and compression engine
It's a standalone computing processor populated inside
TILE-Gx. Multiple MiCA processors are on a single Gx.
MiCA can copy data while encryption and compression
are in action. It's a streaming operation.
A core binds the MiCA device register set to a particular
virtual address with a designated dTLB entry for control.
MiCA in turn holds an I/O TLB entry to access data
which resides in the target address space so that it can
perform the VA to PA translation for crypto / compression
data.
Conventional I/O devices
There are some conventional I/O devices like PCIe,
USB2.0 and I2C/SPI in our porting target computer.
The PCIe controller works in either root-complex (host) or
end-point (device) mode. USB2.0 is used for multiple
purposes. It works as a virtual console during
development and debugging. It can also inject a binary image
into the Gx processor to run. The binary image consists of
boot programs, the HV image and the guest OS in a predefined
format.
4. Tilera Hypervisor
HV utilizes the TILE protection level feature. The guest OS has
heavily limited access to SPR registers. Only a handful
of SPR registers are allowed to be used by the guest OS.
HV is populated in the 1MB area in the upper 2TB space
with hardwired TLB entries.
HV has great control over the entire TILE processor
complex. HV arranges cores into groups which are managed
in an M x N rectangle shape to form an OS instance.
HV assigns I/O devices to particular instances with I/O
TLB entries. Tilera calls the scheme MMIO "memory
mapped IO" while SPARC names it "IOMMU."
HV allows several guest OSes to run simultaneously.
Device and core grouping is defined in an HV configuration
at machine startup. Because of this, HV is not yet
as flexible as what Xen can do these days.
Two serial ports are provided in the Gx processor. HV can
dynamically bind one of the serial ports to a running OS
instance as its console.
BME (bare metal environment)
BME is an API to build a "light weight monitor" which
runs a special purpose "driver" on designated core(s) for
data-plane processing. In general any BME program
needs an accompanying fully-featured OS, like Linux, as a
control plane to manage the whole software complex.
The iMesh messaging facility API is used by the control OS to
communicate with BME programs which run on separate
core(s).
Several code examples are provided by Tilera:
- One TILE core runs an "encryption server" on BME
  while Linux acts as a "client" which receives the results
  from BME. In this example data transfer is done in a
  shared page with the help of UDN messaging between
  the two.
- A number of Linux processes get private cores to run on
  and communicate with each other with UDN messaging
  and shared pages.
5. NetBSD/tile
This port is based on the NetBSD 6.0 STABLE code set. It's
a 64bit SMP kernel and 64bit userland. The kernel runs as
a guest OS in conjunction with Tilera Hypervisor.
NetBSD/tile uses GCC 4.6.3 ported by Tilera. We have
been using it as it is. As GCC 4.5 is still in use in the
NetBSD 6.0 code set, we integrated GCC 4.6.3 to start.
The 64bit pmap was implemented from scratch. It's modeled
after the Alpha pmap. Although TILE-Gx offers 13 different
page sizes, HV employs a much humbler page size
selection. We chose the 64KB page for NetBSD/tile as it is
parallel to the Tilera Linux VM implementation. The virtual
address partitioning is "10 + 8 + 8 + 16."
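The "10 + 8 + 8 + 16" partitioning adds up to the 42 effective address bits, with the low 16 bits matching the 64KB page. The Python sketch below splits an address accordingly. Reading the first three fields as indexes of a three level table, in the style of the Alpha pmap, is our assumption for the illustration, and the names are hypothetical.

```python
LEVEL_BITS = (10, 8, 8)   # table index fields, top level first (assumed)
OFFSET_BITS = 16          # byte offset inside a 64KB page
VA_BITS = sum(LEVEL_BITS) + OFFSET_BITS


def split_va(va):
    """Return ((top, middle, leaf) indexes, page offset) of an address."""
    va &= (1 << VA_BITS) - 1
    offset = va & ((1 << OFFSET_BITS) - 1)
    rest = va >> OFFSET_BITS
    indexes = []
    for bits in reversed(LEVEL_BITS):     # peel fields off from the bottom
        indexes.append(rest & ((1 << bits) - 1))
        rest >>= bits
    return tuple(reversed(indexes)), offset
```

With this split the top level table would have 1024 entries and the lower two levels 256 entries each.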
NetBSD/tile utilizes the SMP ready NetBSD kernel internals
as much as possible. NetBSD 5 introduced much more
sophisticated kernel constructs and API sets which are
effective and useful for a scalable SMP OS. Since then
gradual streamlining has been done for the fore-running SMP
NetBSD ports. Now NetBSD is a mature platform to
jump start a fresh SMP port.
The following is the typical set of useful SMP APIs:
- atomic_ops(3)
- kcpuset(9)
- xcall(9)
The first group must be implemented in an early kernel
porting stage. In most cases it has to be written in
assembler code to best suit the particular processor
nature. The latter two are pure software constructs written
in plain C code.
The parallel programming model is NetBSD pthread.
NetBSD pthread is well organized to adapt to various
processors with minimum effort. We did not make any
particular modification for TILE-Gx support. It works
just like any other pthread implementation, such as the one
in Tilera Linux.
A very limited number of assembler files have been written so
far. There is only one for the kernel: "locore.S". The file
contains 4 well defined major routines:
- CPU startup for the primary core and secondary cores
- Exception entry / dispatch / return
- CPU context switch
- Fast software interrupt dispatch / return
Other assembler files are for libraries and a few
application programs like rtld(1). The following is the list
of major TILE-Gx dependent directories of concern.
- src/common/lib/libc/arch/
- src/lib/libc/arch/
- src/libexec/rtld/arch/
5.1. Key design decisions
In this section we concisely describe some design
decisions made to realize the port.
- struct trapframe, struct switchframe and struct pcb.
- USPACE to hold the kernel stack and struct pcb.
- pmap(9) to interface the processor with NetBSD VM.
- Exception and interrupt handling which complies with the
  target processor's design intent.
- IPI "inter-processor interrupt" which is essential to
  make SMP possible.
struct trapframe is a snapshot image of the runtime context.
One trapframe is always created at the high end address
of USPACE. The actual kernel stack starts just below it and
grows downward. The reserved trapframe area is for the user
process context. Whenever a user process gets interrupted
by an exception or device notification, the trapframe
records the user context to resume later. This area is also
used for system calls. While in kernel mode, the kernel gets
interrupted for the same reasons as a user mode process
does. On that occasion, a trapframe is created and pushed on
the kernel stack.
The TILE architecture has a 64x register file. 8 of them are
not a part of the process context and are to be excluded. We
chose 64x 64bit = 512B as the size of struct trapframe anyway.
In the vacant fields we place some extra context for the
process to retain. They are the exception return address, the
status register value at the time when the exception happened,
the offending exception type and the value of a certain SPR,
"CmpValue" indeed, for the cmpexch instruction.
struct switchframe is for the CPU context switch. NetBSD
defines two context switch routines. cpu_switchto(9) and
lwp_return(9) are the routines which perform the context switch.
The TILE architecture has a large set of caller-saved registers.
Our switchframe is 25x 8B = 200B in size.
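The two frame sizes can be reproduced with a layout model. The Python sketch below uses ctypes purely as a sizing tool; the field names, their order and the four spare slots are hypothetical, since the actual struct definitions are not shown in this paper. Only the slot counts (64 and 25 words of 8 bytes) come from the text.

```python
import ctypes


class Trapframe(ctypes.Structure):
    """Assumed layout: 56 saved registers plus 8 formerly vacant slots."""
    _fields_ = [
        ("regs", ctypes.c_uint64 * 56),    # 64 registers minus the 8 excluded
        ("pc", ctypes.c_uint64),           # exception return address
        ("status", ctypes.c_uint64),       # status register at exception time
        ("cause", ctypes.c_uint64),        # offending exception type
        ("cmpvalue", ctypes.c_uint64),     # CmpValue SPR for cmpexch
        ("spare", ctypes.c_uint64 * 4),    # remaining vacant slots (assumed)
    ]


class Switchframe(ctypes.Structure):
    """25 words saved across a context switch."""
    _fields_ = [("regs", ctypes.c_uint64 * 25)]
```

ctypes.sizeof() applied to the two classes gives 512 and 200 bytes respectively, the sizes stated above.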
struct pcb is one of the longest survivors among UNIX kernel
primitives. It has got smaller than it used to be as the way
to run a context switch was made smarter. Our struct pcb
is just small enough to hold struct switchframe and a bit
extra.
USPACE size is 64KB, aligned with the NetBSD/tile page
size.
5.2. ASID management
ASID management is modeled after the way used for
NetBSD/alpha and NetBSD/mips. In this section we
explain it in more detail.
The kernel has a variable for the "ASID generation number" to
make sure a unique ASID is assigned to each process running on
the processor. It's the central idea. Our ASID management
algorithm works in this way.
- pmap_activate(9), one of the NetBSD kernel APIs,
  switches the processor's current ASID value whenever a
  new process is ready to take control.
- Switching the current ASID is a light weight operation
  for the OS as it eliminates the necessity of a TLB flush at
  every context switch. ASID-less processors need to
  perform a whole scale TLB invalidation to discard
  all entries at every context switch. As the TLB works as
  a cache for VM address translation, a TLB flush severely
  hurts the TLB hit ratio, spoiling VM performance.
  ASID-aware processors just need to switch the current
  ASID value. Changing the processor's current ASID can
  be considered as switching to the imaginary TLB store
  which exists for each ASID value.
- Every new born process has no ASID assigned.
  pmap_activate() chooses a new one which has never been
  allocated so far and assigns it to the process.
  pmap_activate() also records the current ASID
  generation number in the process's pmap store.
- ASID is a small number which counts only up to 255. If
  pmap_activate() finds the 8bit range exhausted, it
  bumps the ASID generation number in the kernel variable
  by 1 and chooses a new ASID wrapped to the least
  available number (normally 1 as ASID 0 is reserved
  for the NetBSD kernel pmap). On this occasion, the kernel
  makes a full scale TLB invalidation to discard all TLB
  entries.
- Whenever pmap_activate() is about to switch the current
  ASID, it checks whether the ASID generation number in the
  kernel variable matches the process's generation number
  recorded at ASID assignment. If they differ, it means
  the process's ASID is no longer valid.
  pmap_activate() selects and assigns a fresh ASID to
  the process, recording the current ASID generation
  number too.
At any given moment every running process has its own
unique ASID. The generation number scheme greatly reduces
the necessity of full scale TLB invalidation.
A TLB flush only happens when the ASID range
runs out and the ASID generation number is bumped.
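The generation number scheme described above can be modeled in a few lines. The following Python sketch is an illustration, not the NetBSD/tile pmap code: the class names and the full_flushes counter are hypothetical, and a single allocator stands in for whatever per-kernel or per-core state the real implementation keeps.

```python
ASID_KERNEL = 0   # reserved for the kernel pmap
ASID_MAX = 255    # largest value of the 8bit ASID


class Pmap:
    def __init__(self):
        self.asid = None         # a new born process has no ASID
        self.generation = None   # generation recorded at assignment


class AsidAllocator:
    def __init__(self):
        self.generation = 1
        self.next_asid = ASID_KERNEL + 1
        self.current_asid = ASID_KERNEL
        self.full_flushes = 0    # counts full scale TLB invalidations

    def pmap_activate(self, pmap):
        """Make pmap current, assigning a fresh ASID when needed."""
        stale = pmap.asid is None or pmap.generation != self.generation
        if stale:
            if self.next_asid > ASID_MAX:
                # ASID range exhausted: bump the generation, wrap around,
                # and discard all TLB entries.
                self.generation += 1
                self.next_asid = ASID_KERNEL + 1
                self.full_flushes += 1
            pmap.asid = self.next_asid
            pmap.generation = self.generation
            self.next_asid += 1
        self.current_asid = pmap.asid
        return pmap.asid
```

In this model 255 processes can be activated before the first full flush; after the bump, each older process silently picks up a new ASID the next time it is activated, with no further flush.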
5.3. TLB shootdown
TLB shootdown is an essential operation in any SMP
kernel. Like the processor cache, the TLB is a resource local
to a processor core. The way to invalidate the local cache or
local TLB is provided by a certain mechanism. In
general invalidating a remote TLB is as hard to achieve as
invalidating a remote cache.
In an SMP system, a TLB invalidate operation must be
propagated to the multiple cores which have been running a
particular process. The process's pmap must maintain a
"processor set" to track which cores have run it. Here
goes the explanation of remote TLB shootdown by ASID
bump:
When pmap(9) detects the necessity to invalidate one or
more TLB entries of a particular process, the kernel needs to
run the invalidate operation for both:
- the "local" core which happens to run the kernel on
  behalf of pmap() at that very moment, and
- all of the "remote" cores which the process's pmap is
  aware of.
The latter operation is named "TLB shootdown." It's
implemented with IPI. It triggers a remote core action by an
inter-core message. TLB shootdown logic can be built
with the help of the xcall(9) "cross call" kernel API.
Smart ASID management can achieve remote TLB
invalidation at a small cost.
- Mark the ASID in the offending process's pmap store
  "unassigned."
- Broadcast an xcall(9) message to the remote cores,
  triggering IPI.
- When one of the cores is about to run the process at the
  next scheduling, pmap_activate() will choose and
  assign a fresh ASID to the offending process. The stale
  TLB entries with the abandoned ASID get invalidated at
  once.
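The three steps above can be sketched as a toy model. This Python code is an illustration under stated assumptions: all names are hypothetical, a plain loop stands in for the xcall(9) broadcast and IPI delivery, one ASID per pmap is shared by all cores, and ASID wrap-around is left out for brevity.

```python
class Pmap:
    def __init__(self):
        self.asid = None     # None stands for "unassigned"
        self.cpus = set()    # "processor set": cores that have run it


class Kernel:
    def __init__(self, ncores):
        self.next_asid = 1
        self.tlb = [dict() for _ in range(ncores)]   # (asid, vpn) -> ppn
        self.running = [None] * ncores               # active pmap per core

    def pmap_activate(self, core, pmap):
        if pmap.asid is None:
            pmap.asid = self.next_asid               # fresh ASID
            self.next_asid += 1
        pmap.cpus.add(core)
        self.running[core] = pmap

    def tlb_fill(self, core, vpn, ppn):
        self.tlb[core][(self.running[core].asid, vpn)] = ppn

    def tlb_lookup(self, core, vpn):
        return self.tlb[core].get((self.running[core].asid, vpn))

    def shootdown(self, pmap):
        """Invalidate all TLB entries of pmap on every core by ASID bump."""
        targets = sorted(pmap.cpus)
        pmap.asid = None                  # step 1: mark "unassigned"
        pmap.cpus.clear()
        for core in targets:              # step 2: stands in for xcall/IPI
            if self.running[core] is pmap:
                self.pmap_activate(core, pmap)   # step 3: fresh ASID
        return targets
```

The stale entries are never searched for or removed one by one: they stay in each TLB under the abandoned ASID, where no lookup can match them any more.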
5.4. Useful SMP facilities in NetBSD 6
The SMP NetBSD kernel provides ways to manage CPUs at a
finer grain. There is a less known set of useful commands.
Let us mention them in brief.
cpuctl(8) ... try "/usr/sbin/cpuctl list" on your modern
Intel computer. It shows the list of CPU states which tells
online / offline.
psrset(8) ... try "/usr/sbin/psrset -p" on your modern Intel
computer. It can create an arbitrary number of "processor
sets" which can be bound to any process. CPU affinity is
made possible by processor set binding. It would be
possible to bind a processor set to a kthread (kernel
thread) which runs a specific kernel subsystem like GbE
and/or disk drivers.
schedctl(8) ... try "/usr/sbin/schedctl -p 1" on your
modern Intel computer.
It assigns one of the predefined scheduling policies to a
process. It replaces the nice(1) and renice(8) priority
control commands.
Three different scheduling policies are provided by
NetBSD so far.
- Time-sharing, which follows the traditional UNIX
  semantics used for a long time.
- First-in, First-out
- Round-robin
5.5. Future development
This project is active. Here we try to summarize
missing functionalities and future development in
arbitrary order.
Soon to use TILE-Gx native FP instructions. Currently
the entire NetBSD including userland is built with the
"-DSOFTFLOAT" compile option.
Drivers for some conventional PCIe devices like SATA
and/or 100M Ethernet NICs. Currently the whole system
code image is injected with the USB debugging facility to
run in an NFS diskless configuration.
iMesh communication API for NetBSD. It remains under
research. For now there is no provision to utilize iMesh
programming.
MiCA integration with a proper API. The NetBSD kernel has the
pcu(9) "per-CPU-unit" framework. It's for the
encapsulation of a CPU's hardware context to save /
restore. It handsomely covers the cases beyond the
general purpose registers. The typical usage of pcu() is to
manipulate the FPU register set. We're considering whether
pcu() can integrate multiple MiCA units into the NetBSD
kernel in a sane manner.
NetBSD/xen allows dynamic attach / detach maneuvers
while the kernel is up and running. It allows cores to attach /
detach dynamically and allows block devices to attach /
detach dynamically. We assume it'd be somewhat difficult to
implement similar functionality in TILE; however, it'd be
worth pursuing a way to make them possible.
We're aware that Tilera HV has no provision to start up and
tear down a "targeted" core while up and running. The HV source
code is disclosed as a part of the Tilera MDE development
package. It's said that HV can be extended for a customer's
own needs.
LLVM transition from GCC4 is recognized as mandatory as
it would exploit the potential of the TILE VLIW nature.
6. NetBSD TILE-Gx applications
We focus on compute-intensive markets. We're
considering engaging in SDN, VLDB search engines and
desktop HPC.
SDN (Software Defined Network)
It's the third wave of virtualization technology: server
virtualization, storage virtualization and then network
virtualization. Industry trends predict that routers and
firewalls will vanish soon while they are morphing into
big smart switches.
The Bangalore team is now exploring super fast frame
forwarding algorithms. They are generalized as a
"search-and-lookup" computational complexity reduction
problem. Problem statements are now being defined.
The implementation of the algorithms must be robust enough
to handle the incoming frame stream as fast as it arrives at
wire-speed rate. They must also be robust against
combinatorial explosions of matching rules.
VLDB search engine
These days very large scale DBs are directly connected
to the Internet and work in real time. The
typical case is an SNS like Facebook. "mem-caching" is
now a common tactic to implement a super fast search
engine. We recognize that many-core processors and GPGPU
are now gathering industry attention as they would be a
good vehicle, in the engineering sense, for VLDB search
engines.
Desktop HPC (High Performance Computing)
It's a kind of everlasting human desire to own a super
computer at hand. TILE-Gx can be a handy basis for a
many-core 64bit general purpose computer. It's said that
the next generation of Gx can be extended by wiring
multiple processors with the Interlaken inter-connect. Today
a pair of Tesla GPGPU 16x lane PCIe cards can achieve
Tflops grade computing power. Then, how about making
the twenty first century incarnation of the desktop personal
computer, let's say, one whose looks are just like an SGI
Indigo or a NeXTcube?
7. Conclusion
Porting NetBSD 6.0 to TILE-Gx was found to be easier than
anticipated since NetBSD 6.0 provides SMP ready kernel
constructs and API sets to use. The number of lines
written in assembler was very small as the essential parts
of the porting burden are well defined. The VLIW nature of the
processor was not found to be a hurdle.
Acknowledgment
Sanctum Networks wishes to express its gratitude to all
members involved in this project, especially the members
from Japan who contributed critically in the early stages.
