Architecture and Implementation of the ARM Cortex-A8 Microprocessor

ANEESH R

aneeshr2020@gmail.com

1 Introduction
The ARM Cortex-A8 microprocessor is the first applications microprocessor in ARM's new Cortex family. With high performance and power efficiency, it targets a wide variety of mobile and consumer applications including mobile phones, set-top boxes, gaming consoles and automotive navigation/entertainment systems. The Cortex-A8 processor spans a range of performance points depending on the implementation, delivering over 2000 Dhrystone MIPS (DMIPS) of performance for demanding consumer applications and consuming less than 300 mW for low-power mobile devices. This translates into a large increase in processing capability while staying within the power levels of previous generations of mobile devices. Consumer applications will benefit from the reduced heat dissipation and the resulting lower packaging and integration costs. It is the first ARM processor to incorporate all of the new technologies available in the ARMv7 architecture. New technologies seen for the first time include NEON for media and signal processing and Jazelle RCT for acceleration of real-time compilers. Other technologies recently introduced that are now standard on the ARMv7 architecture include TrustZone technology for security, Thumb-2 technology for code density and the VFPv3 floating point architecture.

2 Overview of the Cortex Architecture


The unifying technology of Cortex processors is Thumb-2 technology. The Thumb-2 instruction set combines 16- and 32-bit instructions to improve code density and performance. The original ARM instruction set consists of fixed-length 32-bit instructions, while the Thumb instruction set employs 16-bit instructions. Because not all operations mapped into the original Thumb instruction set, multiple instructions were sometimes needed to emulate the task of one 32-bit instruction. Thumb-2 technology adds about 130 additional instructions to Thumb. The added functionality removes the need to switch between ARM and Thumb modes in order to service interrupts, and gives access to the full set of processor registers. The resulting code maintains the traditional code density of Thumb instructions while running at the performance levels of 32-bit ARM code. Entire applications can now be written in Thumb-2 technology, removing the mode switching that the original architecture required.
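To make the mix of 16- and 32-bit instructions concrete, the following minimal sketch walks a stream of halfwords and reports each instruction's length. It relies on the Thumb-2 encoding rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction; the function names and the sample stream are illustrative and do not come from the paper.

```python
def is_32bit_first_halfword(halfword: int) -> bool:
    """Return True if this halfword begins a 32-bit Thumb-2 instruction."""
    top5 = (halfword >> 11) & 0x1F
    return top5 in (0b11101, 0b11110, 0b11111)


def instruction_lengths(halfwords):
    """Walk a list of 16-bit halfwords; return each instruction's size in bytes."""
    lengths = []
    i = 0
    while i < len(halfwords):
        if is_32bit_first_halfword(halfwords[i]):
            if i + 1 >= len(halfwords):
                raise ValueError("truncated 32-bit instruction")
            lengths.append(4)
            i += 2  # skip the second halfword of the 32-bit encoding
        else:
            lengths.append(2)
            i += 1
    return lengths


# Illustrative stream: BX LR (16-bit), BL <label> (32-bit), MOVS r0, #0 (16-bit)
stream = [0x4770, 0xF000, 0xF800, 0x2000]
print(instruction_lengths(stream))  # [2, 4, 2]
```

Because the length is decided from the first halfword alone, 16- and 32-bit encodings can be freely interleaved without a mode switch.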

An entire application can be written using space-saving Thumb-2 instructions, whereas with the original Thumb mode the processor had to switch between ARM and Thumb modes. Making its first appearance in an ARM processor is the NEON media and signal processing technology targeted at audio, video and 3D graphics. It is a 64/128-bit hybrid SIMD architecture. NEON technology has its own register file and execution pipeline which are separate from the main ARM integer pipeline. It can handle both integer and single precision floating-point values, and includes support for unaligned data accesses and easy loading of interleaved data stored in structure form. Using NEON technology to perform typical multimedia functions, the Cortex-A8 processor can decode MPEG-4 VGA video (including dering, deblock filters and yuv2rgb) at 30 frames/sec at 275 MHz, and H.264 video at 350 MHz.

Also new is Jazelle RCT technology, an architecture extension that cuts the memory footprint of just-in-time (JIT) bytecode applications to a third of their original size. The smaller code size results in a boost in performance and a reduction in power.

TrustZone technology is included in the Cortex-A8 to ensure data privacy and DRM protection in consumer products like mobile phones, personal digital assistants and set-top boxes that run open operating systems. Implemented within the processor core, TrustZone technology protects peripherals and memory against a security attack. A secure monitor within the core serves as a gatekeeper switching the system between secure and non-secure states. In the secure state, the processor runs "trusted" code from a secure code block to handle security-sensitive tasks such as authentication and signature manipulation.

Besides contributing to the processor's signal processing performance, NEON technology enables software solutions to data processing applications. The result is a flexible platform which can accommodate new algorithms and new applications as they emerge with simply the download of new software or a driver.

The VFPv3 technology is an enhancement to the VFPv2 technology. New features include a doubling of the number of double-precision registers to 32, and the introduction of instructions that perform conversions between fixed-point and floating-point numbers.
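The fixed-point to floating-point conversions mentioned above can be illustrated with a minimal sketch of the arithmetic involved. The signed format, the truncation toward zero and the saturation to a 32-bit range are assumptions of this sketch, not a statement of how the VFPv3 instructions behave; the function names are hypothetical.

```python
def float_to_fixed(value: float, frac_bits: int, width: int = 32) -> int:
    """Scale by 2**frac_bits, truncate toward zero, saturate to a signed range.

    Truncation and saturation are assumptions made for this illustration.
    """
    scaled = int(value * (1 << frac_bits))  # int() truncates toward zero
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))


def fixed_to_float(raw: int, frac_bits: int) -> float:
    """Interpret a signed integer as a fixed-point value with frac_bits fraction bits."""
    return raw / (1 << frac_bits)


q = float_to_fixed(1.5, 16)
print(hex(q))                 # 0x18000
print(fixed_to_float(q, 16))  # 1.5
```

Doing this scaling in a single instruction avoids a separate multiply by a power of two around each integer conversion, which is the kind of saving such instructions are presumably meant to provide.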

3 Exploring Features of the Cortex-A8 Microarchitecture

The Cortex-A8 processor is the most sophisticated low-power design yet produced by ARM. To achieve its high levels of performance, new microarchitecture features were added which are not traditionally found in the ARM architecture, including a dual in-order issue ARM integer pipeline, an integrated L2 cache and a deep 13-stage pipe.

Fig: ARM Cortex-A8 architecture

3.1 Superscalar pipeline


Perhaps the most significant of these new features is the dual-issue, in-order, statically scheduled ARM integer pipeline. Previous ARM processors have only a single integer execution pipeline. The ability to issue two data processing instructions at the same time significantly increases the maximum potential instructions executed per cycle. It was decided to stay with in-order issue to keep the additional power required to a minimum. Out-of-order issue and retire can require extensive amounts of logic consuming extra power. The choice to go with in-order also allows for fire-and-forget instruction issue, thus removing critical paths from the design and reducing the need for custom design in the pipeline. Static scheduling allows for extensive clock gating for reduced power during execution. The dual ALU (arithmetic logic unit) pipelines (ALU 0 and ALU 1) are symmetric and both can handle most arithmetic instructions. ALU pipe 0 always carries the older of a pair of issued instructions. The Cortex-A8 processor also has multiplier and load-store pipelines, but these do not carry additional instructions to the two ALU pipelines. These can be thought of as "dependent" pipelines. Their use requires simultaneous use of one of the ALU pipelines. The multiplier pipeline can only be coupled with instructions that are in the ALU 0 pipeline, whereas the load-store pipeline can be coupled with instructions in either ALU.
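The pairing rules described above can be sketched as a small in-order issue model. This is a simplified illustration, not the actual issue logic: it models only the structural rules stated in the text (the older instruction goes to ALU 0, and a multiply must couple with ALU 0), assumes a single load-store pipeline so that two memory operations cannot pair, and ignores data dependencies entirely. All names are hypothetical.

```python
from dataclasses import dataclass

ALU, MUL, LDST = "alu", "mul", "ldst"


@dataclass
class Instr:
    name: str
    kind: str  # one of ALU, MUL, LDST


def can_pair(older: Instr, younger: Instr) -> bool:
    """Structural check only; data hazards are deliberately not modelled."""
    # The younger instruction always lands in ALU 1, but the multiplier
    # can only be coupled with ALU 0, so a younger multiply cannot pair.
    if younger.kind == MUL:
        return False
    # Assumption of this sketch: only one load-store pipeline exists.
    if older.kind == LDST and younger.kind == LDST:
        return False
    return True


def issue(program):
    """Group an in-order instruction stream into packets of one or two."""
    packets = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            packets.append((program[i].name, program[i + 1].name))
            i += 2
        else:
            packets.append((program[i].name,))
            i += 1
    return packets


program = [
    Instr("add", ALU), Instr("mul", MUL), Instr("ldr", LDST),
    Instr("sub", ALU), Instr("mla", MUL), Instr("str", LDST),
]
print(issue(program))
# [('add',), ('mul', 'ldr'), ('sub',), ('mla', 'str')]
```

In this model a multiply dual-issues only when it is the older instruction of a pair, which is why static scheduling by the compiler or programmer matters for an in-order design.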

3.2 Branch prediction


The 13-stage pipeline was selected to enable significantly higher operating frequencies than previous generations of ARM microarchitectures. Note that stage F0 is not counted because it is only address generation. To minimize the branch penalties typically associated with a deeper pipeline, the Cortex-A8 processor implements a two-level global history branch predictor. It consists of two different structures: the Branch Target Buffer (BTB) and the Global History Buffer (GHB), which are accessed in parallel with instruction fetches. The BTB indicates whether or not the current fetch address will return a branch instruction and its branch target address. It contains 512 entries. On a hit in the BTB a branch is predicted and the GHB is accessed. The GHB consists of 4096 2-bit saturating counters that encode the strength and direction information of branches. The GHB is indexed by a 10-bit history of the direction of the last ten branches encountered and 4 bits of the PC. In addition to the dynamic branch predictor, a return stack is used to predict subroutine return addresses. The return stack has eight 32-bit entries that store the link register value in r14 (register 14) and the ARM or Thumb state of the calling function. When a return-type instruction is predicted taken, the return stack provides the last pushed address and state.
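A minimal sketch of the GHB idea follows: a table of 4096 2-bit saturating counters selected by global branch history and a few PC bits. The text does not say how the 10 history bits and 4 PC bits are combined into a 12-bit index, so the hash used here (history shifted left by two, XORed with the PC bits) is purely an assumption, as are the choice of PC bits, the initial counter value and the class and method names.

```python
GHB_ENTRIES = 4096   # from the text: 4096 2-bit counters
HISTORY_BITS = 10    # from the text: direction of the last ten branches
PC_BITS = 4          # from the text: 4 bits of the PC


class GlobalHistoryPredictor:
    def __init__(self):
        # 2-bit counters: 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * GHB_ENTRIES
        self.history = 0

    def _index(self, pc: int) -> int:
        # Hypothetical index hash; the real combination is not given.
        pc_bits = (pc >> 2) & ((1 << PC_BITS) - 1)
        return ((self.history << 2) ^ pc_bits) % GHB_ENTRIES

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        # Shift the outcome into the 10-bit global history register.
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


predictor = GlobalHistoryPredictor()
for _ in range(20):                # train on an always-taken loop branch
    predictor.update(0x1000, True)
print(predictor.predict(0x1000))   # True
```

The saturating counters give hysteresis: a single mispredicted iteration moves a strongly-taken counter from 3 to 2 without flipping the prediction.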

3.3 Level-1 cache


The Cortex-A8 processor has a single-cycle load-use penalty for fast access to the Level-1 caches. The data and instruction caches are configurable to 16 KB or 32 KB. Each is 4-way set associative and uses a Hash Virtual Address Buffer (HVAB) way prediction scheme to improve timing and reduce power consumption. The caches are physically addressed (virtual index, physical tag) and have hardware support for avoiding aliased entries. Parity is supported with one parity bit per byte.

The write policy for the data cache is write-back with no write-allocate. Also included is a store buffer for data merging before writing to main memory. The HVAB is a novel approach to reducing the power required for accessing the caches. It uses a prediction scheme to determine which way of the RAM to enable before an access.
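As a worked illustration of the Level-1 geometry, the sketch below splits an address into tag, set index and line offset for a 4-way cache of 16 KB or 32 KB. The text does not state the cache line length, so the 64-byte line used as a default here is an assumption, and the function names are hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, size_bytes: int, ways: int = 4, line_bytes: int = 64):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(cache_geometry(16 * 1024, 4, 64))    # (64, 6, 6)
print(cache_geometry(32 * 1024, 4, 64))    # (128, 6, 7)
print(split_address(0x12345, 16 * 1024))   # (18, 13, 5)
```

With the 32 KB configuration each way covers 8 KB, so the index and offset together need 13 bits. Assuming 4 KB pages, one index bit then comes from the virtual page number, which is one reason a virtually indexed cache needs the alias handling mentioned above.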

3.4 Level-2 cache

The Cortex-A8 processor includes an integrated Level-2 cache. This gives the Level-2 cache a dedicated low latency, high bandwidth interface to the Level-1 cache. This minimizes the latency of Level-1 cache linefills and does not conflict with traffic on the main system bus. It can be configured in sizes from 64 KB to 2 MB. The Level-2 cache is physically addressed and 8-way set associative. It is a unified data and instruction cache, and supports optional ECC and parity. Write-back, write-through, and write-allocate policies are followed according to page table settings. A pseudo-random allocation policy is used. The contents of the Level-1 data cache are exclusive with the Level-2 cache, whereas the contents of the Level-1 instruction cache are a subset of the Level-2 cache. The tag and data RAMs of the Level-2 cache are accessed serially for power savings.

3.5 NEON media engine


The Cortex-A8 processor's NEON media processing engine pipeline starts at the end of the main integer pipeline. As a result, all exceptions and branch mispredictions are resolved before instructions reach it. More importantly, there is a zero load-use penalty for data in the Level-1 cache. The ARM integer unit generates the addresses for NEON loads and stores as they pass through the pipeline, thus allowing data to be fetched from the Level-1 cache before it is required by a NEON data processing operation. Deep instruction and load-data buffering between the NEON engine, the ARM integer unit and the memory system allows the latency of Level-2 accesses to be hidden for streamed data. A store buffer prevents NEON stores from blocking the pipeline and detects address collisions with the ARM integer unit accesses and NEON loads.

The NEON unit is decoupled from the main ARM integer pipeline by the NEON instruction queue (NIQ). The ARM Instruction Execute Unit can issue up to two valid instructions to the NEON unit each clock cycle. NEON has 128-bit wide load and store paths to the Level-1 and Level-2 cache, and supports streaming from both. The NEON media engine has its own 10-stage pipeline that begins at the end of the ARM integer pipeline. Since all mispredicts and exceptions have been resolved in the ARM integer unit, once an instruction has been issued to the NEON media engine it must be completed as it cannot generate exceptions.

NEON has three SIMD integer pipelines, a load-store/permute pipeline, two SIMD single-precision floating-point pipelines, and a non-pipelined Vector Floating-Point unit (VFPLite). NEON instructions are issued and retired in-order. A data processing instruction is either a NEON integer instruction or a NEON floating-point instruction. The Cortex-A8 NEON unit does not parallel issue two data-processing instructions, to avoid the area overhead of duplicating the data-processing functional blocks, and to avoid the timing critical paths and complexity overhead associated with the muxing of the read and write register ports.

The NEON integer datapath consists of three pipelines: an integer multiply/accumulate pipeline (MAC), an integer shift pipeline, and an integer ALU pipeline. A load-store/permute pipeline is responsible for all NEON loads and stores, data transfers to and from the integer unit, and data permute operations such as interleave and de-interleave. The NEON floating-point (NFP) datapath has two main pipelines: a multiply pipeline and an add pipeline.

The separate VFPLite unit is a non-pipelined implementation of the ARM VFPv3 Floating Point Specification targeted at medium performance IEEE 754 compliant floating point support. VFPLite is used to provide backwards compatibility with existing ARM floating point code and to provide IEEE 754 compliant single and double precision arithmetic. The "Lite" refers to area and performance, not functionality.
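The interleave and de-interleave permute operations handled by the load-store/permute pipeline can be illustrated with plain lists. This sketch only shows the data movement (for example, separating packed RGB pixels into one plane per channel); it says nothing about how the hardware performs it, and the function names are illustrative.

```python
def deinterleave(data, channels: int):
    """Split interleaved elements (e.g. R,G,B,R,G,B,...) into one list per channel."""
    if len(data) % channels:
        raise ValueError("data length must be a multiple of the channel count")
    return [data[c::channels] for c in range(channels)]


def interleave(planes):
    """Inverse operation: merge per-channel lists back into one interleaved list."""
    n = len(planes[0])
    if any(len(p) != n for p in planes):
        raise ValueError("all planes must have the same length")
    return [p[i] for i in range(n) for p in planes]


rgb = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # three packed RGB pixels
planes = deinterleave(rgb, 3)
print(planes)              # [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
print(interleave(planes))  # [10, 20, 30, 11, 21, 31, 12, 22, 32]
```

Once the channels sit in separate vectors, the same SIMD operation can be applied to every red, green or blue value at once, which is why loading data "stored in structure form" directly into de-interleaved registers is useful for media code.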

4 Implementation
Because of the aggressive performance, power, and area (PPA) targets of the Cortex-A8 processor, new implementation flows have been developed in order to meet goals without resorting to a full-custom implementation. The resulting flows enable fine tuning of the design to the desired application. The result is fundamentally a cell-based flow, but under it lie semi-custom techniques that have been used where necessary to meet performance. The Cortex-A8 processor uses a combination of synthesized, structured, and custom circuits. The design was divided into seven functional units and then subdivided into blocks, and the appropriate implementation technique was chosen for each. Since the entire design is synthesizable, blocks that can easily meet their PPA goals can stick with a standard synthesis flow. A structured flow is used for blocks which contain logic that can take advantage of a controlled placement and routing approach to meet timing or area goals. This approach is a semi-custom flow that manually maps the block into a gate-level netlist and specifies a relative placement for all the cells in the block. The relative placement does not specify the exact locations of the cells but how each cell is placed with respect to the other cells in the block. The structured approach is typically used for data blocks that have a regular structure. The logic implementation and technology mapping of the block is done manually to maintain the regular data-oriented bus structure of the block instead of generating a random gate structure through synthesis. The logic gates of the block are placed according to the flow of data through the block. This approach offers more control over the design than an automated synthesis approach and leads to a more predictable timing closure. It is also possible to get better performance and area on complex, high-performance designs than with traditional techniques.
The resulting netlists may be interpreted with traditional tiling techniques using the ARM Artisan Advantage-CE or a compatible library. The Artisan Advantage-CE library contains more than a thousand cells. Besides the standard cells used in typical synthesis libraries, many tactical cells are included that are more in line with custom implementation techniques. These are used in an automated fashion in the structured flow. The library is specifically designed to deal with the high-density routing requirements of high-performance processors with a focus on both high speed operation and low static and dynamic power. Leakage reduction is achieved through power gating MT-CMOS cells and retention flip-flops to support sleep and standby modes. ARM has worked with tool vendors to ensure support for this critical new flow. Finally, a few of the most critical timing and area sensitive blocks of the design are reserved for full custom techniques. This includes memory arrays, register files and scoreboards. These blocks contain a mix of static and dynamic logic. No self-timed circuits are used.

5 Conclusion
The Cortex-A8 processor is the fastest, most power-efficient microprocessor yet developed by ARM. With the ability to decode VGA H.264 video in under 350 MHz, it provides the media processing power required for next generation wireless and consumer products while consuming less than 300 mW in 65 nm technologies. Its new NEON technology provides a platform for flexible software-based solutions for media processing. Thumb-2 instructions provide code density while maintaining the performance of standard ARM code; Jazelle RCT technology does likewise for realtime compilers. TrustZone technology provides security for sensitive data and DRM. Many significant new microarchitecture features make their first appearance on the Cortex-A8 processor. These include a dual issue, in-order superscalar pipeline, an integrated Level-2 cache and a significantly deeper pipeline than previous ARM processors. To meet its aggressive performance targets while maintaining ARM's traditional small power budget, new flows have been developed which approach the efficiency of custom techniques while keeping the flexibility of an automated flow. The Cortex-A8 processor is a quantum jump in flexible low power, high-performance processing.
