Advanced cache optimizations

ECE154B
Dmitri Strukov

Advanced Cache Optimization


1) Way prediction
2) Victim cache
3) Pipelined cache
4) Nonblocking cache
5) Multibanked cache
6) Critical word first and early restart
7) Merging write buffer
8) Compiler optimizations
9) Prefetching

#1: Way Prediction
How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access.
The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data.
Miss: first check the other blocks for matches in the next clock cycle.

[Figure: access timeline showing Hit Time, Way-Miss Hit Time, and Miss Penalty]

Accuracy: 85%
Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
Used for instruction caches vs. L1 data caches
Also used on the MIPS R10K for the off-chip L2 unified cache, with the way-prediction table on chip
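
The mechanism above can be illustrated with a toy simulation. This is a minimal sketch, assuming a 2-way set-associative cache, one predicted-way entry per set that is updated to the way used most recently, and LRU replacement. The class name `WayPredictedCache` and the result labels `fast_hit`, `slow_hit`, and `miss` are hypothetical and not from the slides.

```python
class WayPredictedCache:
    """Toy 2-way set-associative cache with one predicted way per set."""

    def __init__(self, num_sets=4):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]
        self.predicted = [0] * num_sets  # extra bits: the way to try first
        self.lru = [0] * num_sets        # the way to replace on a miss

    def access(self, block_addr):
        """Classify one access as 'fast_hit', 'slow_hit', or 'miss'."""
        s = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        guess = self.predicted[s]
        other = 1 - guess
        if self.tags[s][guess] == tag:
            # Only the predicted way's tag is compared: hit in the first cycle.
            way, result = guess, "fast_hit"
        elif self.tags[s][other] == tag:
            # Misprediction: the other way is checked in the next cycle.
            way, result = other, "slow_hit"
        else:
            # True miss: fill the least recently used way.
            way, result = self.lru[s], "miss"
            self.tags[s][way] = tag
        self.predicted[s] = way  # predict the same way for the next access
        self.lru[s] = 1 - way    # the untouched way becomes the LRU way
        return result


cache = WayPredictedCache(num_sets=4)
trace = [0, 4, 0, 0, 4, 8]  # all six blocks map to set 0
results = [cache.access(a) for a in trace]
print(results)
```

Alternating between two blocks in the same set produces slow hits, while repeated access to the same block produces fast hits, which is why prediction accuracy matters for the average hit time.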

#2: Victim Cache



Efficient for the thrashing problem in direct-mapped caches
Removes 20%-90% of cache misses to the L1 cache
L1 and victim cache are exclusive
Miss to L1 but hit in VC; miss in L1 and VC
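
The thrashing case can be illustrated with a toy model. This is a minimal sketch, assuming a direct-mapped L1 backed by a small fully associative victim cache with FIFO replacement, where a victim-cache hit swaps the block back into L1 so the two stay exclusive. The class name `DirectMappedWithVictim` and the result labels are hypothetical.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped L1 with an exclusive, fully associative victim cache."""

    def __init__(self, num_lines=4, victim_entries=2):
        self.num_lines = num_lines
        self.lines = [None] * num_lines  # block address held by each L1 line
        self.victim = OrderedDict()      # evicted blocks, oldest first
        self.victim_entries = victim_entries

    def access(self, block_addr):
        """Classify one access as 'l1_hit', 'vc_hit', or 'miss'."""
        idx = block_addr % self.num_lines
        if self.lines[idx] == block_addr:
            return "l1_hit"
        evicted = self.lines[idx]
        if block_addr in self.victim:
            # Miss in L1 but hit in VC: the block leaves the VC for L1.
            del self.victim[block_addr]
            result = "vc_hit"
        else:
            # Miss in both L1 and VC: fetch from the next level.
            result = "miss"
        self.lines[idx] = block_addr
        if evicted is not None:
            # The block displaced from L1 goes into the victim cache.
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # drop the oldest victim
        return result


trace = [0, 4, 0, 4, 0]  # blocks 0 and 4 conflict on L1 line 0
with_vc = DirectMappedWithVictim(num_lines=4, victim_entries=2)
without_vc = DirectMappedWithVictim(num_lines=4, victim_entries=0)
print([with_vc.access(a) for a in trace])
print([without_vc.access(a) for a in trace])
```

With the victim cache, every access after the two compulsory misses is served by the victim cache; without it, every access in this trace misses.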

#3: Pipelining Cache Writes



#4: Nonblocking Cache: Basic Idea

A nonblocking cache, or lockup-free cache, allows the data cache to continue to supply cache hits during a miss
"Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Pentium Pro allows 4 outstanding memory misses (Cray X1E vector supercomputer allows 2,048 outstanding memory misses)

Nonblocking Cache

Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system modeled after the Intel i7 consists of a 32 KB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KB with a 10-clock-cycle access latency. The L3 is 2 MB with a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.

Nonblocking Cache Implementation
Requires out-of-order execution
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires a pipelined or banked memory system (otherwise it cannot support multiple outstanding accesses)

Nonblocking Cache Example
Maximum number of outstanding references to maintain peak bandwidth for a system?

sustained transfer rate 16 GB/sec
memory access 36 ns
block size 64 bytes
50% need not be issued

Answer: $\frac{16 \times 10^{9}}{64} \times 36 \times 10^{-9} \times 2 = 18$
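
The arithmetic in this example can be checked with a short script. This is a minimal sketch: the function name `outstanding_references` and the parameter `issue_factor` are hypothetical. It assumes that "50% need not be issued" means twice as many outstanding references must be supported, which matches the factor of 2 in the answer. Since 1 GB/sec is 1 byte/ns, multiplying GB/sec by ns gives bytes in flight.

```python
def outstanding_references(bandwidth_gb_per_s, latency_ns, block_bytes,
                           issue_factor=1.0):
    """Number of block references in flight needed to sustain the bandwidth.

    bytes in flight = bandwidth (bytes/ns) * latency (ns); dividing by the
    block size gives references in flight. issue_factor scales the result
    for references that cannot make progress (hypothetical parameter).
    """
    bytes_in_flight = bandwidth_gb_per_s * latency_ns
    return bytes_in_flight / block_bytes * issue_factor


peak = outstanding_references(16, 36, 64)
with_collisions = outstanding_references(16, 36, 64, issue_factor=2.0)
print(peak, with_collisions)
```

The first value is the number of references needed when accesses never conflict; the second doubles it for the 50% case.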

#5: Increasing Cache Bandwidth via Multiple Banks
Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
4 in L1 and 8 in L2 for Intel Core i7

Banking
Banking works best when accesses naturally spread themselves across banks; the mapping of addresses to banks affects the behavior of the memory system
A simple mapping that works well is sequential interleaving
Spread block addresses sequentially across banks
E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; ...
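
Sequential interleaving can be written as a one-line mapping. This is a minimal sketch; the function names `bank_of` and `blocks_per_bank` are hypothetical, and the second trace (a stride equal to the bank count) is an illustrative worst case, not taken from the slides.

```python
def bank_of(block_addr, num_banks=4):
    """Sequential interleaving: the bank is the block address modulo the bank count."""
    return block_addr % num_banks


def blocks_per_bank(block_addrs, num_banks=4):
    """Group block addresses by the bank that would serve them."""
    banks = [[] for _ in range(num_banks)]
    for addr in block_addrs:
        banks[bank_of(addr, num_banks)].append(addr)
    return banks


sequential = blocks_per_bank(range(8))       # consecutive blocks
strided = blocks_per_bank(range(0, 32, 4))   # stride of 4 blocks
print(sequential)
print(strided)
```

Consecutive blocks spread evenly over the four banks, while a stride equal to the number of banks sends every access to bank 0, so no accesses can proceed in parallel.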

#6: Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block before restarting the CPU
Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Spatial locality: programs tend to want the next sequential word, so the size of the benefit of just early restart is not clear

Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
Long blocks are more popular today, so Critical Word 1st is widely used
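
The difference between the schemes can be illustrated by counting word transfers before the CPU resumes. This is a minimal sketch, assuming one word arrives per transfer and that critical word first fetches the rest of the block in wraparound order; both assumptions, and the names `arrival_order` and `transfers_before_restart`, are hypothetical and not from the slides.

```python
def arrival_order(words_per_block, requested, critical_word_first):
    """Order in which the words of a block arrive from memory."""
    if critical_word_first:
        # Start at the requested word and wrap around the block.
        return [(requested + i) % words_per_block
                for i in range(words_per_block)]
    return list(range(words_per_block))


def transfers_before_restart(words_per_block, requested, scheme):
    """Word transfers the CPU waits for under each scheme."""
    if scheme == "full_block":
        return words_per_block  # wait for the whole block
    cwf = scheme == "critical_word_first"
    order = arrival_order(words_per_block, requested, cwf)
    # Early restart and critical word first both resume the CPU as soon
    # as the requested word arrives; they differ only in arrival order.
    return order.index(requested) + 1


for scheme in ("full_block", "early_restart", "critical_word_first"):
    print(scheme, transfers_before_restart(8, 5, scheme))
```

For an 8-word block with word 5 requested, early restart still waits for words 0 through 5, while critical word first resumes after a single transfer.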


#7: Merging Write Buffer to Reduce Miss Penalty
A write buffer allows the processor to continue while waiting to write to memory
If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
If so, the new data are combined with that entry
Increases the block size of writes for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory
The Sun T1 (Niagara) processor, among many others, uses write merging

Merging Write Buffer Example

Figure 2.7 To illustrate write merging, the write buffer on top does not use it while the write buffer on the bottom does. The four writes are merged into a single buffer entry with write merging; without it, the buffer is full even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with a valid bit (V) indicating whether the next sequential 8 bytes in this entry are occupied. (Without write merging, the words to the right in the upper part of the figure would only be used for instructions that wrote multiple words at the same time.)

Interesting issue with conflicting design objectives, i.e., ejecting as soon as possible vs. keeping longer for merging
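
The effect of merging on buffer occupancy can be illustrated with a toy model. This is a minimal sketch, assuming a buffer of four entries with four 64-bit words each, single-word writes, no draining to memory during the trace, and entries aligned to 32-byte boundaries when merging. The function name `fill_write_buffer` and the example addresses are hypothetical.

```python
ENTRY_WORDS = 4                          # words per buffer entry
WORD_BYTES = 8                           # one 64-bit word
ENTRY_BYTES = ENTRY_WORDS * WORD_BYTES   # 32 bytes per entry


def fill_write_buffer(addresses, merging, num_entries=4):
    """Place single-word writes into the buffer.

    Returns (entries, stalled): entries is a list of dicts holding a base
    address and per-word valid bits; stalled counts writes that found the
    buffer full.
    """
    entries = []
    stalled = 0
    for addr in addresses:
        # With merging, writes to the same aligned 32-byte region share an
        # entry; without it, each write takes the first word of a new entry.
        base = addr - addr % ENTRY_BYTES if merging else addr
        slot = (addr - base) // WORD_BYTES
        match = None
        if merging:
            match = next((e for e in entries if e["base"] == base), None)
        if match is not None:
            match["valid"][slot] = True   # combine with the existing entry
        elif len(entries) < num_entries:
            valid = [False] * ENTRY_WORDS
            valid[slot] = True
            entries.append({"base": base, "valid": valid})
        else:
            stalled += 1                  # buffer full: processor must wait
    return entries, stalled


writes = [96, 104, 112, 120, 128]  # five sequential 8-byte writes
merged, merged_stalls = fill_write_buffer(writes, merging=True)
plain, plain_stalls = fill_write_buffer(writes, merging=False)
print(len(merged), merged_stalls)
print(len(plain), plain_stalls)
```

With merging, the first four writes share one entry and the fifth opens a second; without it, four entries fill up with one valid word each and the fifth write stalls.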

#8: Compiler Optimizations

Loop Interchange

Loop Fusion

Blocking


#9: Prefetching

Issues in Prefetching

Hardware Instruction Prefetching

Hardware Data Prefetching

Software Prefetching

Software Prefetching Issues

Summary

Acknowledgements
Some of the slides contain material developed and copyrighted by Arvind, Emer (MIT), Asanovic (UCB/MIT) and instructor material for the textbook
