[Figure: block diagram labeled instruction memory, condition codes, and sequencer]
• Similarities:
– Compiler generated wide instructions with ILP encoded in the binary
– Static detection of dependencies
– Large number of architected registers
• Differences:
– Instructions in a bundle can have dependencies
– Hardware interlocks between dependent instructions
– Accommodates varying number of functional units and latencies
– Allows dynamic scheduling and functional unit binding
Static scheduling is “suggestive” rather than absolute
– Code compatibility across generations,
but software won’t run at top speed until it is recompiled, so a “shrink-wrap
binary” might need to include multiple builds
EE382A – Autumn 2009 Lecture 10 - 6 Christos Kozyrakis
IA-64 Architecture
• IA-64 “Bundle”
– Total of 128 bits
– Contains three IA-64 instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle as
well as between sequential bundles
– A collection of independent bundles forms a “group”
A more efficient and flexible way to encode ILP than a fixed VLIW format
• IA-64 Instruction
– Fixed length: 41 bits (three 41-bit instructions plus a 5-bit template fill the 128-bit bundle)
– Contains three 7-bit register specifiers
– Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
• Predicated execution
• Speculative, non-faulting load instruction
• Software-assisted branch prediction
• Register stack
• Rotating register frame
• Software-assisted memory hierarchy
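As a rough illustration of the bundle and instruction layout described above, the sketch below unpacks a 128-bit bundle into a 5-bit template and three 41-bit slots, then pulls the 6-bit predicate field and three 7-bit register specifiers out of a slot. The bit positions follow the common IA-64 ALU format (predicate in bits 0 to 5, registers in bits 6 to 26); the type and function names (`Bundle`, `Decoded`, `decode`, `qp`, `r1`, `r2`, `r3`) are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A 128-bit bundle held as two 64-bit halves (lo = bits 0..63, hi = bits 64..127).
struct Bundle { std::uint64_t lo, hi; };

struct Decoded {
    unsigned tmpl;                      // 5-bit template
    std::array<std::uint64_t, 3> slot;  // three 41-bit instruction slots
};

constexpr std::uint64_t kSlotMask = (std::uint64_t{1} << 41) - 1;

Decoded decode(const Bundle& b) {
    Decoded d;
    d.tmpl = static_cast<unsigned>(b.lo & 0x1F);            // bits 0..4
    d.slot[0] = (b.lo >> 5) & kSlotMask;                    // bits 5..45
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & kSlotMask;  // bits 46..86
    d.slot[2] = (b.hi >> 23) & kSlotMask;                   // bits 87..127
    return d;
}

// Field extractors for one 41-bit slot (ALU-style format assumed).
unsigned qp(std::uint64_t slot) { assert((slot >> 41) == 0); return slot & 0x3F; }  // predicate
unsigned r1(std::uint64_t slot) { return (slot >> 6) & 0x7F; }   // destination register
unsigned r2(std::uint64_t slot) { return (slot >> 13) & 0x7F; }  // source register 1
unsigned r3(std::uint64_t slot) { return (slot >> 20) & 0x7F; }  // source register 2
```

The arithmetic is the point: 5 + 3 × 41 = 128, and 6 + 3 × 7 = 27 bits of each slot go to the predicate and register fields, leaving the rest for the opcode and modifiers.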
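[Figure: predicated execution. Left: branching code with cmp/br selecting then1 or else1 (merging at join1), followed by a second br selecting then2 or else2 (merging at join2). Right: the same code without branches, where cmp sets p1 and p2, then1 and then2 are guarded by p1, and else1 and else2 are guarded by p2.]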
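A minimal sketch of the if-conversion idea in the predicated-execution figure, written in C++ rather than IA-64 assembly. The arithmetic in the then/else bodies is invented for illustration; only the structure (one compare producing complementary predicates p1 and p2, every operation guarded by one of them) mirrors the slide.

```cpp
#include <cassert>

// Branching version: two if-then-else regions controlled by the same compare.
int branching(int a, int b) {
    int x, y;
    if (a < b) x = a + 1; else x = b - 1;  // then1 / else1, join1
    if (a < b) y = x * 2; else y = x * 3;  // then2 / else2, join2
    return y;
}

// Predicated version: no branches; each operation commits only if its predicate is true.
int predicated(int a, int b) {
    bool p1 = a < b;   // p1, p2 <- cmp
    bool p2 = !p1;
    assert(p1 != p2);  // exactly one of the two predicates is true
    int x = 0, y = 0;
    x = p1 ? a + 1 : x;  // (p1) then1
    x = p2 ? b - 1 : x;  // (p2) else1
    y = p1 ? x * 2 : y;  // (p1) then2
    y = p2 ? x * 3 : y;  // (p2) else2
    return y;
}
```

Both functions compute the same result for any inputs; the predicated form trades the two branches for four guarded operations, half of which are nullified on each execution.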
Speculative, Non-Faulting Load
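[Figure: speculative load motion, two panels. Panel 1: in the original code, ld r1=[a] and use=r1 sit below a branch, so hoisting the load is unsafe code motion; in the transformed code, ld.s r1=[a] is placed above inst 1, inst 2 and the br, and chk.s r1 stays at the original load position before use=r1, with ld r1=[a] as recovery code. Panel 2: the use is hoisted as well; ld.s r1=[a] and use=r1 execute above the br, chk.s remains below it, and the recovery code is ld r1=[a] followed by use=r1.]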
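The ld.s / chk.s pairing can be mimicked in software as a sketch. Here memory is a map, an unmapped address stands in for a faulting access, and a deferred-exception flag on the register (IA-64 calls this token NaT) records that the speculative load failed. The names `Reg`, `load`, `load_spec`, `check_spec`, and `hoisted` are hypothetical, and the model ignores everything except the fault behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

using Memory = std::unordered_map<std::uint64_t, int>;

struct Reg { int value = 0; bool nat = false; };  // nat: deferred-exception token

// ld: a normal load faults immediately on an unmapped address.
int load(const Memory& m, std::uint64_t a) {
    auto it = m.find(a);
    if (it == m.end()) throw std::out_of_range("load fault");
    return it->second;
}

// ld.s: a speculative load never faults; it marks the register instead.
Reg load_spec(const Memory& m, std::uint64_t a) {
    auto it = m.find(a);
    if (it == m.end()) return Reg{0, true};
    return Reg{it->second, false};
}

// chk.s: if the token is set, run recovery code (re-execute the load non-speculatively).
int check_spec(const Reg& r, const Memory& m, std::uint64_t a) {
    return r.nat ? load(m, a) : r.value;
}

// The load is hoisted above the branch; the check stays where the load used to be.
int hoisted(const Memory& m, std::uint64_t a, bool path_taken) {
    Reg r1 = load_spec(m, a);          // ld.s r1=[a]
    if (!path_taken) return -1;        // br: the load result is never needed
    return check_spec(r1, m, a) + 1;   // chk.s r1; use=r1
}
```

A bad address on the path that never uses r1 is harmless, while on the path that does use it the fault surfaces at the check, which is where the original load would have raised it.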
• 128 general purpose physical integer registers
• Register names R0 to R31 are static and refer to the first 32 physical GPRs
• Register names R32 to R127 are known as “rotating registers” and are renamed onto the remaining 96 physical registers by an offset
• Remapping wraps around the rotating registers such that when offset is non-zero, physical location of R127 is just below R32
[Figure: register names versus physical registers. Names R0 to R31 map directly to the first 32 physical registers; names R32 to R127 map onto the remaining physical registers, shifted by the offset with wrap-around.]
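The renaming rule above amounts to modular arithmetic, sketched below. The function name `physical` and the direction in which the offset is applied (added to the register name) are assumptions made for this example; the slide only fixes the wrap-around behavior.

```cpp
#include <cassert>

constexpr unsigned kStatic = 32;    // R0..R31 are not renamed
constexpr unsigned kRotating = 96;  // R32..R127 rotate over 96 physical registers

// Map an architectural register name to a physical register index for a given offset.
unsigned physical(unsigned name, unsigned offset) {
    assert(name < kStatic + kRotating);
    if (name < kStatic) return name;  // static registers map one to one
    return kStatic + (name - kStatic + offset) % kRotating;  // rotate with wrap-around
}
```

With an offset of 1, R32 lands on physical register 33 and R127 wraps to physical register 32, so R127 sits just below R32 as the last bullet describes.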
Register Stack for Procedure Calls
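[Figure: register stack at three points: before the call, after the call, and after alloc. The caller frame holds in args, locals, and out args above the static GPRs, with free registers above it. After the call, the offset moves up so the caller's out args become the callee's in args (a temporary callee frame). After alloc, the callee frame is expanded with its own locals and out args. Caller registers below the offset are live but not accessed.]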
• On a procedure call, the rename offset is bumped to the beginning of the output argument registers
• Callee can then allocate its own working frame (up to 96 regs)
• If there aren’t enough free regs to allocate, HW automatically frees up space by spilling live contents not in the current frame to memory
Register stack appears infinite to SW
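The bookkeeping behind these bullets can be sketched with a small counter-based model. It tracks only how many registers are live, not their contents, and it spills the oldest caller registers when the resident total would exceed 96. The class and member names are hypothetical, and real hardware details (when the spill actually happens, how the backing store is addressed) are not modeled.

```cpp
#include <cassert>
#include <vector>

class RegisterStack {
public:
    static constexpr unsigned kPhysical = 96;  // stacked physical registers

    // alloc: size the current frame (in args + locals, then out args).
    void alloc(unsigned ins_and_locals, unsigned outs) {
        assert(ins_and_locals + outs <= kPhysical);
        locals_ = ins_and_locals;
        outs_ = outs;
        make_room();
    }

    // call: the new frame starts at the caller's out args, which become the callee's in args.
    void call() {
        callers_.push_back({locals_, outs_});
        below_ += locals_;  // caller's in args and locals stay live below the new base
        locals_ = outs_;
        outs_ = 0;
    }

    // ret: restore the caller's frame, filling back anything that was spilled from it.
    void ret() {
        assert(!callers_.empty());
        Frame f = callers_.back();
        callers_.pop_back();
        below_ -= f.locals;
        if (in_memory_ > below_) {
            filled_total_ += in_memory_ - below_;
            in_memory_ = below_;
        }
        locals_ = f.locals;
        outs_ = f.outs;
        make_room();
    }

    unsigned spilled_total() const { return spilled_total_; }
    unsigned filled_total() const { return filled_total_; }

private:
    struct Frame { unsigned locals, outs; };

    // Spill the oldest live registers until everything resident fits in 96.
    void make_room() {
        unsigned resident = (below_ - in_memory_) + locals_ + outs_;
        if (resident > kPhysical) {
            unsigned n = resident - kPhysical;
            in_memory_ += n;
            spilled_total_ += n;
        }
    }

    std::vector<Frame> callers_;
    unsigned locals_ = 0, outs_ = 0;  // current frame
    unsigned below_ = 0;              // live registers of all callers
    unsigned in_memory_ = 0;          // how many of those are currently spilled
    unsigned spilled_total_ = 0, filled_total_ = 0;
};
```

For example, a caller with 60 locals and 4 out args that calls a function allocating 50 registers needs 110 resident registers, so 14 of the caller's registers are spilled and later filled on return, without either function doing anything explicit.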
Software-Assisted Memory Hierarchies
[Figure: locality hints across the hierarchy (L1, L2, L3, main memory): temporal, non-temporal-L1 (NT at L1), non-temporal-L2 (NT at L1 and L2), and non-temporal-All (NT at L1, L2, and L3)]
[Figure: processor block diagram: branch & predicate registers, integer registers, floating-point registers; exception detect and instruction commit; data TLB, ALAT, L1D cache (16KB), L2I cache (512KB), L2D cache (256KB), queues/control, L3 cache (6MB)]
– 8-stage pipeline
– 2 threads
• Memory hierarchy
• Separate L1I, L1D, L2I, and L2D
• 6MB L3
• In 4-core Tukwila: 30MB L3!
• Use cases
– Binary compatibility, profiling, optimizations, debugging, …
• When
– Static : translation before running programs
– Dynamic : translation while running programs
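[Figure: the original program running on top of the translator, which runs on the hardware platform]
• Cross platform
– E.g. Crusoe (Transmeta)
• Same platform
– E.g. Dynamo (HP), PIN (x86), DynamoRIO (x86)
[Figure: software stacks for both cases. Cross platform: application, DLL, and OS all run on top of Code Morph. Same platform: application and DLL run on top of Dynamo, which runs on the OS.]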
• DBT tool
– Perform actual translation
– Provide two types of routines
• Translation (instrumentation) routine: when and what to translate
• Analysis routine: actual instructions to be translated
– Add metadata for translation/analysis routine
[Figure: PIN software stack: the Pintool and PIN share one address space, connected through the instrumentation APIs, on top of the operating system and hardware]
• Instrumented application
    counter++;
    sub $0xff, %edx
    counter++;
    cmp %esi, %edx
    counter++;
    jle <L1>
    counter++;
    mov $0x1, %edi
    counter++;
    add $0x10, %eax
• DBT tool (PIN style)
– When and how to translate
    // analysis routine: runs every time an instrumented instruction executes
    VOID cnt() {
        counter++;
    }
    // instrumentation routine: called once for each instruction as it is translated
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cnt, IARG_END);
    }
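To make the split between the two routines concrete without depending on Pin itself, here is a self-contained toy model. `Ins`, `insert_call_before`, and `run` are hypothetical stand-ins, not Pin APIs: the instrumentation routine runs once per static instruction, while the analysis routine it attaches runs on every dynamic execution.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Ins {
    std::string text;                           // e.g. "cmp %esi, %edx"
    std::vector<std::function<void()>> before;  // analysis calls attached to this instruction
};

// Hypothetical stand-in for an "insert call before" API.
void insert_call_before(Ins& ins, std::function<void()> fn) {
    ins.before.push_back(std::move(fn));
}

long counter = 0;
void cnt() { ++counter; }  // analysis routine: runs on every execution

int instrument_calls = 0;
void instruction(Ins& ins) {  // instrumentation routine: runs once per static instruction
    ++instrument_calls;
    insert_call_before(ins, cnt);
}

// Instrument the code once, then "execute" it the given number of times.
void run(std::vector<Ins>& code, int iterations) {
    assert(iterations >= 0);
    for (Ins& ins : code) instruction(ins);
    for (int i = 0; i < iterations; ++i)
        for (Ins& ins : code)
            for (auto& fn : ins.before) fn();  // analysis calls; the instruction itself is elided
}
```

Running five instructions ten times calls the instrumentation routine 5 times and the analysis routine 50 times, which is why the cost of the analysis routine dominates tool overhead.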
• Instrumentation point
– Before instruction
– After instruction
• For branch, fall-through or taken
– Anywhere
• Will be optimized for performance (e.g. register spill)
• Instrumentation granularity
– Instruction
– Basic block
• Single entry, single exit
– Trace
• A group of basic blocks executed back-to-back
• Analysis routine signature
– To pass useful information to instrumented function
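[Figure: DBT execution flow. The dispatch loop checks whether the next instruction is already translated. If yes, the translated block is executed. If no, instructions are interpreted (together with the DBT tool) until the end of the block; if the block is on a hot path, it is translated.]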
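The flow in this figure can be sketched as a small dispatch loop. Blocks are modeled as callables that update an accumulator and return the id of the next block; "translation" simply copies the block into a code cache once it has been interpreted `hot_threshold` times. All names and the hot-path policy are assumptions for illustration, not the behavior of any particular DBT framework.

```cpp
#include <cassert>
#include <functional>
#include <map>

struct Stats { int interpreted = 0, translated_runs = 0, translations = 0; };

// A basic block: updates the state and returns the next block id, or -1 to stop.
using BlockFn = std::function<int(int& acc)>;

Stats dispatch(const std::map<int, BlockFn>& program, int entry, int& acc, int hot_threshold) {
    assert(hot_threshold > 0);
    std::map<int, BlockFn> code_cache;  // translated blocks, keyed by block id
    std::map<int, int> exec_count;      // interpretation counts for hot-path detection
    Stats s;
    int pc = entry;
    while (pc != -1) {
        auto hit = code_cache.find(pc);
        if (hit != code_cache.end()) {   // translated? yes: execute block
            ++s.translated_runs;
            pc = hit->second(acc);
            continue;
        }
        const BlockFn& block = program.at(pc);
        ++s.interpreted;                 // no: interpret until end of block
        int next = block(acc);
        if (++exec_count[pc] >= hot_threshold) {  // hot path? translate for next time
            code_cache.emplace(pc, block);
            ++s.translations;
        }
        pc = next;
    }
    return s;
}
```

For a single-block loop that runs 10 times with a threshold of 3, the first 3 iterations are interpreted and the remaining 7 run from the code cache, so the translation cost is paid once and amortized over the hot path.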
[Figure: Quick Transit example: Solaris/SPARC application, SPARC decoder, architecture-independent optimizing kernel, x86]
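[Figure: cache simulation setup. The DBT framework invokes the analysis/instrumentation routine of the DBT tool through the DBT API; the application, the code cache, and the tool's cache model reside in memory in the same address space.]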
• Cache simulation
– Have cache model in virtual memory
– Instrument code to call cache model for each read/write memory instruction
– Profile cache accesses
Cache_t cacheHierarchy[NUM_CORE][NUM_LEVEL];
// Analysis routine: called before every instrumented memory access
VOID cacheAccess(ADDRINT address, BOOL isWrite) {
    // access cacheHierarchy
}
// Instrumentation routine: insert a call to the cache model before each memory instruction
VOID Instruction(INS ins, VOID *param) {
    if (INS_IsMemoryRead(ins)) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cacheAccess,
                       IARG_MEMORYREAD_EA, IARG_BOOL, false, IARG_END);
    }
    if (INS_IsMemoryWrite(ins)) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cacheAccess,
                       IARG_MEMORYWRITE_EA, IARG_BOOL, true, IARG_END);
    }
}
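The body of the analysis routine is left open on the slide. One possible shape for the cache model it calls is a single direct-mapped cache that counts hits and misses, sketched below. The class name, the allocate-on-every-miss policy, and the single-level structure are assumptions; the slide's `cacheHierarchy` would hold one such model per core and level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class DirectMappedCache {
public:
    DirectMappedCache(std::size_t num_lines, std::size_t line_bytes)
        : line_bytes_(line_bytes), lines_(num_lines, kInvalid) {
        assert(num_lines > 0 && line_bytes > 0);
    }

    // Record one access; returns true on a hit. Misses allocate the line (reads and writes alike).
    bool access(std::uint64_t address, bool is_write) {
        std::uint64_t line = address / line_bytes_;  // which memory line is touched
        std::size_t index = line % lines_.size();    // which cache line it maps to
        bool hit = lines_[index] == line;
        lines_[index] = line;
        if (is_write) ++writes_; else ++reads_;
        if (hit) ++hits_; else ++misses_;
        return hit;
    }

    std::uint64_t hits() const { return hits_; }
    std::uint64_t misses() const { return misses_; }
    std::uint64_t reads() const { return reads_; }
    std::uint64_t writes() const { return writes_; }

private:
    static constexpr std::uint64_t kInvalid = UINT64_MAX;  // marks an empty cache line
    std::size_t line_bytes_;
    std::vector<std::uint64_t> lines_;  // full line number stored as the tag
    std::uint64_t hits_ = 0, misses_ = 0, reads_ = 0, writes_ = 0;
};
```

An analysis routine would forward each address and the read/write flag to `access`; the profile is then read from the counters when the program exits.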
• DBT framework
– Translation (instrumentation) overhead
• Code cache, linking translated code, and hot trace selection
• DBT tool
– Analysis overhead
• Runtime overhead from instrumented instructions
– Work done in analysis routine
– Frequency of calls to the analysis routine
– Transition to the analysis routine
// Analysis routine
void cnt(int c) {
    counter += c;
}
Instrumented application:
    counter += 2;
    mov $0x1, %edi
    add $0x10, %eax
    counter += 5;
    sub $0xff, %edx
    cmp %esi, %edx
    jle <L1>
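The saving from counting in bulk can be checked with a small sketch that compares one analysis call per instruction against one call per basic block carrying the block's instruction count. The block sizes and the executed path are made-up inputs, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Result { long counter = 0; long analysis_calls = 0; };

// One analysis call before every instruction: cnt() adds 1 each time.
Result count_per_instruction(const std::vector<int>& block_sizes, const std::vector<std::size_t>& path) {
    Result r;
    for (std::size_t b : path) {
        assert(b < block_sizes.size());
        for (int i = 0; i < block_sizes[b]; ++i) {
            r.counter += 1;
            ++r.analysis_calls;
        }
    }
    return r;
}

// One analysis call per basic block: cnt(c) adds the block's instruction count.
Result count_per_block(const std::vector<int>& block_sizes, const std::vector<std::size_t>& path) {
    Result r;
    for (std::size_t b : path) {
        assert(b < block_sizes.size());
        r.counter += block_sizes[b];
        ++r.analysis_calls;
    }
    return r;
}
```

Both versions report the same instruction count, but the per-block version makes one analysis call per executed block instead of one per executed instruction, which directly reduces the frequency component of the overhead listed above.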