Lecture 10

Lecture: 10
VLIW & Dynamic Binary Translation
Department of Electrical Engineering

Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009 Lecture 10 - 1 Christos Kozyrakis
Announcements
• HW2 is due today

– We will post the solutions tonight
• Review session on Friday

– Friday 10/23, 2-3pm, Gates 498
– Review superscalar & VLIW techniques
– Review HW2 solutions

Review: VLIW
Int Slot 1 Int Slot 2 Mem Slot 1 Mem Slot 2 FP Slot Branch Slot
• Long instruction words (or packets or bundles)

– Each word contains multiple operations
• No data dependencies between operations in a word (parallelism)
– No need for RAW checks
• Each operation slot corresponds to specific functional unit
– Operation latencies are typically fixed
• Words spaced apart statically (using nops)
– All operations are ready to execute
Review: Realistic VLIW Design using Clustering
[Figure: VLIW datapath with an instruction memory and sequencer, condition codes, multi-ported register files, and functional units FAdd (1 cycle), FMul (4-cycle pipelined), FMul (4-cycle unpipelined), and FDiv (16 cycles)]
How would you deal with cache misses?

Review: VLIW Today
• Servers: Intel IA-64 architecture

– EPIC ISA (explicitly parallel instruction computing)
– Chips: Merced, Madison, Montecito, Tukwila, …
– Questionable success compared to superscalars
• Embedded: TI, NXP, ST, …

– Very successful in low-end and high-end embedded SOCs
– Rationale: good performance/power, no need for binary compatibility
– Large variety of ISAs, designs, optimization points
• Interesting VLIWs of the recent past

– Transmeta Crusoe (x86 on VLIW)

IA-64 vs. Classic VLIW
• Similarities:
– Compiler generated wide instructions with ILP encoded in the binary
– Static detection of dependencies
– Large number of architected registers
• Differences:
– Instructions in a bundle can have dependencies
– Hardware interlocks between dependent instructions
– Accommodates varying number of functional units and latencies
– Allows dynamic scheduling and functional unit binding
• Static schedules are “suggestive” rather than absolute
– Code compatibility across generations
• But software won’t run at top speed until it is recompiled, so a “shrink-wrap
binary” might need to include multiple builds
IA-64 Architecture
• 128 general-purpose registers

• 128 floating-point registers
• Arbitrary number of functional units
• Arbitrary latencies on the functional units
• Arbitrary number of memory ports
• Arbitrary implementation of the memory hierarchy
Needs retargetable compiler and recompilation to achieve maximum
program performance on different IA-64 implementations

IA-64 Instruction Format
• IA-64 “Bundle”
– Total of 128 bits
– Contains three IA-64 instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle as
well as between sequential bundles
– A collection of independent bundles forms a “group”
– A more efficient and flexible way to encode ILP than a fixed VLIW format
[Bundle layout: inst1 | inst2 | inst3 | template]
• IA-64 Instruction
– Fixed length of 41 bits (3 × 41 instruction bits + 5 template bits = 128)
– Contains three 7-bit register specifiers
– Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
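The bundle layout can be made concrete with a small decoder. This is a minimal sketch, assuming the bit positions of the published architecture (5-bit template in bits 0 to 4, then three 41-bit slots, with the 6-bit predicate field in the low bits of each slot). The names `Bundle`, `extractBits`, `bundleSlot`, and `qualifyingPredicate` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// A 128-bit bundle held as two 64-bit halves (lo = bits 0..63, hi = bits 64..127).
struct Bundle {
    uint64_t lo;
    uint64_t hi;
};

constexpr unsigned kTemplateBits = 5;   // assumed template width
constexpr unsigned kSlotBits = 41;      // assumed instruction slot width

// Extract `width` bits (0 < width < 64) starting at bit `pos` of the bundle.
inline uint64_t extractBits(const Bundle& b, unsigned pos, unsigned width) {
    assert(width > 0 && width < 64 && pos + width <= 128);
    uint64_t value;
    if (pos >= 64) {
        value = b.hi >> (pos - 64);
    } else if (pos == 0) {
        value = b.lo;
    } else {
        // The field may straddle the two halves; stitch them together.
        value = (b.lo >> pos) | (b.hi << (64 - pos));
    }
    return value & ((uint64_t{1} << width) - 1);
}

// Template field: the lowest 5 bits of the bundle.
inline uint64_t bundleTemplate(const Bundle& b) {
    return extractBits(b, 0, kTemplateBits);
}

// Slot i (0..2) starts right after the template and the preceding slots.
inline uint64_t bundleSlot(const Bundle& b, unsigned i) {
    assert(i < 3);
    return extractBits(b, kTemplateBits + kSlotBits * i, kSlotBits);
}

// The 6-bit predicate field selects one of the 64 predicate registers.
inline unsigned qualifyingPredicate(uint64_t slot) {
    return static_cast<unsigned>(slot & 0x3F);
}
```

The arithmetic 5 + 3 × 41 = 128 is what makes three instructions plus a template fit exactly into one bundle.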

Interesting Features of IA64
• Predicated execution
• Speculative, non-faulting load instruction
• Software-assisted branch prediction
• Register stack
• Rotating register frame
• Software-assisted memory hierarchy

Predicated Execution
• Each instruction can be separately predicated

• 64 one-bit predicate registers
– Each instruction carries a 6-bit predicate field
• An instruction is effectively a NOP if its predicate is false
• Assumes IA-64 processors have lots of spare resources
– Converts control flow into dataflow
Branching code: cmp; br; else1; else2; br; then1; then2; join1; join2
Predicated code: p1, p2 ← cmp; (p2) else1, (p1) then1; join1; (p1) then2, (p2) else2; join2
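If-conversion can be illustrated in software. The sketch below is not IA-64 code: it models predicates as booleans and predicated operations as selects, and the arithmetic in each arm is invented for illustration. It shows that issuing both arms under complementary predicates gives the same result as branching.

```cpp
#include <cassert>

// Branching version: only one arm of the if/else executes.
inline int branching(int a, int b) {
    int r;
    if (a < b) {
        r = a + 1;   // then1
        r *= 2;      // then2
    } else {
        r = b - 1;   // else1
        r *= 3;      // else2
    }
    return r + 10;   // join
}

// Predicated version: the compare writes two complementary predicates and
// every operation is issued. An operation whose predicate is false leaves
// its destination unchanged, i.e. it behaves like a NOP.
inline int predicated(int a, int b) {
    const bool p1 = a < b;   // p1, p2 <- cmp
    const bool p2 = !p1;
    int r = 0;
    r = p1 ? a + 1 : r;      // (p1) then1
    r = p2 ? b - 1 : r;      // (p2) else1
    r = p1 ? r * 2 : r;      // (p1) then2
    r = p2 ? r * 3 : r;      // (p2) else2
    return r + 10;           // join
}

// Usage: both versions agree on a small range of inputs.
inline void checkEquivalence() {
    for (int a = -3; a <= 3; ++a) {
        for (int b = -3; b <= 3; ++b) {
            assert(branching(a, b) == predicated(a, b));
        }
    }
}
```

The predicated version has no control dependence between the arms, which is why it needs the spare issue slots the slide mentions.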
Speculative, Non-Faulting Load
Original code:        After unsafe code motion:
inst 1                ld.s r1=[a]
inst 2                inst 1
....                  inst 2
br                    ....
....                  br
ld r1=[a]             ....
use=r1                chk.s r1     (recovery code: ld r1=[a])
                      use=r1
• ld.s fetches speculatively from memory
– i.e. any exception due to ld.s is suppressed
• If ld.s r did not cause an exception then chk.s r is a NOP, else a
branch is taken (to some compensation code)

Speculative, Non-Faulting Load
Original code:        After unsafe code motion:
inst 1                ld.s r1=[a]
inst 2                inst 1
....                  inst 2
br                    use=r1
....                  ....
ld r1=[a]             br
use=r1                ....
                      chk.s use    (recovery code: ld r1=[a]; use=r1)
• Speculatively loaded data can be consumed prior to the check
• “speculation” status is propagated with speculated data
• Any instruction that uses a speculative result also becomes speculative itself
(i.e. suppressed exceptions)
• chk.s checks the entire dataflow sequence for exceptions
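The propagation of speculation status can be sketched as a value that carries a deferred-exception flag. This is a software model under stated assumptions, not the hardware mechanism: a null pointer stands in for a faulting address, and `SpecReg`, `loadSpeculative`, `addSpec`, and `checkSpeculation` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// A register value plus a flag recording a suppressed (deferred) exception.
struct SpecReg {
    int64_t value;
    bool deferred;
};

// Models ld.s: a faulting access (here: a null address) sets the flag
// instead of raising an exception.
inline SpecReg loadSpeculative(const int64_t* addr) {
    if (addr == nullptr) {
        return SpecReg{0, true};
    }
    return SpecReg{*addr, false};
}

// Any instruction that consumes a speculative value propagates the flag.
inline SpecReg addSpec(const SpecReg& a, const SpecReg& b) {
    const bool deferred = a.deferred || b.deferred;
    return SpecReg{deferred ? 0 : a.value + b.value, deferred};
}

// Models chk.s: returns true when the recovery code has to run.
inline bool checkSpeculation(const SpecReg& r) {
    return r.deferred;
}
```

Because the flag travels with the data, a single check on the last value of a dependence chain covers every speculative instruction that fed it.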

Speculative “Advanced” Load
Original code:        With advanced load:
inst 1                ld.a r1=[x]
inst 2                inst 1
....                  inst 2
st [?]                ....
....                  st [?]       (potential aliasing)
ld r1=[x]             ....
use=r1                ld.c r1=[x]
                      use=r1
• ld.a starts the monitoring of any store to the same address as the
advanced load
• If no aliasing has occurred since ld.a, ld.c is a NOP
• If aliasing has occurred, ld.c re-loads from memory
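The monitoring started by ld.a can be modeled as a small table keyed by destination register. The sketch below is a simplification under stated assumptions: it tracks exact addresses only (no access sizes or partial overlap), has unlimited capacity, and the class name `Alat` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified model of the table that tracks advanced loads.
class Alat {
public:
    // ld.a: remember that `reg` holds data loaded early from `addr`.
    void advancedLoad(unsigned reg, uintptr_t addr) {
        entries_[reg] = addr;
    }

    // Every store removes the entries whose address it matches.
    void store(uintptr_t addr) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second == addr) {
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
    }

    // ld.c: true if the entry survived, so the check acts as a NOP;
    // false means the load has to be repeated.
    bool checkLoad(unsigned reg) const {
        return entries_.count(reg) != 0;
    }

private:
    std::unordered_map<unsigned, uintptr_t> entries_;
};
```

A missing entry is always safe: it only costs a redundant reload, which is why a real table can be small and drop entries freely.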

Using Speculative Load Results
Original code:        With advanced load and early use:
inst 1                ld.a r1=[x]
inst 2                inst 1
....                  inst 2
st [?]                use=r1
....                  ....
ld r1=[x]             st [?]       (potential aliasing)
use=r1                ....
                      chk.a r1     (recovery code: ld r1=[x]; use=r1)
                      ....

Branch Prediction
• Static branch hints can be encoded with every branch

– taken vs. not-taken
– whether to allocate an entry in the dynamic BP hardware
• SW and HW have joint control of BP hardware
– “brp” (branch prediction) instruction can be issued ahead of the actual branch to
preset the contents of BPT and BTAC
• Itanium uses a 512-entry 2-level BPT and 64-entry BTAC
• TAR (Target Address Register)
– a small, fully-associative BTAC-like structure
– contents are controlled entirely by a “prepare-to-branch” inst.
– a hit in TAR overrides all other predictions
• RSB (Return Stack Buffer)
– Procedure return addr is pushed (or popped) when a procedure is called (or when it
returns)
– Predicts nPC when executing register-indirect branches

Register Renaming
• 128 general-purpose physical integer registers
• Register names R0 to R31 are static and refer to the first 32 physical GPRs
• Register names R32 to R127 are known as “rotating registers” and are renamed
onto the remaining 96 physical registers by an offset
• Remapping wraps around the rotating registers such that when offset is non-zero,
physical location of R127 is just below R32
[Figure: register names R0 to R31 map directly onto physical registers R0 to R31; register names R32 to R127 map, shifted by the offset, onto physical registers 32 and up]
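The remapping rule described on this slide reduces to one modular addition. The sketch below follows the slide's simplified model (a single offset applied to names R32 to R127); the function name `physicalRegister` and the constants are hypothetical.

```cpp
#include <cassert>

constexpr unsigned kStaticRegs = 32;     // R0..R31 are never renamed
constexpr unsigned kRotatingRegs = 96;   // R32..R127 rotate

// Map an architectural register name to a physical register number.
inline unsigned physicalRegister(unsigned name, unsigned offset) {
    assert(name < kStaticRegs + kRotatingRegs);
    if (name < kStaticRegs) {
        return name;
    }
    // Rotate within the 96-entry region and wrap around at its end.
    return kStaticRegs + (name - kStaticRegs + offset) % kRotatingRegs;
}
```

With an offset of 3, R32 maps to physical register 35 and R127 wraps around to physical register 34, which is the "just below R32" case from the last bullet.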
Register Stack for Procedure Calls
[Figure: register stack before a call, after the call, and after alloc. The caller frame (in args, locals, out args) sits above the static GPRs at the current offset. The call moves the offset so that the caller's out args become the in args of a temporary callee frame; alloc then expands the callee frame with locals and out args. Caller registers below the offset are live but not accessed; registers above the frame are free.]
• On a procedure call, the rename offset is bumped to the beginning of output argument
registers
• Callee can then allocate its own working frame (up to 96 regs)
• If there aren’t enough free regs to be allocated, HW automatically frees up space by spilling
live contents not in the current frame to memory
• Register stack appears infinite to SW
Software-Assisted Memory Hierarchies
• ISA provides for separate storages for “temporal” vs “non-temporal”
data, each with its own multiple levels of hierarchy
• Load and Store instructions can give hints about where cached copies
should be held after a cache miss
[Figure: temporal L1, L2, L3 and non-temporal (NT) L1, L2, L3 structures in front of main memory, with hint levels temporal, non-temporal-L1, non-temporal-L2, and non-temporal-All]

Latest Itanium Core
• 6 wide instruction fetch and issue
– 6 wide integer, 2 wide FP, 4 wide ld/st, 3 wide branch
– 1 cycle L1 data cache
– 8-stage pipeline
– 2-threads
• Memory hierarchy
– Separate L1I, L1D, L2I, and L2D
– 6MB L3
– In 4-core Tukwila: 30MB L3!
[Figure: core block diagram. L1I cache (16KB), branch prediction, and instruction TLB; issue ports B B B I I M M M M F F; register stack engine / re-name; branch & predicate, integer, and floating-point registers; branch unit, integer unit, memory/integer unit, and floating-point unit; exception detect and instruction commit; ALAT, data TLB, and L1D cache (16KB); L2I cache (512KB) and L2D cache (256KB); queues/control and L3 cache (6MB); Intel QuickPath® interconnect interface logic]

DBT: Dynamic Binary Translation

DBT: Dynamic Binary Translation
• Up to now, CPU hardware design

– Pipelining, superscalar, speculative execution, …
– ISA to interface hardware features (VLIW)
• What is dynamic binary translation?

– Software tool to translate a binary code to another code at runtime
• How does dynamic binary translation work?

– Internals, DBT APIs, …
• Use cases
– Binary compatibility, profiling, optimizations, debugging, …

What is binary translation?
• Translating programs in one binary format to another
• What : different types of translation

– Different ISAs
• E.g. PowerPC => x86 : to port programs across platforms
l tf
– Same ISAs
• E.g. x86 => x86 : code optimization or feature instrumentation
– Intermediate Representation
• E.g. Java bytecode => x86 : to avoid interpretation
• When
– Static : translation before running programs
– Dynamic : translation while running programs

Dynamic Binary Translation Overview
Original Program
Dynamic Binary Translator
Hardware Platform
• Dynamic program modifier

– Start with interpretation or intercept program execution
– Observe a sequence of instructions
– Produce new code and save in code cache
• Change the jump/branch target address properly
– Manipulate or add instructions as needed

DBT Configurations (1)
• Cross platform
– E.g. Crusoe (Transmeta)
– Figure layers: Application, DLL, OS, Code Morph, CPU = VLIW
• Same platform
– E.g. Dynamo (HP), PIN (x86), Dynamo-Rio (x86)
– Figure layers: Application, DLL, OS, Dynamo, CPU = HP PA-8000

DBT Configurations (2)
• Virtual Machine
– E.g. ESX server (VMware)
– Figure layers: two guests (each with Application, DLL, OS), ESX server, CPU = x86
• JIT compilation
– E.g. JVM (Sun), C# (MS)
– Figure layers: Application, JVM, DLL, OS, CPU

Why is DBT a good idea?
• Feature support without hardware/source-code modification

– E.g. Binary compatibility, virtual machine, …
• Vs. static binary translation

– Access to complete program
• Programs are fully linked
– No need to re-link
• Translate instruction and jump to it
– Access to program state
• Include dynamic values
– Handle self-modifying code
– Can adapt to changes in program behavior

How does it work?
• Combination of DBT framework + tool

• DBT framework
– Analyze program
• Instruction type and arguments
• Basic block and control flow
• Register and memory values
– Provide translation primitives
• Analyze instruction
• Add/delete instruction
• Change control flow
• Read and write register values
• Read and write memory values
– Optimize the translated code

How does it work? (2)
• DBT tool
– Perform actual translation
– Provide two types of routines
• Translation (instrumentation) routine : when and what to translate
• Analysis routine : actual instructions to be translated
– Add metadata for translation/analysis routine

DBT Framework : PIN for x86
[Figure: Pin architecture. The Pintool, PIN (instrumentation APIs, virtual machine (VM) with JIT compiler and emulation unit, code cache), and the application share one address space, on top of the operating system and hardware]

Application and DBT tool : Instruction Counter
• Instrumented application
    counter++;
    sub $0xff, %edx
    counter++;
    cmp %esi, %edx
    counter++;
    jle <L1>
    counter++;
    mov $0x1, %edi
    counter++;
    add $0x10, %eax
• DBT tool (PIN style)
– When and how to translate
    // analysis routine: runs before every executed instruction
    VOID cnt() {
        counter++;
    }
    // instrumentation routine: runs once when an instruction is translated
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cnt, IARG_END);
    }

Instrumentation Options
• Instrumentation point
– Before instruction
– After instruction
• For branch, fall-through or taken
– Anywhere
• Will be optimized for performance (e.g. register spill)
• Instrumentation granularity
– Instruction
– Basic block
• Single entry, single exit
– Trace
• A group of basic blocks executed back-to-back
• Analysis routine signature
– To pass useful information to instrumented function

Translation Loop in VM
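[Flowchart: the dispatch loop checks whether the next instruction is already translated. If yes, the translated block is executed and control returns to the dispatch loop. If no, the instruction is interpreted until the end of the block; if the block lies on a hot path, it is then translated with the DBT tool.]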
* Some DBTs have no interpretation mode
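The dispatch decision can be sketched as a lookup in the code cache followed by a hotness counter. This is a minimal model under stated assumptions: blocks are identified by their guest start address, "translation" is reduced to inserting that address into a set, and the class `Dispatcher` with its threshold parameter is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

class Dispatcher {
public:
    enum class Mode { Interpreted, Translated };

    explicit Dispatcher(unsigned hotThreshold) : hotThreshold_(hotThreshold) {
        assert(hotThreshold > 0);
    }

    // Run the block that starts at guest address `pc` once and report
    // whether it ran from the code cache or through the interpreter.
    Mode runBlock(uint64_t pc) {
        if (codeCache_.count(pc) != 0) {
            return Mode::Translated;      // hit: execute the cached translation
        }
        // Miss: interpret the block and profile it.
        if (++execCount_[pc] >= hotThreshold_) {
            codeCache_.insert(pc);        // hot path: translate for next time
        }
        return Mode::Interpreted;
    }

    std::size_t cachedBlocks() const { return codeCache_.size(); }

private:
    unsigned hotThreshold_;
    std::unordered_map<uint64_t, unsigned> execCount_;   // interpretation counts
    std::unordered_set<uint64_t> codeCache_;             // translated block starts
};
```

A DBT without an interpretation mode corresponds to a threshold of 1 with translation performed before the first execution rather than after it.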

Code Cache
• Software cache in virtual address space

– Keep translated code in memory for later reuse
– Essential to amortize the high cost of translation and of good optimizations
– Trade-off: cost of memory vs. higher reuse
• Allocation policy
– Remember every translated block: doubling code size at least
– Pick up hot path: temporal locality
• Granularity
– Basic block
• Easy, but frequent context switch between interpretation and direct execution
– Trace
• Amortize context switch overhead at the cost of path profiling

Linking Translated Basic Blocks (1)
• Context-switch overhead between interpretation and translation

– Additional instructions for context switch
– Register Spilling
– Limited DBT code optimization boundary
• E.g. Constant propagation, ILP scheduling, loop unrolling, …
• Link translated code
– Block reordering for direct branch
– Indirect branch target (IBT) table for indirect branch
• (key : value) = (original target, translated target)
• Inline preferred target
– Inline the block of the preferred target
– Add check code

Linking Translated Basic Blocks (2)
    <load actual target>
    <compare to inlined target>
    if equal goto <inlined target>
    look up IBT table
    if (!tag-match)
        goto <exit stub>
    jump to tag-value (<translated target>)
[Indirect Branch Target (IBT) table columns: Original Target | Translated Target]
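The lookup sequence for an indirect branch can be written as a short function. This is a sketch under stated assumptions: addresses are plain integers, the exit stub is represented by a sentinel value, and `IndirectBranchSite`, `IbtTable`, `kExitStub`, and `resolveIndirect` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Per-branch-site data: the preferred (predicted) original target and the
// address of its translation, which is inlined right after the check.
struct IndirectBranchSite {
    uint64_t inlinedOriginal;
    uint64_t inlinedTranslated;
};

// Hypothetical sentinel meaning "leave the code cache, return to the translator".
constexpr uint64_t kExitStub = 0;

// (key : value) = (original target, translated target)
using IbtTable = std::unordered_map<uint64_t, uint64_t>;

inline uint64_t resolveIndirect(const IndirectBranchSite& site,
                                const IbtTable& ibt,
                                uint64_t actualTarget) {
    // Fast path: compare against the inlined target.
    if (actualTarget == site.inlinedOriginal) {
        return site.inlinedTranslated;
    }
    // Slow path: look up the IBT table.
    const auto it = ibt.find(actualTarget);
    if (it == ibt.end()) {
        return kExitStub;   // no tag match
    }
    return it->second;      // jump to the translated target
}
```

Only the last case pays the full cost of leaving translated code, so a well-chosen inlined target keeps most indirect branches on the fast path.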

Exceptions
• Asynchronous exceptions (interrupts)

– Can be delayed, easy
– Wait until the current translated code finishes
– Translate exception handler
– Invoke translated exception handler
• Synchronous exception
– During interpretation, no problem
– While executing translated code, either revert the instruction execution
and interpret
• E.g. checkpoint support in Crusoe’s VLIW hardware
– Or make sure to stop at exact point
• E.g. NullPointerException in Java
– Invoke translated exception handler

Self-modifying code
• Solution 1 : stick to interpretation

– Modifying code located in heap
– Always interpret when jumping to heap address
• Solution 2 : invalidate translated code

– When jumping to heap address, translate the code and cache it
– Write-protect the pages with modifying code
– Execute translated code
– If the code changes later, page-fault exception is triggered
– Invalidate corresponding translated code of the page in code cache
– Next time, the code is re-translated
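Solution 2 needs an index from guest pages to the translations derived from them. The sketch below models only that bookkeeping: the write protection and the fault delivery themselves are OS services and are not shown, the 4 KB page size is an assumption, and `CodeCacheIndex` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

constexpr uint64_t kPageSize = 4096;   // assumed guest page size

class CodeCacheIndex {
public:
    // Record that the block starting at `guestPc` has been translated.
    void addTranslation(uint64_t guestPc) {
        byPage_[guestPc / kPageSize].insert(guestPc);
    }

    bool isTranslated(uint64_t guestPc) const {
        const auto it = byPage_.find(guestPc / kPageSize);
        return it != byPage_.end() && it->second.count(guestPc) != 0;
    }

    // Called from the write-fault handler of a protected code page.
    // Drops every translation from that page and returns how many were dropped.
    std::size_t onWriteFault(uint64_t faultAddr) {
        const auto it = byPage_.find(faultAddr / kPageSize);
        if (it == byPage_.end()) {
            return 0;
        }
        const std::size_t dropped = it->second.size();
        byPage_.erase(it);
        return dropped;
    }

private:
    std::unordered_map<uint64_t, std::unordered_set<uint64_t>> byPage_;
};
```

Invalidating at page granularity is conservative: a write to data that shares a page with code also discards that page's translations, which are then rebuilt on the next execution.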

Example Use for Binary Compatibility:
QuickTransit (Transitive)
[Figure: a Solaris/SPARC app runs on QuickTransit (SPARC decoder, architecture-independent optimizing kernel, x86 code generator) next to a native x86 application, on a Linux x86 server]
• Goal : Run applications for Solaris/SPARC on Linux/x86
• Problem : Migrations are hard
– Often 50% of legacy software system should be decommissioned
• Use DBT

Example Use for Architecture Study:
Cache Simulation (1)
[Figure: the application runs on the DBT framework. Through the DBT API, the DBT tool's analysis/instrumentation routine is invoked for memory instructions and passes the memory address to the cache model.]
• Cache simulation
– Have cache model in virtual memory
– Instrument code to call cache model for each read/write memory instruction
– Profile cache accesses

Architecture Study : Cache Simulation (2)
    Cache_t cacheHierarchy[NUM_CORE][NUM_LEVEL];
    // Analysis routine: called at run time before each memory access
    VOID cacheAccess(ADDRINT address, BOOL isWrite) {
        // access cacheHierarchy
    }
    // Instrumentation routine: called once per instruction when it is translated
    VOID Instruction(INS ins, VOID *param) {
        if (INS_IsMemoryRead(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cacheAccess,
                           IARG_MEMORYREAD_EA, IARG_BOOL, false, IARG_END);
        }
        if (INS_IsMemoryWrite(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)cacheAccess,
                           IARG_MEMORYWRITE_EA, IARG_BOOL, true, IARG_END);
        }
    }

Example Use for Debugging:
memCheck (Valgrind)
• Valgrind : Heavy DBT
– Prefers functionality over performance
• Debugging, profiling, …
– Support shadow memory
• Maintains duplicate copy of every memory byte and register used by
applications
• Memcheck
– Help debugging memory errors
• Memory leak, access to freed or un-initialized data
– Per memory byte
• A (addressability) bit : if set, application can reference the byte
• V (validity) bit : if set, the data is defined
– Heap blocks for all live memory blocks
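The per-byte A and V bits can be modeled with a map from address to two flags. This is a deliberately simplified sketch, not Valgrind's implementation: it reports an undefined value at the load itself, whereas the real tool tracks definedness further and reports later, and `ShadowMemory` with its methods is a hypothetical interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

class ShadowMemory {
public:
    enum class LoadStatus { Ok, Unaddressable, Undefined };

    // malloc: bytes become addressable (A set) but undefined (V clear).
    void onMalloc(uint64_t addr, std::size_t size) {
        for (std::size_t i = 0; i < size; ++i) {
            shadow_[addr + i] = Bits{true, false};
        }
    }

    // free: bytes are no longer addressable.
    void onFree(uint64_t addr, std::size_t size) {
        for (std::size_t i = 0; i < size; ++i) {
            shadow_.erase(addr + i);
        }
    }

    // store: fails on an unaddressable byte, otherwise marks it defined.
    bool onStore(uint64_t addr) {
        const auto it = shadow_.find(addr);
        if (it == shadow_.end() || !it->second.addressable) {
            return false;
        }
        it->second.valid = true;
        return true;
    }

    // load: classify the access using the A bit first, then the V bit.
    LoadStatus onLoad(uint64_t addr) const {
        const auto it = shadow_.find(addr);
        if (it == shadow_.end() || !it->second.addressable) {
            return LoadStatus::Unaddressable;
        }
        return it->second.valid ? LoadStatus::Ok : LoadStatus::Undefined;
    }

private:
    struct Bits {
        bool addressable;   // A bit
        bool valid;         // V bit
    };
    std::unordered_map<uint64_t, Bits> shadow_;
};
```

Access to freed memory shows up as `Unaddressable` and a read of fresh heap memory as `Undefined`, which are two of the error classes listed above.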

Example Use for Dynamic Code Optimization:
Dynamo
• Dynamic optimization by software
– Dynamic : leverage runtime information (compared with compiler)
– Software : flexible, sophisticated (compared with out-of-order execution)
• But, with time constraints
• Runtime information
– Trace, DLL, function call counter, dynamic values
• Optimizations
– Superblock : allows additional chance for classic optimization
• ILP scheduling, copy propagation, loop unrolling
– Inlining : balance code size and function call overhead
– Fast shared library invocation : remove lookup overhead
– Register allocation : reduce register spill

DBT Overheads
• DBT framework
– Translation (instrumentation) overhead
• Code cache, linking translated code, and hot trace selection
• DBT tool
– Analysis overhead
• Runtime overhead from instrumented instructions
– Work done in analysis routine
– Frequency of calls to analysis routine
– Transition to analysis routine

Reducing Analysis Routine Overhead
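• Instrumented application
    counter += 3;
    sub $0xff, %edx
    cmp %esi, %edx
    jle <L1>
    counter += 2;
    mov $0x1, %edi
    add $0x10, %eax
• DBT tool
    // Analysis routine: one call per executed basic block
    VOID cnt(UINT32 c) {
        counter += c;
    }
    // Instrumentation routine: the block size is computed once, at translation time
    VOID Trace(TRACE trace, VOID *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)cnt,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
        }
    }
• Shift computation from analysis routine to instrumentation routine
– E.g. count the instructions of a basic block at instrumentation time instead of
one by one in the analysis routine
• Code optimization by DBT
– Redundant code elimination, register renaming, …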
Reducing Frequency of Calling Analysis Routine
    counter += 5;
    sub $0xff, %edx
    cmp %esi, %edx
    jle <L1>
    mov $0x1, %edi
    add $0x10, %eax

    L1:                  (side exit of the trace)
    counter -= 2;
• Instrument at larger granularity

– Instruction < basic block < trace
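The effect of granularity on the number of analysis calls can be checked on the five-instruction example used in these slides. The sketch below hard-codes that trace (a three-instruction block ending in `jle`, followed by a two-instruction block) and is only a model; `CountResult` and the three functions are hypothetical names.

```cpp
#include <cassert>

struct CountResult {
    unsigned count;           // instructions counted
    unsigned analysisCalls;   // how often the analysis code ran
};

// One analysis call per executed instruction.
inline CountResult countPerInstruction(bool branchTaken) {
    const unsigned executed = branchTaken ? 3u : 5u;
    return CountResult{executed, executed};
}

// One analysis call per executed basic block, adding the block size.
inline CountResult countPerBasicBlock(bool branchTaken) {
    CountResult r{3, 1};              // first block: sub, cmp, jle
    if (!branchTaken) {
        r.count += 2;                 // second block: mov, add
        r.analysisCalls += 1;
    }
    return r;
}

// One analysis call at trace entry, plus a correction on the side exit.
inline CountResult countPerTrace(bool branchTaken) {
    CountResult r{5, 1};              // counter += 5 at trace entry
    if (branchTaken) {
        r.count -= 2;                 // counter -= 2 at L1
        r.analysisCalls += 1;
    }
    return r;
}
```

All three schemes count the same number of instructions; the trace-level scheme needs a single call on the common fall-through path and pays for the correction only when the side exit is taken.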

Reducing Analysis Transition Overhead
Single analysis routine (not inlined because of the conditional call):
    int icount = 0;
    VOID samplePrint(VOID *ip) {
        icount--;
        if (icount % 1000 == 0) {    // print one sample per 1000 instructions
            fprintf(trace, "%p\n", ip);
        }
    }
Split into two routines:
    int icount = 0;
    ADDRINT countDown() {            // inlined
        icount--;
        return icount % 1000 == 0;
    }
    VOID Print(VOID *ip) {           // not inlined
        fprintf(trace, "%p\n", ip);
    }
• Conditional Inlining
– Separate if-then and if-else functions
• Instrumentation Scheduling
– Use ANYWHERE for better register spilling

Summary
• Dynamic binary translation is a powerful software tool

– Translate binary code to another
– Same ISAs, different ISAs, IRs
• Use cases
– Binary compatibility, profiling, architecture study, debugging, virtual
machine, security, reliability, …
• Performance optimization
– Code cache, linking translated blocks, hot trace selection
– Software techniques for efficient DBT tool
– Architectural support
• An attractive solution to provide new features without hardware/source-
code modification
