
Lecture: 10

VLIW & Dynamic Binary Translation

Department of Electrical Engineering


Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009 Lecture 10 - 1 Christos Kozyrakis
Announcements

• HW2 is due today


– We will post the solutions tonight

• Review session on Friday


– 10/23, 2-3pm, Gates 498
– Review superscalar & VLIW techniques
– Review HW2 solutions



Review: VLIW

Int Slot 1 Int Slot 2 Mem Slot 1 Mem Slot 2 FP Slot  Branch Slot

• Long instruction words (or packets or bundles)


– Each word contains multiple operations
• No data dependencies between operations in a word (parallelism)
– No need for RAW checks
• Each operation slot corresponds to specific functional unit
– Operation latencies are typically fixed
• Words spaced apart statically (using nops)
– All operations are ready to execute

Review: Realistic VLIW Design using Clustering

[Figure: clustered VLIW datapath. Multi-ported register files feed the functional units FAdd (1 cycle), FMul (4-cycle pipelined), FMul (4-cycle unpipelined), and FDiv (16 cycles); an instruction memory and a sequencer with condition codes supply the long instruction words.]

How would you deal with cache misses?

Review: VLIW Today

• Servers: Intel IA-64 architecture


– EPIC ISA (explicitly parallel instruction computing)
– Chips: Merced, Madison, Montecito, Tukwila, …
– Questionable success compared to superscalars

• Embedded: TI, NXP, ST, …


– Very successful in low-end and high-end embedded SOCs
– Rationale: good performance/power, no need for binary compatibility
– Large variety of ISAs, designs, optimization points

• Interesting VLIWs of the recent past


– Transmeta Crusoe (x86 on VLIW)



IA-64 vs. Classic VLIW

• Similarities:
– Compiler generated wide instructions with ILP encoded in the binary
– Static detection of dependencies
– Large number of architected registers

• Differences:
– Instructions in a bundle can have dependencies
– Hardware interlocks between dependent instructions
– Accommodates varying number of functional units and latencies
– Allows dynamic scheduling and functional unit binding
   Static scheduling is “suggestive” rather than absolute
– Code compatibility across generations
   but software won’t run at top speed until it is recompiled, so a “shrink-wrap binary” might need to include multiple builds

IA-64 Architecture

• 128 general-purpose registers


• 128 floating-point registers
• Arbitrary number of functional units
• Arbitrary latencies on the functional units
• Arbitrary number of memory ports
• Arbitrary implementation of the memory hierarchy

Needs retargetable compiler and recompilation to achieve maximum program performance on different IA-64 implementations


IA-64 Instruction Format

• IA-64 “Bundle”
– Total of 128 bits
– Contains three IA-64 instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle as
well as between sequential bundles
– A collection of independent bundles forms a “group”
A more efficient and flexible way to encode ILP than a fixed VLIW format

inst1 inst2 inst3 temp

• IA-64 Instruction
– Fixed-length, 41 bits long (3 × 41 instruction bits + 5 template bits = 128)
– Contains three 7-bit register specifiers
– Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
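The bundle format can be sketched as plain bit packing. This is a minimal sketch, assuming a 5-bit template in the low bits followed by three 41-bit instruction slots (5 + 3 × 41 = 128) and the predicate number in the low 6 bits of each slot; the helper names `pack_bundle`, `unpack_bundle`, and `predicate_field` are illustrative, not part of any toolchain.

```python
TEMPLATE_BITS = 5   # assumed width of the template field
SLOT_BITS = 41      # assumed width of one instruction slot (5 + 3 * 41 = 128)
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def predicate_field(slot):
    """Return the 6-bit predicate register number (assumed low 6 bits)."""
    return slot & 0x3F
```

Packing the largest template and three all-ones slots fills exactly 128 bits, which is a quick way to check that the field widths add up.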


Interesting Features of IA64

• Predicated execution
• Speculative, non-faulting load instruction
• Software-assisted branch prediction
• Register stack
• Rotating register frame
• Software-assisted memory hierarchy


Predicated Execution

• Each instruction can be separately predicated


• 64 one-bit predicate registers
– Each instruction carries a 6-bit predicate field
• An instruction is effectively a NOP if its predicate is false
• Assumes IA-64 processors have lots of spare resources
– Converts control flow into dataflow

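[Figure: if-conversion of an if-then-else. The branching code (cmp; br; else1; else2; br; then1; then2; join1; join2) becomes branch-free code: p1, p2 ← cmp sets two complementary predicates, then1 and then2 are guarded by p1, else1 and else2 are guarded by p2, and join1 and join2 follow without any branch.]
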
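The conversion of control flow into dataflow can be sketched as follows. This is a minimal sketch in Python rather than IA-64 assembly; `commit` is a hypothetical helper that models a predicated write, and the two arms (`a + 1`, `b * 2`) are arbitrary placeholders.

```python
def branching(a, b):
    """Original control flow: compare, then branch to one of two arms."""
    if a < b:
        result = a + 1      # then-arm
    else:
        result = b * 2      # else-arm
    return result


def commit(predicate, old_value, new_value):
    """A predicated write: it takes effect only when the predicate is true."""
    return new_value if predicate else old_value


def predicated(a, b):
    """If-converted form: both arms are issued, predicates select the result."""
    p1 = a < b              # p1, p2 <- cmp
    p2 = not p1
    result = 0
    result = commit(p1, result, a + 1)   # (p1) then-arm
    result = commit(p2, result, b * 2)   # (p2) else-arm
    return result
```

Both arms are always evaluated in `predicated`, which is why the slide notes that predication assumes spare execution resources.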
Speculative, Non-Faulting Load

[Figure: unsafe code motion of a load above a branch]

Before:                 After:
    inst 1                  ld.s r1=[a]
    inst 2                  inst 1
    ….                      inst 2
    br                      ….
    ….                      br
    ld r1=[a]               ….
    use=r1                  chk.s r1    (recovery code: ld r1=[a])
                            use=r1

• ld.s fetches speculatively from memory
   i.e. any exception due to ld.s is suppressed
• If ld.s r did not cause an exception then chk.s r is a NOP, else a branch is taken (to some compensation code)


Speculative, Non-Faulting Load

[Figure: the use of the loaded value is hoisted above the branch together with the load]

Before:                 After:
    inst 1                  ld.s r1=[a]
    inst 2                  inst 1
    ….                      inst 2
    br                      use=r1
    ….                      ….
    ld r1=[a]               br
    use=r1                  ….
                            chk.s use    (recovery code: ld r1=[a]; use=r1)

• Speculatively loaded data can be consumed prior to check
• “speculation” status is propagated with speculated data
• Any instruction that uses a speculative result also becomes speculative itself (i.e. suppressed exceptions)
• chk.s checks the entire dataflow sequence for exceptions
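The propagation of the speculation status can be sketched with a flag that travels with each value. This is a toy model, not IA-64 semantics in full: `Value`, `MEMORY`, and the helper names are assumptions made for illustration, and a missing dictionary key stands in for a faulting address.

```python
from dataclasses import dataclass

MEMORY = {0x100: 7}          # toy memory; any other address "faults"


@dataclass
class Value:
    data: int = 0
    deferred_fault: bool = False   # speculation status carried with the data


def ld(address):
    """Ordinary load: faults immediately on a bad address."""
    if address not in MEMORY:
        raise MemoryError(hex(address))
    return Value(MEMORY[address])


def ld_s(address):
    """Speculative load: a bad address marks the result instead of faulting."""
    if address not in MEMORY:
        return Value(deferred_fault=True)
    return Value(MEMORY[address])


def add(x, y):
    """Any consumer of a marked value produces a marked value."""
    return Value(x.data + y.data, x.deferred_fault or y.deferred_fault)


def chk_s(value, recovery):
    """No-op when the value is clean; otherwise run the compensation code."""
    return recovery() if value.deferred_fault else value


def hoisted(address):
    r1 = ld_s(address)               # load moved above the branch
    use = add(r1, Value(1))          # its consumer moved up as well
    return chk_s(use, lambda: add(ld(address), Value(1)))
```

With a valid address `chk_s` does nothing; with an invalid one the fault only surfaces when the recovery code re-executes the non-speculative `ld`.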


Speculative “Advanced” Load

[Figure: hoisting a load above a store that may alias]

Before:                 After:
    inst 1                  ld.a r1=[x]
    inst 2                  inst 1
    ….                      inst 2
    st [?]                  ….
    ….                      st [?]
    ld r1=[x]               ….
    use=r1                  ld.c r1=[x]
                            use=r1

• ld.a starts the monitoring of any store to the same address as the advanced load
• If no aliasing has occurred since ld.a, ld.c is a NOP
• If aliasing has occurred, ld.c re-loads from memory
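The ld.a / ld.c protocol can be sketched with a small table that remembers which address each advanced load read. This is a toy model under stated assumptions: the class name `ALAT` follows the structure shown in the Itanium core diagram, but the dictionary-based bookkeeping and method names are illustrative only.

```python
class ALAT:
    """Toy advanced-load table: register name -> monitored address."""

    def __init__(self, memory):
        self.memory = memory
        self.registers = {}
        self.entries = {}

    def ld_a(self, reg, address):
        """Advanced load: read now and start monitoring the address."""
        self.registers[reg] = self.memory[address]
        self.entries[reg] = address

    def st(self, address, value):
        """Store: write memory and drop every entry that aliases the address."""
        self.memory[address] = value
        self.entries = {r: a for r, a in self.entries.items() if a != address}

    def ld_c(self, reg, address):
        """Check load: no-op if the entry survived, else reload. Returns True on reload."""
        if self.entries.get(reg) == address:
            return False
        self.registers[reg] = self.memory[address]
        return True
```

A store to an unrelated address leaves the entry in place, so the later `ld_c` costs nothing; a store to the same address removes it and forces the reload.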


Using Speculative Load Results

[Figure: the use of an advanced load's result is also hoisted above the potentially aliasing store]

Before:                 After:
    inst 1                  ld.a r1=[x]
    inst 2                  inst 1
    ….                      inst 2
    st [?]                  use=r1
    ….                      ….
    ld r1=[x]               st [?]
    use=r1                  ….
                            chk.a r1    (recovery code: ld r1=[x]; use=r1)
                            ….


Branch Prediction

• Static branch hints can be encoded with every branch


– taken vs. not-taken
– whether to allocate an entry in the dynamic BP hardware
• SW and HW have joint control of BP hardware
– “brp” (branch prediction) instruction can be issued ahead of the actual branch to preset the contents of BPT and BTAC
Itanium uses a 512-entry 2-level BPT and 64-entry BTAC
• TAR (Target Address Register)
– a small, fully-associative BTAC-like structure
– contents are controlled entirely by a “prepare-to-branch” inst.
– a hit in TAR overrides all other predictions
• RSB (Return Stack Buffer)
– Procedure return addr is pushed when a procedure is called and popped when it returns
– Predicts nPC when executing register-indirect branches


Register Renaming

• 128 general purpose physical integer registers
• Register names R0 to R31 are static and refer to the first 32 physical GPRs
• Register names R32 to R127 are known as “rotating registers” and are renamed onto the remaining 96 physical registers by an offset
• Remapping wraps around the rotating registers such that when offset is non-zero, physical location of R127 is just below R32

[Figure: register name vs. physical register. Names R0 to R31 map to the same physical registers; names R32 to R127 map onto physical registers R32 and up, shifted by the offset.]

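The renaming rule amounts to modular arithmetic over the 96 rotating registers. This is a minimal sketch, assuming the offset is added to the register name (the direction of rotation is an assumption here); `physical_register` is a hypothetical helper name.

```python
NUM_STATIC = 32      # R0..R31 are never renamed
NUM_ROTATING = 96    # R32..R127 rotate


def physical_register(name, offset):
    """Map an architectural register name to a physical register number."""
    if not 0 <= name < NUM_STATIC + NUM_ROTATING:
        raise ValueError("register name out of range")
    if name < NUM_STATIC:
        return name
    return NUM_STATIC + (name - NUM_STATIC + offset) % NUM_ROTATING
```

With an offset of 1, name R32 lands on physical register 33 and name R127 wraps around to physical register 32, which is the "R127 just below R32" case from the last bullet.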
Register Stack for Procedure Calls

[Figure: register stack across a procedure call, shown in three stages above the static GPRs: the caller frame (in args, locals, out args) with free registers above it; after the call, a temporary callee frame whose in args are the caller's out args; after alloc, the expanded callee frame (in args, locals, out args). In each stage the offset marks the base of the current frame; caller registers below it are live but not accessed.]

• On a procedure call, the rename offset is bumped to the beginning of output argument registers
• Callee can then allocate its own working frame (up to 96 regs)
• If there aren’t enough free regs to be allocated, HW automatically frees up space by spilling live contents not in the current frame to memory
   Register stack appears infinite to SW

Software-Assisted Memory Hierarchies

• ISA provides for separate storages for “temporal” vs “non-temporal” data, each with its own multiple levels of hierarchy
• Load and Store instructions can give hints about where cached copies should be held after a cache miss

[Figure: a temporal hierarchy (L1, L2, L3) and a non-temporal hierarchy (NT L1, NT L2, NT L3) in front of main memory; the load/store hints are temporal, non-temporal-L1, non-temporal-L2, and non-temporal-All.]


Latest Itanium Core
• 6 wide instruction fetch and issue
– 6 wide integer, 2 wide FP, 4 wide ld/st, 3 wide branch
– 1 cycle L1 data cache
– 8-stage pipeline
– 2 threads
• Memory hierarchy
– Separate L1I, L1D, L2I, and L2D
– 6MB L3
• In 4-core Tukwila: 30MB L3!

[Figure: core block diagram. Front end: L1I cache (16KB), branch prediction, instruction TLB. Issue ports: B B B I I M M M M F F. Register stack engine / re-name. Branch & predicate registers, integer registers, floating point registers. Branch unit, integer unit, memory/integer unit, floating point unit. Exception detect and instruction commit. Data TLB, ALAT, L1D cache (16KB). L2I cache (512KB), L2D cache (256KB). Queues/control, L3 cache (6MB). Intel QuickPath® interconnect interface logic.]


DBT: Dynamic Binary Translation



DBT: Dynamic Binary Translation

• Up to now, CPU hardware design


– Pipelining, superscalar, speculative execution, …
– ISA to interface hardware features (VLIW)

• What is dynamic binary translation?


– Software tool that translates binary code into other code at runtime

• How does dynamic binary translation work?


– Internals, DBT APIs, …

• Use cases
– Binary compatibility, profiling, optimizations, debugging, …


What is binary translation?

• Translating programs in one binary format to another

• What : different types of translation


– Different ISAs
• E.g. PowerPC => x86 : to port programs across platforms
l tf
– Same ISAs
• E.g. x86 => x86 : code optimization or feature instrumentation
– Intermediate Representation
• E.g. Java bytecode => x86 : to avoid interpretation

• When
– Static : translation before running programs
– Dynamic : translation while running programs



Dynamic Binary Translation Overview

Original Program

Dynamic Binary Translator

Hardware Platform

• Dynamic program modifier


– Start with interpretation or intercept program execution
– Observe a sequence of instructions
– Produce new code and save in code cache
• Change the jump/branch target address properly
– Manipulate or add instructions as needed



DBT Configurations (1)

• Cross platform
– E.g. Crusoe (Transmeta)
• Same platform
– E.g. Dynamo (HP), PIN (x86), DynamoRIO (x86)

[Figure: two software stacks. Cross platform: Application, DLL, and OS run on top of Code Morph, which runs on CPU = VLIW. Same platform: Application and DLL run on top of Dynamo alongside the OS, on CPU = HP PA-8000.]


DBT Configurations (2)

• Virtual Machine
– E.g. ESX server (VMware)
• JIT compilation
– E.g. JVM (Sun), C# (MS)

[Figure: two software stacks. Virtual machine: two guest stacks (Application, DLL, OS) run on ESX server, on CPU = x86. JIT compilation: Application runs on the JVM, which runs on DLL and OS, on the CPU.]


Why is DBT a good idea?

• Feature support without hardware/source-code modification


– E.g. binary compatibility, virtual machine, …

• Vs. static binary translation


– Access to complete program
• Programs are fully linked
– No need to re-link
• Translate instruction and jump to it
– Access to program state
• Include dynamic values
– Handle self-modifying code
– Can adapt to changes in program behavior


How does it work?

• Combination of DBT framework + tool


• DBT framework
– Analyze program
• Instruction type and arguments
• Basic block and control flow
• Register and memory values
– Provide translation primitives
• Analyze instruction
• Add/delete instruction
• Change control flow
• Read and write register values
• Read and write memory values
– Optimize the translated code



How does it work? (2)

• DBT tool
– Perform actual translation
– Provide two types of routine
• Translation(instrumentation) routine : when and what to translate
• Analysis routine : actual instructions to be translated
– Add metadata for translation/analysis routine


DBT Framework : PIN for x86

[Figure: PIN architecture. The application, the Pintool, and PIN share one address space. PIN provides the instrumentation APIs to the Pintool and contains a virtual machine (VM) with a JIT compiler and an emulation unit, plus a code cache. Everything runs above the operating system and the hardware.]


Application and DBT tool : Instruction Counter

• Instrumented application

    counter++;
    sub $0xff, %edx
    counter++;
    cmp %esi, %edx
    counter++;
    jle <L1>
    counter++;
    mov $0x1, %edi
    counter++;
    add $0x10, %eax

• DBT tool (PIN style)
– When and how to translate

    // instrumentation routine
    void Instruction (INS ins, void *v) {
        INS_InsertCall (ins, BEFORE, cnt);
    }

    // analysis routine
    void cnt ( ) {
        counter++;
    }
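The split between the two routines can be sketched without any real DBT framework. This is a toy model, assuming a program is just a list of instruction strings; `instrument` plays the role of the instrumentation routine (runs once per static instruction) and `cnt` the analysis routine (runs every time the instruction executes). None of these names are Pin APIs.

```python
PROGRAM = [
    "sub $0xff, %edx",
    "cmp %esi, %edx",
    "jle <L1>",
    "mov $0x1, %edi",
    "add $0x10, %eax",
]

stats = {"counter": 0}


def cnt():
    """Analysis routine: executed before every dynamic instruction."""
    stats["counter"] += 1


def instrument(program, analysis):
    """Instrumentation routine: decides, once per instruction, what to insert."""
    return [(analysis, ins) for ins in program]


def run(instrumented):
    """Execute the instrumented code; the toy does not model x86 semantics."""
    for analysis, _ins in instrumented:
        analysis()


instrumented = instrument(PROGRAM, cnt)
run(instrumented)
```

Running the instrumented list a second time doubles the count while `instrument` is still called only once, which is the point of separating the two routines.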


Instrumentation Options

• Instrumentation point
– Before instruction
– After instruction
• For branch, fall-through or taken
– Anywhere
• Will be optimized for performance (e.g. register spill)
• Instrumentation granularity
– Instruction
– Basic block
• Single entry, single exit
– Trace
• A group of basic blocks executed back-to-back
• Analysis routine signature
– To pass useful information to instrumented function



Translation Loop in VM

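[Figure: translation loop flowchart.
 1. Dispatch loop: is the next instruction already translated?
 2. Yes: execute the translated block, then return to the dispatch loop.
 3. No: interpret the instruction, repeating until the end of the block.
 4. At the end of the block: is this a hot path? If yes, translate it with the DBT tool; if no, return to the dispatch loop.]

* Some DBTs have no interpretation mode
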
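The dispatch loop can be sketched over a toy guest program. Everything here is an assumption made for illustration: blocks are lists of `(op, operand)` pairs acting on a small state dictionary, "translation" just precomputes a block's net effect into a Python closure, and `HOT_THRESHOLD` is an arbitrary value.

```python
HOT_THRESHOLD = 3   # assumed: translate a block after this many interpretations


def make_program():
    """Guest program: block name -> (ops, function choosing the next block)."""
    return {
        "loop": ([("add", 2), ("dec", None)],
                 lambda st: "loop" if st["n"] > 0 else "exit"),
        "exit": ([], lambda st: None),
    }


def interpret_block(ops, st):
    """Slow path: decode and execute one operation at a time."""
    for op, operand in ops:
        if op == "add":
            st["acc"] += operand
        elif op == "dec":
            st["n"] -= 1


def translate(ops):
    """Fast path: fold the block into a single closure (valid for these two ops)."""
    add_total = sum(arg for op, arg in ops if op == "add")
    dec_total = sum(1 for op, _ in ops if op == "dec")

    def translated(st):
        st["acc"] += add_total
        st["n"] -= dec_total
    return translated


def run(program, st, entry):
    code_cache, exec_count = {}, {}
    stats = {"interpreted": 0, "translated": 0}
    pc = entry
    while pc is not None:
        ops, successor = program[pc]
        if pc in code_cache:                       # already translated?
            code_cache[pc](st)
            stats["translated"] += 1
        else:
            interpret_block(ops, st)
            stats["interpreted"] += 1
            exec_count[pc] = exec_count.get(pc, 0) + 1
            if exec_count[pc] >= HOT_THRESHOLD:    # hot path?
                code_cache[pc] = translate(ops)
        pc = successor(st)
    return stats
```

Only the loop body becomes hot enough to be translated; the exit block runs once and stays interpreted, so it never occupies the code cache.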
Code Cache

• Software cache in virtual address space


– Keep translated code in memory for later reuse
– Essential to amortize the high cost of translation and of good optimizations
– Trade-off: cost of memory vs. higher reuse
• Allocation policy
– Remember every translated block: at least doubles code size
– Pick up hot path: temporal locality
• Granularity
– Basic block
• Easy, but frequent context switch between interpretation and direct execution
– Trace
• Amortize context switch overhead at the cost of path profiling


Linking Translated Basic Blocks (1)

• Context-switch overhead between interpretation and translation


– Additional instructions for context switch
– Register Spilling
– Limited DBT code optimization boundary
• E.g. Constant propagation, ILP scheduling, loop unrolling, …
• Link translated code
– Block reordering for direct branch
– Indirect branch target (IBT) table for indirect branch
• (key : value) = (original target, translated target)
• Inline preferred target
– Inline the block of the preferred target
– Add check code



Linking Translated Basic Blocks (2)

Code emitted for an indirect branch:

    <load actual target>
    <compare to inlined target>
    if equal goto <inlined target>
    look up IBT table
    if (!tag-match)
        goto <exit stub>
    jump to tag-value

Indirect Branch Target (IBT) table: Original Target → Translated Target

[Figure: the three possible destinations are <inlined target>, <exit stub>, and <translated target>.]
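The three-way dispatch for an indirect branch can be sketched as a closure. This is a toy model: translated code is represented by strings, the IBT is a dictionary, and `exit_stub` is a hypothetical callback that stands for returning to the translator on a miss.

```python
def make_indirect_branch(inlined_original, inlined_translated, ibt, exit_stub):
    """Build the dispatch sequence emitted for one indirect branch."""
    def branch(actual_target):
        if actual_target == inlined_original:      # compare to inlined target
            return inlined_translated
        translated = ibt.get(actual_target)        # look up IBT table
        if translated is None:                     # no tag match
            return exit_stub(actual_target)
        return translated                          # jump to tag value
    return branch


ibt = {0x400: "T400"}
misses = []


def exit_stub(original_target):
    """Miss path: translate on demand and remember the mapping."""
    misses.append(original_target)
    ibt[original_target] = "T%x" % original_target
    return ibt[original_target]


branch = make_indirect_branch(0x100, "T100", ibt, exit_stub)
```

The inlined comparison serves the preferred target without touching the table, and a target that misses once is found in the IBT on every later execution.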


Exceptions

• Asynchronous exceptions (interrupts)


– Can be delayed, easy
– Wait until the current translated code finishes
– Translate exception handler
– Invoke translated exception handler
• Synchronous exception
– During interpretation, no problem
– During executing translated code, either revert the instruction execution
and interpret
• E.g. checkpoint support in Crusoe’s VLIW hardware
– Or make sure to stop at exact point
• E.g. NullPointerException in Java
– Invoke translated exception handler


Self-modifying code

• Solution 1 : stick to interpretation


– Modifying code located in heap
– Always interpret when jumping to heap address

• Solution 2 : invalidate translated code


– When jumping to heap address, translate the code and cache it
– Write-protect the pages with modifying code
– Execute translated code
– If the code changes later, page-fault exception is triggered
– Invalidate corresponding translated code of the page in code cache
– Next time, the code is re-translated
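Solution 2 can be sketched with a set of write-protected pages. This is a toy model under stated assumptions: guest memory is a dictionary from address to instruction, "translation" just copies the instruction, and a write to a protected page stands in for the page-fault handler. `PAGE_SIZE` and the class name are illustrative.

```python
PAGE_SIZE = 4096


class CodeCache:
    def __init__(self, memory):
        self.memory = memory            # guest memory: address -> instruction
        self.translations = {}          # address -> translated code
        self.protected_pages = set()    # pages holding translated code
        self.translation_count = 0

    def execute(self, address):
        """Translate on first use, write-protect the page, then run the cached code."""
        if address not in self.translations:
            self.translations[address] = self.memory[address]
            self.protected_pages.add(address // PAGE_SIZE)
            self.translation_count += 1
        return self.translations[address]

    def write(self, address, value):
        """A write to a protected page models the page fault: invalidate that page."""
        page = address // PAGE_SIZE
        if page in self.protected_pages:
            self.translations = {a: t for a, t in self.translations.items()
                                 if a // PAGE_SIZE != page}
            self.protected_pages.discard(page)
        self.memory[address] = value
```

Invalidation here works at page granularity, so a write anywhere on the page discards every translation from that page, including code that was not modified.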


Example Use for Binary Compatibility:
QuickTransit (Transitive)

[Figure: QuickTransit runs a Solaris/SPARC application on an x86 Linux server. It consists of a SPARC decoder, an architecture-independent optimizing kernel, and an x86 code generator.]

• Goal : Run applications for Solaris/SPARC on Linux/x86
• Problem : Migrations are hard
– Often 50% of legacy software system should be decommissioned
• Use DBT


Example Use for Architecture Study:
Cache Simulation (1)

[Figure: the application runs on the DBT framework; through the DBT API, the DBT tool's analysis/instrumentation routine is invoked on each memory access and passes the memory address to a cache model.]

• Cache simulation
– Have cache model in virtual memory
– Instrument code to call cache model for each read/write memory instruction
– Profile cache accesses



Architecture Study : Cache Simulation (2)

Cache_t cacheHierarchy[NUM_CORE][NUM_LEVEL];

// Instrumentation routine: insert a call before every memory instruction.
// param stands for the address argument; real Pin passes it with
// IARG_MEMORYREAD_EA / IARG_MEMORYWRITE_EA.
void Instruction (INS ins, void* param) {
    if (INS_IsMemoryRead(ins)) {
        INS_InsertCall (cacheAccess, BEFORE, INT, param, BOOL, false);
    }
    if (INS_IsMemoryWrite(ins)) {
        INS_InsertCall (cacheAccess, BEFORE, INT, param, BOOL, true);
    }
}

// Analysis routine: runs at every instrumented memory access.
void cacheAccess (int address, bool isWrite) {
    // access cacheHierarchy
}
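The body of the analysis routine could look like the following cache model. This is a minimal sketch, assuming a single direct-mapped, write-allocate cache rather than the multi-core, multi-level `cacheHierarchy` of the slide; the class name and default sizes are illustrative.

```python
class DirectMappedCache:
    """Toy cache model fed by the instrumented memory accesses."""

    def __init__(self, num_lines=64, line_size=64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address, is_write):
        """Record one access; returns True on a hit. Writes allocate like reads."""
        line = address // self.line_size
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag      # fill the line, evicting the old tag
        self.misses += 1
        return False
```

In this sketch `is_write` does not change the outcome; it is kept in the signature so a write-back or write-around policy could be added without touching the instrumentation side.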


Example Use for Debugging:
memCheck (Valgrind)
• Valgrind : Heavy DBT
– Prefers functionality over performance
• Debugging, profiling, …
– Support shadow memory
• Maintains a duplicate copy of every memory byte and register used by applications
• Memcheck
– Help debugging memory errors
• Memory leak, access to freed or un-initialized data
– Per memory byte
• A (addressability) bit : if set, application can reference the byte
• V (validity) bit : if set, the data is defined
– Heap blocks for all live memory blocks

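The A and V bits can be sketched as two sets of addresses. This is a simplified model, not Valgrind's implementation: it tracks one A bit and one V bit per byte as on the slide, and it reports an uninitialized value at the load itself, which is an assumption made to keep the example short. All names are illustrative.

```python
class ShadowMemory:
    def __init__(self):
        self.a_bits = set()   # addresses the application may reference
        self.v_bits = set()   # addresses holding defined data
        self.errors = []

    def malloc(self, base, size):
        """New heap block: addressable but not yet defined."""
        for address in range(base, base + size):
            self.a_bits.add(address)
            self.v_bits.discard(address)

    def free(self, base, size):
        """Freed block: no longer addressable."""
        for address in range(base, base + size):
            self.a_bits.discard(address)
            self.v_bits.discard(address)

    def store(self, address):
        if address not in self.a_bits:
            self.errors.append(("invalid write", address))
        else:
            self.v_bits.add(address)

    def load(self, address):
        if address not in self.a_bits:
            self.errors.append(("invalid read", address))
        elif address not in self.v_bits:
            self.errors.append(("uninitialized read", address))
```

The same two sets catch three of the error classes listed above: reads of uninitialized data, accesses past a block, and accesses to freed memory.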


Example Use for Dynamic Code Optimization:
Dynamo
• Dynamic optimization by software
– Dynamic : leverage runtime information (compared with compiler)
– Software : flexible, sophisticated (compared with out-of-order execution)
• But, with time constraints
• Runtime information
– Trace, DLL, function call counter, dynamic values
• Optimizations
– Superblock : allows additional chance for classic optimization
• ILP scheduling, copy propagation, loop unrolling
– Inlining : balance code size and function call overhead
– Fast shared library invocation : remove lookup overhead
– Register allocation : reduce register spill



DBT Overheads

• DBT framework
– Translation (instrumentation) overhead
• Code cache, linking translated code, and hot trace selection

• DBT tool
– Analysis overhead
• Runtime overhead from instrumented instructions
– Work done in analysis routine
– Frequency of calls to analysis routine
– Transition to analysis routine


Reducing Analysis Routine Overhead

Instrumented application:

    counter += 3;
    sub $0xff, %edx
    cmp %esi, %edx
    jle <L1>

    counter += 2;
    mov $0x1, %edi
    add $0x10, %eax

DBT tool:

    // Instrumentation routine
    void Instruction (INS ins, void *v) {
        BBL_InsertCall (cnt, BEFORE, ARG_INT, BBL_NumIns);
    }

    // Analysis routine
    void cnt (int c) {
        counter += c;
    }

• Shift computation from analysis routine to instrumentation routine
– E.g. count the instructions of a basic block once in the instrumentation routine instead of once per instruction in the analysis routine
• Code optimization by DBT
– Redundant code elimination, register renaming, …

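The saving from per-block counting can be sketched by counting analysis calls for both schemes. This is a toy model: the two blocks mirror the example's instruction mix, and the function names and the executed path are illustrative assumptions.

```python
BLOCKS = {
    "B0": ["sub", "cmp", "jle"],
    "B1": ["mov", "add"],
}


def count_per_instruction(path):
    """One analysis call per executed instruction."""
    counter = calls = 0
    for block in path:
        for _ in BLOCKS[block]:
            counter += 1
            calls += 1
    return counter, calls


def count_per_block(path):
    """Block sizes are computed once at instrumentation time; one call per block."""
    sizes = {name: len(instructions) for name, instructions in BLOCKS.items()}
    counter = calls = 0
    for block in path:
        counter += sizes[block]
        calls += 1
    return counter, calls
```

Both functions return the same instruction count for any path, but the per-block version makes one analysis call per block instead of one per instruction.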
Reducing Frequency of Calling Analysis Routine

Trace:

    counter += 5;
    sub $0xff, %edx
    cmp %esi, %edx
    jle <L1>
    mov $0x1, %edi
    add $0x10, %eax

Side exit L1 (branch taken):

    counter -= 2;

• Instrument at larger granularity
– Instruction < basic block < trace


Reducing Analysis Transition Overhead

Single analysis routine:

    int icount = 0;

    void samplePrint (void* ip) {
        icount--;
        if (icount % 1000 == 0) {
            fprintf (trace, "%p\n", ip);
        }
    }

Split into an inlined predicate and a non-inlined body:

    int icount = 0;

    // Inlined
    bool countDown () {
        icount--;
        return icount % 1000 == 0;
    }

    // Not inlined
    void Print (void* ip) {
        fprintf (trace, "%p\n", ip);
    }

• Conditional inlining
– Separate if-then and if-else functions
• Instrumentation scheduling
– Use ANYWHERE for better register spilling
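The effect of splitting the routine can be sketched by counting how often the expensive part runs. This is a toy model: Python cannot show actual inlining, so `count_down` only stands for the cheap predicate a framework could inline and `record` for the costly call; `make_sampler` and the use of a list as the trace sink are illustrative assumptions.

```python
def make_sampler(period, sink):
    state = {"icount": 0, "slow_calls": 0}

    def count_down():
        """Cheap predicate: small enough to be inlined at every instruction."""
        state["icount"] += 1
        return state["icount"] % period == 0

    def record(ip):
        """Expensive body: reached only when the predicate is true."""
        state["slow_calls"] += 1
        sink.append(ip)

    def on_instruction(ip):
        if count_down():
            record(ip)

    return on_instruction, state
```

Over 3000 instructions with a period of 1000, the predicate runs 3000 times but the expensive body runs only three times.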


Summary

• Dynamic binary translation is a powerful software tool


– Translate binary code to another
– Same ISAs, different ISAs, IRs
• Use cases
– Binary compatibility, profiling, architecture study, debugging, virtual
machine, security, reliability, …
• Performance optimization
– Code cache, linking translated blocks, hot trace selection
– Software techniques for efficient DBT tool
– Architectural support
• An attractive solution to provide new features without hardware/source-code modification
