Co-Synthesis Algorithms: HW/SW Partitioning Examples

Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)

Winter-Spring 2001 Codesign of Embedded Systems 1
Topics

Introduction Preliminaries Hardware/Software Partitioning Distributed System Co-Synthesis
Winter-Spring 2001
Codesign of Embedded Systems
Topics

Introduction A Classification Examples
Vulcan Cosyma
Winter-Spring 2001
Introduction to HW/SW Partitioning

The first variety of co-synthesis applications Definition
A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture Multiprocessor architecture = one CPU + some ASICs on CPU bus
Usually
Winter-Spring 2001
Introduction to HW/SW Partitioning (contd)
A Terminology
Allocation
Synthesis methods which design the multiprocessor topology along with the PEs and SW architecture The process of assigning PE (CPU and/or ASICs) time to processes to get executed
Scheduling
Winter-Spring 2001
Introduction to HW/SW Partitioning (contd)
In most partitioning algorithms

Type of CPU is fixed and given ASICs must be synthesized

What function to implement on each ASIC? What characteristics should the implementation have? CDFG is the starting model
Are single-rate synthesis problems
Winter-Spring 2001
HW/SW Partitioning (contd)
Normal use of architectural components
CPU performs less computationally-intensive functions ASICs used to accelerate core functions High-performance applications
Where to use?
No CPU is fast enough for the operations ASIC accelerators allow use of much smaller, cheaper CPU
Codesign of Embedded Systems 7
Low-cost application
Winter-Spring 2001
A Classification
Criterion: Optimization Strategy
Trade-off between Performance and Cost Performance is the primary goal First, all functionality in ASICs. Progressively move more to CPU to reduce cost. Cost is the primary goal First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
Primal Approach

Dual Approach

Winter-Spring 2001
A Classification (contd)
Classification due to optimization strategy (contd)
Example co-synthesis systems

Vulcan (Stanford): Primal strategy Cosyma (Braunschweig, Germany): Dual strategy
Winter-Spring 2001

HW/SW Partitioning Examples: Vulcan
Winter-Spring 2001
10
Partitioning Examples: Vulcan

Gupta, De Micheli, Stanford University Primal approach

1. All-HW initial implementation. 2. Iteratively move functionality to CPU to reduce cost.
System specification language
HardwareC
Is compiled into a flow graph
Winter-Spring 2001
11
Partitioning Examples: Vulcan (contd)

x=a; y=b;
nop
HardwareC
1
x=a
1
y=b
cond
if (c>d) x=e; else y=f;
c>d
x=e
c<=d
y=f
HardwareC
Winter-Spring 2001 Codesign of Embedded Systems 12
Flow Graph Definition

A variation of a (single-rate) task graph Nodes

Represent operations Typically low-level operations: mult, add Represent data dependencies Each contains a Boolean condition under which the edge is traversed
Edges

Winter-Spring 2001
13
Flow Graph

is executed repeatedly at some rate can have initiation-time constraints for each node
t(vj)+lij t(vj) t(vj)+uij mi Ri Mi
can have rate constraints on each node
Winter-Spring 2001
14
Vulcan Co-synthesis Algorithm
Partitioning quantum is a thread
Algorithm divides the flow graph into threads and allocates them Thread boundary is determined by
1. (always) a non-deterministic delay element, such as wait for an external variable 2. (on choice) other points of flow graph
Target architecture
CPU + Co-processor (multiple ASICs)
Winter-Spring 2001
15
Vulcan Co-synthesis algorithm (contd)
Allocation
Primal approach
Scheduling
is done by a scheduler on the target CPU

is generated as part of synthesis process schedules all threads (both HW and SW threads)
cannot be static, due to some threads non-deterministic initiation-time
Winter-Spring 2001
16
Cost estimation
SW implementation
Code size relatively straight forward Data size Biggest challenge. Vulcan puts some effort to find bounds for each thread
HW implementation ?
Winter-Spring 2001
Performance estimation
Both SW- and HW-implementation
From flow-graph, and basic execution times for the operators
Winter-Spring 2001
18
Algorithm Details
Partitioning goal
Allocate each thread to one of two partitions

CPU Set: FS Co-processor set: FH
Required execution-rate must be met, and total cost minimized
Winter-Spring 2001
19
Algorithm Details (contd) Algorithm steps

1. Put all threads in FH set 2. Iteratively do
2.1. Move some operations to FS. 2.1.1. Select a group of operations to move to FS. 2.1.2. Check performance feasibility, by computing worst-case delay through flow-graph given the new thread times 2.1.3. Do the move, if feasible 2.2. Incrementally update the new cost-function to reflect the new partition
Winter-Spring 2001
20
Algorithm Details (contd)
Vulcan cost function

f(w) = c1Sh(FH) - c2Ss(FS) + c3B - c4P + c5|m|

c: weight constants S(): Size functions B: Bus utilization (<1) P: Processor utilization (<1) m: total number of variables to be transferred between the CPU and the co-processor
Winter-Spring 2001
21
Algorithm Details (contd) Complementary notes
A heuristic to minimize communication
Once a thread is moved to FS, its immediate successors

are placed in the list for evaluation in the next iteration.
No back-track
Once a thread is assigned to FS, it remains there
Experimental results
considerably faster implementations than all-SW, but much cheaper than all-HW designs are produced
Winter-Spring 2001

HW/SW Partitioning Examples: Cosyma
Winter-Spring 2001
23
Partitioning Examples: Cosyma
Rolf Ernst, et al: Technical University of Braunschweig, Germany Dual approach

1. All-SW initial implementation. 2. Iteratively move basic blocks to the ASIC accelerator to meet performance objective.
System specification language
Cx

Is compiled into an ESG (Extended Syntax Graph) ESG is much like a CDFG
Winter-Spring 2001
Partitioning Examples: Cosyma (contd)
Cosyma Co-synthesis Algorithm
Partitioning quantum is a Basic Block
A Basic Blocks is a branch-free block of program
Target Architecture
CPU + accelerator ASIC(s)
Scheduling Allocation Cost Estimation Performance Estimation Algorithm Details

Winter-Spring 2001
Cosyma Co-synthesis Algorithm (contd)
Performance Estimation
SW implementation
Done by examining the object code for the basic block generated by a compiler Assumes one operator per clock cycle. Creates a list schedule for the DFG of the basic block. Depth of the list gives the number of clock cycles required. Done by data-flow analysis of the adjacent basic blocks. In Shared-Memory Proportional to number of variables to be accessed
HW implementation

Communication

Winter-Spring 2001
26
Algorithm Steps
Change in execution-time caused by moving basic block b from CPU to ASIC:
Dc(b) = w( tHW(b)-tSW(b) + tcom(Z) - tcom(ZUb)) x It(b)

w: Constant weight t(b): Execution time of basic block b tcom(b): Estimated communication time between CPU and the accelerator ASIC, given a set Z of basic blocks
implemented on the ASIC It(b): Total number of times that b is executed
Winter-Spring 2001
27
Experimental Results
By moving only basic-blocks to HW

Typical speedup of only 2x Reason:
Limited intra-basic-block parallelism Implement several control-flow optimizations to increase parallelism in the basic block, and hence in ASIC Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining Speedups: 2.7 to 9.7 CPU times: 35 to 304 seconds on a typical workstation
Cure:
Result:

Winter-Spring 2001
28
What we learned today
HW/SW Partitioning: One broad category of co-synthesis algorithms Criteria by which a co-synthesis algorithm is categorized
Winter-Spring 2001
29

Co-Synthesis Algorithms: HW/SW Partitioning Examples

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Co-Synthesis Algorithms: HW/SW Partitioning Examples

Uploaded by

Copyright:

Available Formats

Co-Synthesis Algorithms: HW/SW Partitioning

Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)

Introduction Preliminaries Hardware/Software Partitioning Distributed System Co-Synthesis

Codesign of Embedded Systems

Introduction A Classification Examples

Codesign of Embedded Systems

Introduction to HW/SW Partitioning

The first variety of co-synthesis applications Definition

Codesign of Embedded Systems

Introduction to HW/SW Partitioning (contd)

Codesign of Embedded Systems

Introduction to HW/SW Partitioning (contd)

In most partitioning algorithms

Type of CPU is fixed and given ASICs must be synthesized

Are single-rate synthesis problems

Codesign of Embedded Systems

HW/SW Partitioning (contd)

Normal use of architectural components

Criterion: Optimization Strategy

Codesign of Embedded Systems

Classification due to optimization strategy (contd)

Example co-synthesis systems

Vulcan (Stanford): Primal strategy Cosyma (Braunschweig, Germany): Dual strategy

Codesign of Embedded Systems