You are on page 1of 29

Co-Synthesis Algorithms: HW/SW Partitioning

Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)


Winter-Spring 2001 Codesign of Embedded Systems 1

Topics

Introduction Preliminaries Hardware/Software Partitioning Distributed System Co-Synthesis

Winter-Spring 2001

Codesign of Embedded Systems

Topics

Introduction A Classification Examples

Vulcan Cosyma

Winter-Spring 2001

Codesign of Embedded Systems

Introduction to HW/SW Partitioning


The first variety of co-synthesis applications Definition

A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture Multiprocessor architecture = one CPU + some ASICs on CPU bus

Usually

Winter-Spring 2001

Codesign of Embedded Systems

Introduction to HW/SW Partitioning (contd)

A Terminology

Allocation

Synthesis methods which design the multiprocessor topology along with the PEs and SW architecture The process of assigning PE (CPU and/or ASICs) time to processes to get executed

Scheduling

Winter-Spring 2001

Codesign of Embedded Systems

Introduction to HW/SW Partitioning (contd)

In most partitioning algorithms


Type of CPU is fixed and given ASICs must be synthesized


What function to implement on each ASIC? What characteristics should the implementation have? CDFG is the starting model

Are single-rate synthesis problems

Winter-Spring 2001

Codesign of Embedded Systems

HW/SW Partitioning (contd)

Normal use of architectural components

CPU performs less computationally-intensive functions ASICs used to accelerate core functions High-performance applications

Where to use?

No CPU is fast enough for the operations ASIC accelerators allow use of much smaller, cheaper CPU
Codesign of Embedded Systems 7

Low-cost application

Winter-Spring 2001

A Classification

Criterion: Optimization Strategy

Trade-off between Performance and Cost Performance is the primary goal First, all functionality in ASICs. Progressively move more to CPU to reduce cost. Cost is the primary goal First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.

Primal Approach

Dual Approach

Winter-Spring 2001

Codesign of Embedded Systems

A Classification (contd)

Classification due to optimization strategy (contd)

Example co-synthesis systems


Vulcan (Stanford): Primal strategy Cosyma (Braunschweig, Germany): Dual strategy

Winter-Spring 2001

Codesign of Embedded Systems

Co-Synthesis Algorithms: HW/SW Partitioning


HW/SW Partitioning Examples: Vulcan

Winter-Spring 2001

Codesign of Embedded Systems

10

Partitioning Examples: Vulcan


Gupta, De Micheli, Stanford University Primal approach


1. All-HW initial implementation. 2. Iteratively move functionality to CPU to reduce cost.

System specification language

HardwareC

Is compiled into a flow graph

Winter-Spring 2001

Codesign of Embedded Systems

11

Partitioning Examples: Vulcan (contd)


x=a; y=b;
nop

HardwareC

1
x=a

1
y=b

cond

if (c>d) x=e; else y=f;

c>d
x=e

c<=d
y=f

HardwareC
Winter-Spring 2001 Codesign of Embedded Systems 12

Partitioning Examples: Vulcan (contd)

Flow Graph Definition


A variation of a (single-rate) task graph Nodes


Represent operations Typically low-level operations: mult, add Represent data dependencies Each contains a Boolean condition under which the edge is traversed

Edges

Winter-Spring 2001

Codesign of Embedded Systems

13

Partitioning Examples: Vulcan (contd)

Flow Graph

is executed repeatedly at some rate can have initiation-time constraints for each node

t(vj)+lij t(vj) t(vj)+uij mi Ri Mi

can have rate constraints on each node

Winter-Spring 2001

Codesign of Embedded Systems

14

Partitioning Examples: Vulcan (contd)

Vulcan Co-synthesis Algorithm

Partitioning quantum is a thread

Algorithm divides the flow graph into threads and allocates them Thread boundary is determined by
1. (always) a non-deterministic delay element, such as wait for an external variable 2. (on choice) other points of flow graph

Target architecture

CPU + Co-processor (multiple ASICs)

Winter-Spring 2001

Codesign of Embedded Systems

15

Partitioning Examples: Vulcan (contd)

Vulcan Co-synthesis algorithm (contd)

Allocation

Primal approach

Scheduling

is done by a scheduler on the target CPU


is generated as part of synthesis process schedules all threads (both HW and SW threads)

cannot be static, due to some threads non-deterministic initiation-time

Winter-Spring 2001

Codesign of Embedded Systems

16

Partitioning Examples: Vulcan (contd)

Vulcan Co-synthesis algorithm (contd)

Cost estimation

SW implementation

Code size relatively straight forward Data size Biggest challenge. Vulcan puts some effort to find bounds for each thread

HW implementation ?
Codesign of Embedded Systems 17

Winter-Spring 2001

Partitioning Examples: Vulcan (contd)

Vulcan Co-synthesis algorithm (contd)

Performance estimation

Both SW- and HW-implementation

From flow-graph, and basic execution times for the operators

Winter-Spring 2001

Codesign of Embedded Systems

18

Partitioning Examples: Vulcan (contd)

Algorithm Details

Partitioning goal

Allocate each thread to one of two partitions


CPU Set: FS Co-processor set: FH

Required execution-rate must be met, and total cost minimized

Winter-Spring 2001

Codesign of Embedded Systems

19

Partitioning Examples: Vulcan (contd)

Algorithm Details (contd) Algorithm steps


1. Put all threads in FH set 2. Iteratively do
2.1. Move some operations to FS. 2.1.1. Select a group of operations to move to FS. 2.1.2. Check performance feasibility, by computing worst-case delay through flow-graph given the new thread times 2.1.3. Do the move, if feasible 2.2. Incrementally update the new cost-function to reflect the new partition

Winter-Spring 2001

Codesign of Embedded Systems

20

Partitioning Examples: Vulcan (contd)

Algorithm Details (contd)

Vulcan cost function


f(w) = c1Sh(FH) - c2Ss(FS) + c3B - c4P + c5|m|

c: weight constants S(): Size functions B: Bus utilization (<1) P: Processor utilization (<1) m: total number of variables to be transferred between the CPU and the co-processor

Winter-Spring 2001

Codesign of Embedded Systems

21

Partitioning Examples: Vulcan (contd)

Algorithm Details (contd) Complementary notes

A heuristic to minimize communication

Once a thread is moved to FS, its immediate successors


are placed in the list for evaluation in the next iteration.

No back-track

Once a thread is assigned to FS, it remains there

Experimental results

considerably faster implementations than all-SW, but much cheaper than all-HW designs are produced
Codesign of Embedded Systems 22

Winter-Spring 2001

Co-Synthesis Algorithms: HW/SW Partitioning


HW/SW Partitioning Examples: Cosyma

Winter-Spring 2001

Codesign of Embedded Systems

23

Partitioning Examples: Cosyma

Rolf Ernst, et al: Technical University of Braunschweig, Germany Dual approach


1. All-SW initial implementation. 2. Iteratively move basic blocks to the ASIC accelerator to meet performance objective.

System specification language

Cx

Is compiled into an ESG (Extended Syntax Graph) ESG is much like a CDFG
Codesign of Embedded Systems 24

Winter-Spring 2001

Partitioning Examples: Cosyma (contd)

Cosyma Co-synthesis Algorithm

Partitioning quantum is a Basic Block

A Basic Blocks is a branch-free block of program

Target Architecture

CPU + accelerator ASIC(s)

Scheduling Allocation Cost Estimation Performance Estimation Algorithm Details


Codesign of Embedded Systems 25

Winter-Spring 2001

Partitioning Examples: Cosyma (contd)

Cosyma Co-synthesis Algorithm (contd)

Performance Estimation

SW implementation

Done by examining the object code for the basic block generated by a compiler Assumes one operator per clock cycle. Creates a list schedule for the DFG of the basic block. Depth of the list gives the number of clock cycles required. Done by data-flow analysis of the adjacent basic blocks. In Shared-Memory Proportional to number of variables to be accessed
Codesign of Embedded Systems

HW implementation

Communication

Winter-Spring 2001

26

Partitioning Examples: Cosyma (contd)

Algorithm Steps

Change in execution-time caused by moving basic block b from CPU to ASIC:

Dc(b) = w( tHW(b)-tSW(b) + tcom(Z) - tcom(ZUb)) x It(b)


w: Constant weight t(b): Execution time of basic block b tcom(b): Estimated communication time between CPU and the accelerator ASIC, given a set Z of basic blocks
implemented on the ASIC It(b): Total number of times that b is executed
Codesign of Embedded Systems

Winter-Spring 2001

27

Partitioning Examples: Cosyma (contd)

Experimental Results

By moving only basic-blocks to HW


Typical speedup of only 2x Reason:

Limited intra-basic-block parallelism Implement several control-flow optimizations to increase parallelism in the basic block, and hence in ASIC Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining Speedups: 2.7 to 9.7 CPU times: 35 to 304 seconds on a typical workstation
Codesign of Embedded Systems

Cure:

Result:

Winter-Spring 2001

28

What we learned today

HW/SW Partitioning: One broad category of co-synthesis algorithms Criteria by which a co-synthesis algorithm is categorized

Winter-Spring 2001

Codesign of Embedded Systems

29

You might also like