Winter-Spring 2001
Topics
Vulcan Cosyma
A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture.
Usually: multiprocessor architecture = one CPU + some ASICs on the CPU bus.
A Terminology
Allocation: synthesis methods which design the multiprocessor topology along with the PEs and the SW architecture.
Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes so that they get executed.
What function to implement on each ASIC?
What characteristics should the implementation have?
CDFG is the starting model.
CPU performs less computationally-intensive functions.
ASICs are used to accelerate core functions.
Where to use?
High-performance applications: no CPU is fast enough for the operations.
Low-cost applications: ASIC accelerators allow use of a much smaller, cheaper CPU.
Codesign of Embedded Systems
A Classification
Trade-off between performance and cost.
Primal approach: performance is the primary goal. First, all functionality in ASICs. Progressively move more to the CPU to reduce cost.
Dual approach: cost is the primary goal. First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
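The two approaches can be contrasted with a small greedy sketch. This is only an illustration and not the procedure of any particular tool: the `Op` record, the purely sequential timing model (total time is the sum of per-operation times), the single `deadline` constraint, and both ranking heuristics are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    sw_time: float   # execution time on the CPU (assumed known)
    hw_time: float   # execution time on the ASIC (assumed known)
    hw_cost: float   # ASIC cost if implemented in hardware, assumed > 0

def total_time(ops, in_hw):
    # Assumption: operations run one after another, so times simply add up.
    return sum(op.hw_time if op.name in in_hw else op.sw_time for op in ops)

def total_cost(ops, in_hw):
    return sum(op.hw_cost for op in ops if op.name in in_hw)

def primal(ops, deadline):
    """Start all-HW; move operations to the CPU while the deadline still holds."""
    in_hw = {op.name for op in ops}
    # Try the costliest hardware first, since removing it saves the most.
    for op in sorted(ops, key=lambda o: o.hw_cost, reverse=True):
        trial = in_hw - {op.name}
        if total_time(ops, trial) <= deadline:
            in_hw = trial
    return in_hw

def dual(ops, deadline):
    """Start all-SW; move operations to the ASIC until the deadline is met."""
    in_hw = set()
    # Prefer operations with the most time saved per unit of hardware cost.
    ranked = sorted(ops, key=lambda o: (o.sw_time - o.hw_time) / o.hw_cost,
                    reverse=True)
    for op in ranked:
        if total_time(ops, in_hw) <= deadline:
            break
        in_hw = in_hw | {op.name}
    return in_hw

ops = [Op("fir", 10, 2, 8), Op("fft", 20, 4, 12), Op("ctrl", 3, 1, 5)]
primal_hw = primal(ops, deadline=20)
dual_hw = dual(ops, deadline=20)
```

On this toy input both directions end with only `fft` in hardware, but in general they can stop at different partitions because they approach the constraint from opposite sides.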
A Classification (contd)
HardwareC
[Figure: flow graph of a HardwareC fragment. Node labels: x=a, y=b, cond, x=e, y=f. Edge labels: 1, 1, c>d, c<=d.]
Nodes: represent operations. Typically low-level operations: mult, add.
Edges: represent data dependencies. Each contains a Boolean condition under which the edge is traversed.
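A minimal sketch of such a graph as a data structure follows. The `FlowGraph` class, its method names, and the use of Python callables as edge conditions are assumptions chosen for illustration; they are not HardwareC or Vulcan constructs. The example models a two-way branch on c > d versus c <= d.

```python
from dataclasses import dataclass, field

@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)   # node id -> operation text
    edges: list = field(default_factory=list)   # (src, dst, condition)

    def add_node(self, node_id, operation):
        self.nodes[node_id] = operation

    def add_edge(self, src, dst, condition=lambda env: True):
        # The default condition is always true (an unconditional edge).
        self.edges.append((src, dst, condition))

    def successors(self, node_id, env):
        """Nodes reached from node_id over edges whose condition holds in env."""
        return [dst for src, dst, cond in self.edges
                if src == node_id and cond(env)]

g = FlowGraph()
g.add_node("cond", "compare c, d")
g.add_node("then", "x = e")
g.add_node("else", "y = f")
g.add_edge("cond", "then", lambda env: env["c"] > env["d"])
g.add_edge("cond", "else", lambda env: env["c"] <= env["d"])

taken = g.successors("cond", {"c": 5, "d": 2})
```

Storing the condition on the edge rather than on the node keeps the branch decision visible to any pass that walks the graph.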
Flow Graph
is executed repeatedly at some rate
can have initiation-time constraints for each node
Algorithm divides the flow graph into threads and allocates them.
Thread boundary is determined by:
1. (always) a non-deterministic delay element, such as a wait for an external variable
2. (on choice) other points of the flow graph
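The two boundary rules can be sketched on a simplified input. The code below assumes the flow graph has been flattened into a linear list of operations and that a delay element opens a new thread rather than closing the previous one. Both are simplifications made for the example, and the function and parameter names are hypothetical.

```python
def split_into_threads(ops, is_nondeterministic_delay, extra_boundaries=()):
    """Split a linear operation sequence into threads.

    A new thread starts at every non-deterministic delay operation (rule 1)
    and at every index listed in extra_boundaries (rule 2, by choice).
    """
    threads, current = [], []
    for i, op in enumerate(ops):
        at_boundary = is_nondeterministic_delay(op) or i in extra_boundaries
        if current and at_boundary:
            threads.append(current)
            current = []
        current.append(op)
    if current:
        threads.append(current)
    return threads

ops = ["a = in1", "b = a * 2", "wait(ext)", "c = b + ext", "d = c * c"]
is_wait = lambda op: op.startswith("wait")

threads = split_into_threads(ops, is_wait)
finer = split_into_threads(ops, is_wait, extra_boundaries={4})
```

Here `threads` contains two threads split at the wait, and `finer` adds a third, chosen boundary before the last operation.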
Target architecture
Allocation: primal approach
Scheduling:
is generated as part of the synthesis process
schedules all threads (both HW and SW threads)
Cost estimation
SW implementation
Code size: relatively straightforward.
Data size: the biggest challenge. Vulcan puts some effort into finding bounds for each thread.
HW implementation: ?
Performance estimation
Algorithm Details
Partitioning goal
c: weight constants
S(): size functions
B: bus utilization (<1)
P: processor utilization (<1)
m: total number of variables to be transferred between the CPU and the co-processor
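The formula these symbols belong to is not reproduced in this text. The sketch below only illustrates how the listed quantities could be combined, assuming a plain weighted sum in which the caller supplies signed weights. It should not be read as Vulcan's actual cost function; the function name and argument order are hypothetical.

```python
def partition_cost(weights, hw_size, sw_size, bus_util, cpu_util, num_transferred):
    """Weighted sum of the partition metrics (assumed form, not Vulcan's formula).

    weights: five constants c1..c5, one per term; a negative weight rewards
             a larger value of that term instead of penalizing it.
    """
    if not (0 <= bus_util < 1 and 0 <= cpu_util < 1):
        raise ValueError("utilizations must lie in [0, 1)")
    terms = (hw_size, sw_size, bus_util, cpu_util, num_transferred)
    if len(weights) != len(terms):
        raise ValueError("expected one weight per term")
    return sum(c * t for c, t in zip(weights, terms))

# Hypothetical numbers: HW size 100, SW size 40, B = P = 0.5, m = 8.
cost = partition_cost((1, 1, 10, 10, 0.5), 100, 40, 0.5, 0.5, 8)
```

Keeping the signs inside the weights avoids committing to which terms the real cost function rewards or penalizes.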
No back-track
Experimental results
The produced designs are considerably faster than all-SW implementations, but much cheaper than all-HW designs.
Cx
Is compiled into an ESG (Extended Syntax Graph).
ESG is much like a CDFG.
Target Architecture
Performance Estimation
SW implementation: done by examining the object code for the basic block generated by a compiler.
HW implementation: assumes one operator per clock cycle. Creates a list schedule for the DFG of the basic block. Depth of the list gives the number of clock cycles required.
Communication: done by data-flow analysis of the adjacent basic blocks. In shared memory: proportional to the number of variables to be accessed.
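The hardware estimate can be illustrated with a short sketch. It assumes an acyclic DFG given as a mapping from each operator to the operators it depends on, one clock cycle per operator, and no limit on functional units. Under that last assumption the list schedule reduces to an as-soon-as-possible levelization, so a resource-constrained scheduler could report more cycles. The function name and data layout are hypothetical.

```python
def hw_cycles(dfg):
    """Estimate clock cycles for a basic block as the depth of its DFG.

    dfg maps each operator to the list of operators it depends on.
    Each operator takes one cycle and starts as soon as its inputs are ready.
    """
    level = {}

    def depth(op):
        # Memoized: an operator sits one level below its deepest predecessor.
        if op not in level:
            level[op] = 1 + max((depth(p) for p in dfg[op]), default=0)
        return level[op]

    return max((depth(op) for op in dfg), default=0)

# y = (a * b) + (c * d); z = y - e
# The two multiplications are independent and share the first cycle.
dfg = {"m1": [], "m2": [], "add": ["m1", "m2"], "sub": ["add"]}
cycles = hw_cycles(dfg)
```

Four operators need only three cycles here because `m1` and `m2` run in parallel, which is the intra-basic-block parallelism the estimate captures.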
Algorithm Steps
w: constant weight
t(b): execution time of basic block b
tcom(b): estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
It(b): total number of times that b is executed
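The expression that uses these symbols is not reproduced in this text. As a hedged sketch, the code below assumes the cost change for moving one basic block b onto the ASIC is the weight times the change in execution time plus the change in communication time, scaled by the execution count. The split of t(b) into software and hardware times, the sign convention (negative means the move helps), and all names are assumptions, not Cosyma's published formula.

```python
def cost_increment(b, Z, w, t_sw, t_hw, t_com, iterations):
    """Assumed cost change when basic block b joins the set Z on the ASIC.

    t_sw, t_hw: dicts of per-block execution times in software and hardware
    t_com:      function mapping a set of HW blocks to communication time
    iterations: dict giving It(b), the number of times each block executes
    """
    moved = Z | {b}
    delta_exec = t_hw[b] - t_sw[b]          # time gained or lost by the move
    delta_com = t_com(moved) - t_com(Z)     # extra CPU/ASIC communication
    return w * (delta_exec + delta_com) * iterations[b]

# Hypothetical numbers: b1 is 8 time units faster in HW and adds 1 unit of
# communication per execution; it runs 100 times.
delta = cost_increment(
    "b1", set(), w=1.0,
    t_sw={"b1": 10.0}, t_hw={"b1": 2.0},
    t_com=lambda blocks: 1.0 * len(blocks),
    iterations={"b1": 100},
)
```

Multiplying by the execution count makes frequently executed blocks, such as loop bodies, dominate the decision.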
Experimental Results
Limited intra-basic-block parallelism
Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC. Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining.
Result: speedups of 2.7 to 9.7; CPU times of 35 to 304 seconds on a typical workstation.
HW/SW partitioning: one broad category of co-synthesis algorithms
Criteria by which a co-synthesis algorithm is categorized