
Des Autom Embed Syst (2011) 15: 19–50

DOI 10.1007/s10617-010-9068-9

An integrated high-level hardware/software partitioning methodology

M.B. Abdelhalim · S.E.-D. Habib

Received: 22 February 2010 / Accepted: 29 December 2010 / Published online: 8 February 2011
© Springer Science+Business Media, LLC 2011

Abstract Embedded systems are widely used in many sophisticated applications. To speed up the time-to-market cycle, hardware/software co-design has become one of the main methodologies in modern embedded system design. The most important challenge in embedded system design is partitioning, i.e., deciding which modules of the system should be implemented in hardware and which ones in software. Finding an optimal partition is hard because of the large number and different characteristics of the modules that have to be considered.
In this article, we develop a new high-level hardware/software partitioning methodology.
Two novel features characterize this methodology. Firstly, the Particle Swarm Optimiza-
tion (PSO) technique is introduced to the Hardware/Software partitioning field. Secondly,
the hardware is modeled using two extreme implementations that bound different hardware
scheduling alternatives. Our methodology further partitions the design into hardware and
software modules at the early Control-Data Flow Graph (CDFG) level of the design; thanks
to improved modeling techniques using intermediate-granularity functional modules. A new
restarting technique is applied to PSO to avoid quick convergence. This technique is called
Re-Excited PSO. Our numerical results prove the usefulness of the proposed technique.
The target technology is Field Programmable Gate Arrays (FPGAs). We developed
FPGA-based estimation techniques to evaluate the costs of implementing the design compo-
nents. These costs are the area, delay, latency, and power consumption for both the hardware
and software implementations. Hardware/software communication is also taken into consid-
eration.

This research is an extended version of IESS 2007 conference paper [5].


M.B. Abdelhalim ()
College of Computing and Information Technology, Arab Academy of Science and Technology and
Maritime Transport, P.O. Box 2033, Cairo, Egypt
e-mail: mbakr@ieee.org

S.E.-D. Habib
Electronics and Communications Department, Faculty of Engineering, Cairo University, Giza 12613, Egypt
e-mail: seraged@ieee.org

The aforementioned methodology is embodied in an integrated CAD tool for hardware/software co-design. This tool accepts a behavioral, un-timed, algorithmic-level VHDL design representation and outputs a valid hardware/software partition and schedule for the design, subject to a set of area/power/delay constraints. This tool is code-named CUPSHOP (Cairo University PSo-based Hardware/sOftware Partitioning tool). Finally, a JPEG-
encoder case study is used to validate and contrast our partitioning methodology against
the prior-art methodologies.

Keywords Hardware/software partitioning · Particle swarm optimization · Control-data flow graphs · FPGAs · Hardware/software co-design · High-level design

1 Introduction

Embedded systems typically consist of application specific hardware parts, i.e. FPGAs or
ASICs, and programmable parts, i.e., processors like DSPs or ASIPs. In comparison to the
hardware parts, the software parts are much easier and faster to develop and modify. Thus,
software is less expensive in terms of the development cost and time. Hardware, however,
provides better performance. For this reason, a system designer’s goal is a system that mini-
mizes the weighted sum of the software delay, hardware area, and power consumption. The
weights are determined by the user according to the design preferences.
HW/SW co-design deals with the problem of designing embedded systems, where au-
tomatic partitioning is one key issue. This article describes a new approach in automatic
hardware/software partitioning for single-core systems. The approach is based on the Parti-
cle Swarm Optimization (PSO) technique to solve the partitioning problem. This article is
based on a work that, to the best of our knowledge, was the first to introduce the Particle
Swarm Optimization for the HW/SW partitioning problem [4]. Several researchers have previously developed PSO restarting techniques to avoid stagnation at local minima. We
propose a modified PSO restarting technique, named the Re-excited PSO algorithm.
Our main target is to offer an alternative and efficient methodology for the hard-
ware/software partitioning problem. Our methodology acts as an early discriminator as it
helps the designer to reduce the design size by fixing the mapping of some of the design
components according to the results of our partitioning tool. To reach this goal, fast estimation techniques are proposed to accurately model the different characteristics of
the design components in the early stages of the design.
The article is organized as follows: In Sect. 2, we introduce the HW/SW partitioning
problem. Section 3 introduces the Particle Swarm Optimization technique adopted to solve
this problem. Section 4 introduces several extensions to our methodology, namely, hardware
implementation alternatives, and HW/SW communications modeling. Section 5 overviews
the proposed integrated tool and the target architecture. Section 5 also contains a preview of
High-level estimation techniques used to produce early implementation costs for hardware,
software, as well as HW/SW interface. Section 6 overviews the case study used to prove the
quality of our methodology. Finally, Sect. 7 draws the conclusions of our work.

2 HW/SW partitioning

The most important challenge in the embedded system design is partitioning; i.e. deciding
which components of the system should be implemented in hardware and which ones in soft-
ware. Finding an optimal partition is hard because of the large number and different charac-
teristics of the components to be considered. The ideal Hardware/Software partitioning tool

automatically produces a set of high-quality partitions in a short and predictable computation time and allows the designer to interact with the tool flow.
De Souza et al. [20] defined a set of features for optimized partitioning algorithms. They
classified each algorithm according to the following criteria:
1. Application domain: whether the partitioning algorithm is “multi-domain” (conceived
for more than one or any application domain, thus not considering particularities of these
domains and being technology-independent) or is a “specific-domain” algorithm.
2. The target architecture type.
3. Consideration of the HW-SW communication costs.
4. Possibility of choosing the best implementation alternative of HW nodes.
5. Possibility of sharing HW resources among two or more nodes.
6. Exploitation of HW-SW parallelism.
7. Single-mode or multi-mode systems.
Obtaining the optimum HW/SW partition requires an efficient cost function that directs
the solution search process. The cost function should also be able to handle constraints
to provide a feasible solution that meets all design specifications. Moreover, early and fast
pre-synthesis cost function estimation is critical to the success of any HW/SW partitioning
algorithm. Our HW/SW partitioning tool is built based on the above criteria.

2.1 Cost function

The cost function plays an important role in the HW/SW partitioning process. Its importance comes from the fact that the HW/SW partitioning problem contains contradicting goals, and the cost function determines which term(s) has (have) the greatest importance to the designer. Examples of design terms are hardware area, delay, power consumption, time to market, design flexibility, etc. The cost function takes a proposed partitioned solution as input and returns a measure of how good this solution is relative to the considered cost term(s). The hardware/software partitioning tool should also give the designer the flexibility to select whether one or more cost terms are used as the optimization target.
In this article, three cost terms are considered (a minimal sketch of the corresponding per-node data follows the list), namely:
Area cost: the cost of implementing a component in hardware (e.g., number of gates, area in mm², number of logic elements, etc.) or in software (e.g., CPU area).
Delay cost: the delay of a certain implementation of a component (e.g., execution delay or number of clock cycles). We consider both the HW and SW delays.
Power cost: the power consumption of a component. We consider both the HW and SW power consumptions.
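As a concrete illustration of these per-node cost terms, the following minimal Python sketch bundles them into one record; the field names and example values are our own illustration, not part of the tool.

```python
from dataclasses import dataclass

@dataclass
class NodeCost:
    """Illustrative per-node cost record (field names are hypothetical)."""
    hw_area: float    # hardware area, e.g., in logic elements
    hw_delay: float   # hardware execution delay, e.g., in clock cycles
    hw_power: float   # hardware power consumption
    sw_delay: float   # software execution delay, e.g., in clock cycles
    sw_power: float   # software power consumption

# A design is then simply a list of such records, one per node.
design = [NodeCost(12.0, 3.0, 0.4, 40.0, 0.1),
          NodeCost(30.0, 5.0, 1.2, 90.0, 0.3)]
```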

2.2 Constraints definition

In embedded systems, the constraints play an important role in the success of a design,
where hard constraints mean higher design effort and therefore a high need for automated
tools to guide the designer in critical design decisions. In most cases, the constraints are
mainly the software deadline times (for real-time systems) and the maximum available area
for hardware. For simplicity, we will refer to them as software constraint and hardware
constraint respectively.
Mann [49] divided the HW/SW partitioning problem into 5 sub-problems (P1 through P5 ).
P1 problem deals with both hardware and software constraints where both constraints must

be satisfied. P2 (P3 ) problem tries to satisfy hardware (software) constraints while minimiz-
ing software (hardware) cost. P4 problem minimizes HW/SW communications cost while
satisfying hardware and software constraints. Finally, P5 problem handles the unconstrained
partitioning problem where the goal of the partitioning process is to minimize both hardware
and software costs. The constraints affect directly the cost function. Hence, any formulation
of the cost function should consider constraints violations. Such consideration is usually
achieved by adding one or more terms to the raw cost function that is used to guide the
unconstrained partitioning problem.
Cost functions may be modified to account for design constraints. Three different tech-
niques for such modification are given in Lopez-Vallejo et al. [45], namely:
Mean Square Error minimization: This technique is useful for forcing the solution to
meet certain equality, rather than inequality, constraints.
Penalty Methods: These methods punish solutions that produce medium or large constraint violations, but allow invalid solutions close to the boundaries defined by the constraints to be considered as good solutions.
Barrier Techniques [46]: which forbid the exploration of solutions outside the allowed
design space. The barrier techniques rank the invalid solutions worse than the valid ones.
There are two common forms of the barrier techniques. The first form assigns a constant
high cost to all invalid solutions (for example infinity). This form is unable to differentiate
between near-barrier and far-barrier invalid solutions. The partitioning algorithm also needs
to be initialized with at least one valid solution, otherwise all the costs are the same (i.e. ∞)
and the algorithm fails. The other form, suggested by Mann [49], assigns a constant-base
barrier to all invalid solutions. This base barrier could be a constant larger than the maximum
cost produced by any valid solution.

2.3 Search algorithms

Traditionally, partitioning was carried out manually as in the work of Marrec et al. [50].
However, because of the increase of complexity of the systems, numerous research efforts
were directed to automate the partitioning as much as possible. The suggested partitioning approaches differ significantly according to the problem definitions they use. One
of the main differences is whether to include other tasks (such as scheduling where starting
times of the components should be determined) as in Lopez-Vallejo and Lopez [45] and
in Mei et al. [51], or just map components to hardware or software only as in the work of
Vahid [60] and Madsen et al. [48]. Some formulations assign communication events to links
between hardware and/or software units as in Jha and Dick [32].
The system to be partitioned is generally given in the form of a task graph; the graph nodes are determined by the model granularity, i.e., the semantics of a node. A node could represent a single instruction, a short sequence of instructions [57], a basic block [38], a function, or a procedure [13, 21]. A flexible granularity may also be used where a node can represent
any of the above [29, 60].
Regarding the suggested algorithms, one can differentiate between exact and heuristic
methods. Exact algorithms include, but are not limited to, branch-and-bound [14], dynamic
programming [48], and integer linear programming [21, 52]. Due to the slow performance
of the exact algorithms, heuristic-based algorithms are proposed. In particular, Genetic al-
gorithms are widely used [49, 52, 63] as well as simulated annealing [13, 24], hierarchi-
cal clustering [24], United Evolutionary Algorithm Scheme [59] and Kernighan-Lin based
algorithms [49]. Less popular heuristics are used such as Tabu search [24] and greedy al-
gorithms [15]. Some researchers used custom heuristics, such as MFMC [49], GCLP [35],
process complexity [6], the expert system presented in [45], and BUB [56].

2.4 Present challenges facing HW/SW partitioning

Currently, automatic HW/SW partitioning is not widely used in the industrial and commer-
cial tools, and is limited to in-house or academic ones. From our point of view, such a situation, despite the huge research efforts in the last two decades, is the result of
two main problems, namely, inefficient search algorithms and large search space of current
designs.
Field-Programmable Gate Arrays (FPGAs) have become increasingly popular because
recent trends indicate a faster growth of their transistor density than even general-purpose
processors. This high logic density plus the field programmability offers an inexpensive cus-
tomized VLSI implementation of circuits with fast turnaround time, making FPGAs very
lucrative as design platforms. In the last decade, FPGA vendors started to include microprocessor cores inside their FPGAs. The processors are included as hardwired cores
(PowerPC 405 processor in Xilinx Virtex-II pro, Virtex-4 and Virtex-5 devices [61] as well
as ARM 922T processor in Altera Excalibur devices [7]) or soft cores (Micro-blaze core in
Xilinx [61] as well as the NIOS-II core in Altera [9]). Such inclusion has made FPGAs perfect
platforms for HW/SW partitioning because it facilitates the design process, the interface
synthesis, co-debugging and co-verification of the designs to be partitioned. However, the
lack of FPGA-based automatic HW/SW partitioning tools complicates the FPGA-based co-
design process.
In this article, we tackle the first problem (inefficient search algorithms) by using Par-
ticle Swarm Optimization (PSO) as the search algorithm. Section 3 explains how we adopt
this algorithm for solving the HW/SW partitioning problem. In Sect. 4, we present our con-
tribution to the solution of the second problem (large search space problem) by introducing
two new approaches, namely, HW implementation alternatives and intermediate granularity
approaches.
We target ALTERA Cyclone FPGAs [10] as a hardware platform in which the software
is implemented on a soft-core processor, namely NIOS-II. Such selection is used only as an
implementation example. However, our methodology can be easily extended to other types
of FPGAs and/or processors.

3 Particle swarm optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Eberhart and Kennedy in 1995 [22, 23, 37]. The PSO algorithm is inspired by the social behavior of bird flocking, animal herding, and fish schooling. In PSO, the potential
solutions, called particles, fly through the problem space by following the current optimum
particles. PSO has been successfully applied in many areas. A comprehensive bibliography
of PSO applications could be found in the work of Poli [53].

3.1 PSO algorithm

As stated before, PSO simulates the behavior of bird flocking. Suppose the following sce-
nario: a group of birds is randomly searching for food in an area and there is only one piece
of food in the area being searched. Not all of the birds know where the food is. However, dur-
ing every iteration, they learn via their inter-communications how far the food is. Therefore,
the best strategy to find the food is to follow the bird that is nearest to the food.
PSO learned from this bird-flocking scenario, and used it to solve the optimization prob-
lems. In PSO, each single solution is a “bird” in the search space and is called a “particle”.

Fig. 1 PSO flow chart

All particles have fitness (cost) values which are evaluated by the fitness (cost) function to
be optimized, and have velocities which direct their flight. The particles fly through the
problem space by following the current optimum particles.
PSO is initialized with a group of random particles (solutions) and then searches for op-
tima by updating generations. During every iteration, each particle is updated by following
two “best” values. The first one is the position of the best solution this particle has achieved
so far. This position is called pbest and is stored in the memory of each particle. Another
“best” position that is tracked by the particle swarm optimizer is the best position obtained
so far by any particle in the population. This best position is the current global best and is
called gbest and can be considered as the best pbest solution obtained by the particles.
After finding the two best values, the particle updates its velocity and position according
to (1) and (2) respectively.

$$v^{i}_{k+1} = w\,v^{i}_{k} + c_1 r_1\left(\mathrm{pbest}_i - x^{i}_{k}\right) + c_2 r_2\left(\mathrm{gbest}_k - x^{i}_{k}\right) \qquad (1)$$
$$x^{i}_{k+1} = x^{i}_{k} + v^{i}_{k+1} \qquad (2)$$

where $v^{i}_{k+1}$ is the velocity of the $i$th particle at the $(k+1)$th iteration, and $x^{i}_{k}$ is the $i$th particle's current solution (or position). $r_1$ and $r_2$ are uniform random numbers in the range between 0 and 1. $c_1$ is the self-confidence (cognitive) factor; $c_2$ is the swarm confidence (social) factor. Usually $c_1$ and $c_2$ are in the range from 1.5 to 2.5. Finally, $w$ is the inertia factor that
takes linearly decreasing values downward from 1 to 0 according to a predefined number of
iterations as recommended by Haupt and Haupt [28].
The 1st term in (1) represents the effect of the inertia of the particle, the 2nd term repre-
sents the particle memory influence, and the 3rd term represents the swarm (society) influ-
ence. The flow chart of the procedure is shown in Fig. 1.
The velocities of the particles on each dimension may be clamped to a maximum ve-
locity Vmax , which is a parameter specified by the user. If the sum of accelerations causes
the velocity on that dimension to exceed Vmax , then this velocity is limited to Vmax [28].
Another type of clamping is to clamp the position of the current solution to a certain range
in which the solution has valid values, otherwise the solution is meaningless [28]. In this
article, position clamping is applied with no limitation on the velocity values.
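As an illustration only, the update of (1) and (2) together with position clamping could be sketched in Python as follows; the array shapes and the default c1, c2 values are assumptions, and this is not the authors' MATLAB implementation.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w, c1=2.0, c2=2.0):
    """One PSO iteration following (1) and (2), with position clamping to [0, 1].

    x, v, pbest have shape (n_particles, n_nodes); gbest has shape (n_nodes,).
    """
    r1 = np.random.rand(*x.shape)          # uniform random numbers in [0, 1)
    r2 = np.random.rand(*x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (1)
    x = x + v                                                   # Eq. (2)
    x = np.clip(x, 0.0, 1.0)   # position clamping; velocities are left unbounded
    return x, v
```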

3.1.1 Algorithm implementation—simple test case

The PSO algorithm is implemented in the MATLAB program environment. The input to the
algorithm is a design that consists of a number of nodes. Each node is associated with cost parameters. For experimental purposes, the heterogeneity of the design nodes is handled by
using random cost values that are uniformly distributed over a defined range for each design
parameter, i.e., hardware, software, and power cost values.
The used cost parameters are:
A Hardware implementation cost: uniformly and randomly generated in the range from 1
to 99 [49].
A Software implementation cost: uniformly and randomly generated in the range from 1 to
99 [49].
A Power implementation cost: uniformly and randomly generated in the range from 1 to 9.
We use a different range for Power consumption values to test the addition of other cost
terms with different range characteristics.
Consider a design consisting of m nodes. A possible solution (particle) is a vector of m
elements, where each element is associated with a given node. An element assumes a “0” value (if the node is implemented in software) or a “1” value (if the node is implemented in hardware).
There are n initial particles (solutions) that are initialized randomly. The velocity of each
node is initialized in the range from (−1) to (1), where negative velocity means moving the
particle toward 0 and positive velocity means moving the particle toward 1.
For the main loop, (1), (2) are evaluated in each loop. If the particle goes outside the
permissible region (position from 0 to 1), it will be kept on the nearest limit by the afore-
mentioned clamping technique. The used cost function is a normalized weighted sum of the
hardware, software, and power cost of each particle according to (3).
 
$$\mathrm{Cost} = 100 \times \left( \alpha\,\frac{\mathrm{HWcost}}{\mathrm{allHWcost}} + \beta\,\frac{\mathrm{SWcost}}{\mathrm{allSWcost}} + \gamma\,\frac{\mathrm{POWERcost}}{\mathrm{allPOWERcost}} \right) \qquad (3)$$

where allHWcost (allSWcost) is the Maximum Hardware (Software) cost when all nodes
are mapped to hardware (software), and allPOWERcost is the average of the power cost of
all-hardware solution and all-software solution. α, β, and γ are weighting factors that are

set by the user according to his/her critical design parameters. For the rest of this article, all
these weighting factors are set to 1 unless otherwise mentioned. The multiplication by 100
is for readability only. The HWcost (SWcost) term represents the cost of the partition im-
plemented in hardware (software); it could represent the area and the delay of the partition.
However, the software cost has a fixed term (CPU area) that is independent of the problem
in hand.
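A direct reading of (3) for this test case could look like the following Python sketch; treating the power cost as a separate value per implementation (hw_pw, sw_pw) is our assumption.

```python
import numpy as np

def cost_eq3(x, hw_cost, sw_cost, hw_pw, sw_pw, alpha=1.0, beta=1.0, gamma=1.0):
    """Normalized weighted-sum cost of (3) for a mapping vector x
    (after rounding, x[i] = 1 means hardware, 0 means software)."""
    x = np.round(x)
    hw = np.sum(hw_cost * x)                     # cost of the hardware partition
    sw = np.sum(sw_cost * (1 - x))               # cost of the software partition
    pw = np.sum(hw_pw * x + sw_pw * (1 - x))     # power of the chosen implementations
    all_hw = np.sum(hw_cost)                     # all-hardware reference cost
    all_sw = np.sum(sw_cost)                     # all-software reference cost
    all_pw = 0.5 * (np.sum(hw_pw) + np.sum(sw_pw))   # average of the two extremes
    return 100.0 * (alpha * hw / all_hw + beta * sw / all_sw + gamma * pw / all_pw)
```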
The algorithm continues as shown in Fig. 1. For simplicity, the cost value could be con-
sidered as the negation of the fitness where good solutions have low cost values.
According to (1) and (2), the particle nodes values could take any value between 0 and 1.
However, as a binary, partitioning problem, the nodes values must take values of 1 or 0.
Therefore, the position value is rounded to the nearest integer [27]. The main loop is termi-
nated when the improvement in the global best solution (gbest) remains less than a prede-
fined threshold value (ε) for a predefined number of iterations. The number of these iterations
and the value of (ε) are user controlled parameters.
The following experiment is performed on a Pentium-4 PC with 3 GHz processor speed,
1 GB RAM, and WinXP operating system. The experiment was performed using MATLAB
7. The experiment parameters are as follows:
No. of particles (Population size) n = 60, No. of design nodes m = 512 which can be
considered as a big design, ε = 100 × eps, where eps is defined in MATLAB as a very small
(numerical resolution) value and equals 2.2204 × 10−16 [26]. c1 = c2 = 2, and w starts at
1 and decreases linearly until reaching 0 after 100 iterations. Those values are suggested in
[54, 55, 62].
For more details about the results as well as a comparison between PSO and GA, inter-
ested readers can refer to our work in [4] and [5]. In general, PSO outperforms GA in all
cases. However, PSO suffers from premature convergence to local optima, where the opti-
mization process is stagnated and no improvements are obtained by running the algorithm
further. A direct solution is to cascade PSO and GA; however, this approach gave no noticeable improvement. Hence, PSO is cascaded with itself as explained in the following subsection.

3.2 Successive PSO (Re-excited PSO) algorithm

As PSO proceeds, the effect of the inertia factor $(w)$ decreases until it reaches 0. Therefore, $v^{i}_{k+1}$ at the late iterations depends only on the particle memory influence and the swarm influence (2nd and 3rd terms in (1)). Hence, the algorithm may converge to a local optimum where the 2nd and 3rd terms converge to zero and the exploration process stops.
Therefore, a modified PSO algorithm is proposed. This algorithm is based on the assump-
tion that if we save the run’s final results (particles positions), start all over a new round
of iterations with (w) = 1, re-initialize the velocity (v) with new random values, and keep
the pbest and gbest vectors in the particles memories; then, the results can be improved.
We found that the result quality improved with each new round until it settled around a
certain value.
Figure 2 plots the gbest cost in each round. The curve starts with cost ∼133 (the result of
classical single-round PSO) and settles at round number 30 with cost value ∼116.5 which
is significantly below the results obtained using single-round PSO (about 15% quality im-
provement). The algorithm performed 100 rounds but it could be modified to stop earlier by
using a termination criterion (i.e., if the result remains unchanged for a certain number of
rounds).
This new algorithm depends on re-exciting new randomized particle velocities at the
beginning of each round, while keeping the particle positions obtained so far; thus, it allows

Fig. 2 Successive improvements in Re-excited PSO

another round of domain exploration. We propose to name this successive PSO algorithm the Re-excited PSO algorithm. In nature, this algorithm resembles giving the birds a big push after they have settled in their best position. This push re-initializes the inertia and speed of the birds so they are able to explore areas unexplored before. Hence, if the birds find
a better place, they will go there; otherwise they will return to the place from which they
were pushed.
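A minimal Python sketch of this re-excitation loop is shown below; it is illustrative only, the round and iteration counts are assumptions, and the inner update mirrors the description above rather than the original MATLAB code. A cost function such as the one sketched for (3) can be passed in as cost_fn.

```python
import numpy as np

def re_excited_pso(cost_fn, n_particles, n_nodes, rounds=30, iters_per_round=100,
                   c1=2.0, c2=2.0):
    """Re-excited PSO sketch: every round re-randomizes the velocities and restarts
    the inertia schedule while keeping positions, pbest, and gbest."""
    x = np.random.rand(n_particles, n_nodes)
    pbest = x.copy()
    pbest_cost = np.array([cost_fn(p) for p in x])
    g = int(np.argmin(pbest_cost))
    gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]

    for _ in range(rounds):
        v = np.random.uniform(-1.0, 1.0, size=x.shape)   # the "push": fresh velocities
        for k in range(iters_per_round):
            w = 1.0 - k / iters_per_round                # inertia decays from 1 toward 0
            r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (1)
            x = np.clip(x + v, 0.0, 1.0)                                # Eq. (2) + clamp
            cost = np.array([cost_fn(p) for p in x])
            better = cost < pbest_cost
            pbest[better], pbest_cost[better] = x[better], cost[better]
            g = int(np.argmin(pbest_cost))
            if pbest_cost[g] < gbest_cost:
                gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    return gbest, gbest_cost
```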

3.3 Constrained problem formulation

3.3.1 Constraints modeling

As discussed in Sect. 2.2, modeling of the constraints is an important issue to any HW/SW
partitioning tool. Its importance lies in the fact that current embedded systems are bounded
by several hard constraints, e.g., area, cost, speed, and power.
As mentioned earlier, two techniques are widely used for modeling of constraints,
namely, penalty methods and barrier methods. Hence, in the following experiments, (4) is
used as the cost function. The first term in (4) represents the unconstrained cost function (3).
The second term represents the cost added due to constraint violations.
$$\mathrm{Cost}(x) = \sum_{i} k_i\,\frac{\mathrm{cost}_i(x)}{\mathrm{Totalcost}_i} + \sum_{ci} k_{ci}\,\bigl(\mathrm{Penalty\_viol}(ci,x) + \mathrm{Barrier\_viol}(ci)\bigr) \qquad (4)$$

where x is the solution vector to be evaluated, ki and kci are weighting factors (100 in our
case), i denotes the design parameters such as: area, delay, power consumption, etc., ci
denotes the constrained design parameters, Penalty_viol (ci, x) is the correction function
of the constrained parameters ci. Penalty_viol (ci, x) could be expressed in terms of the
percentage of violation defined by:

$$\mathrm{percentage\ of\ violation}(i) = \frac{\mathrm{cost}_i - \mathrm{constraint}_i}{\mathrm{constraint}_i} \qquad (5)$$
Penalty_viol (ci, x) could be the percentage of violation as in Mann [49], its squared
value [45], or any other suitable function of the percentage of violation. Note that
Penalty_viol (ci, x) will be zero if the constrained parameter is not violated.
Barrier_viol (ci) is the constant value added to the cost function if ci is violated. It is
used to guarantee that no invalid solutions surpass valid ones. The constant could be infinity

(infinity-based) or a high constant (constant-based). For the constant-based barrier technique, we add 1 for each violated design parameter. This value was chosen to be higher than the normalized cost of the design terms of (3).
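Combining (3), (4), and (5) with a second-order penalty and a constant-base barrier of 1 could be sketched as follows; the dictionary-based interface and parameter names are our own illustration.

```python
def constrained_cost(raw_cost, costs, constraints, k=100.0):
    """Add the constraint term of (4) to the unconstrained cost of (3).

    costs and constraints are dicts keyed by the constrained design parameters
    (e.g., 'hw_area', 'sw_delay'); the key names are illustrative."""
    total = raw_cost
    for name, limit in constraints.items():
        violation = (costs[name] - limit) / limit   # Eq. (5): percentage of violation
        if violation > 0:
            # k_ci * (second-order penalty + constant-base barrier of 1)
            total += k * (violation ** 2 + 1.0)
    return total
```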
Our experiments on single-constrained problem instances—P2 and P3 —showed that
combining the constant-based barrier method with any penalty method (first-order error
or second-order error term) gives higher quality solutions and guarantees that no invalid
solutions surpass valid ones. Our experiments further indicate that the second-order error
penalty method gives a slight improvement over the first-order one [5]. We used the above
conclusions and applied them to the P1 problem instance to compare GA, PSO, and
re-excited PSO performance in the presence of double constraints. The used parameters are
the same as in the unconstrained case and re-excited PSO is performed for 10 rounds only.

3.3.2 Experiments

When testing the P1 (double-constraints) problem instance, two experiments were performed. The first is a balanced-constraints problem where the maximum allowable hardware area is 45% of the area of an all-hardware solution and the maximum allowable software delay is 45% of the delay of an all-software solution. The other is an unbalanced-constraints problem where the maximum allowable hardware area is 60% of the area of an all-hardware solution and the maximum allowable software delay is 20% of the delay of an all-software solution. Note that these constraints are used to guarantee that a valid solution exists.
For the first experiment, the best valid solution cost of GA is 137 with an average cost of all best solutions around 158. Invalid solutions are obtained during 22 of the runs. For PSO,
the best valid solution cost was 128.6 with an average cost of all best solutions around 131
with valid solutions obtained during all the runs. Finally, re-excited PSO obtained a final
solution quality of 119.5. It is clear that re-excited PSO still outperforms both PSO and GA.
For the second experiment, the average solution quality of GA is around 287 and no valid solution is obtained during the runs (100 is added to the cost as a constant-base penalty since the delay constraint is always violated). For PSO, the average quality is around 251 with no valid solution obtained during the runs. Finally, for the re-excited PSO, the final solution quality is 125 (a valid solution is found in the seventh round). This
shows the performance improvement of re-excited PSO over both PSO and GA.

3.4 Real-life case study

To test the algorithm in a real-life case study, a comprehensive literature search was per-
formed to find a published work with full details about the cost function of the case study and
partitioning results. Lee et al. [41] provided such details for a case study of the well-known
Joint Picture Expert Group (JPEG) encoder system. The hardware implementation is written
in “Verilog” description language, while the software is written in “C”. The Control-Data
Flow Graph (CDFG) for this implementation is shown in Fig. 3. The authors pre-assumed
that the RGB to YUV converter is implemented in SW and is not subject to the partitioning
process. For more details regarding JPEG systems, interested readers can refer to Jons-
son [34].
Lee et al. [41] provided measured data for the considered cost metrics of the system
components. These measurements are shown in Table 1. The data is obtained through im-
plementing the hardware components targeting ML310 board using Xilinx ISE 7.1i design
platform. Xilinx Embedded Design Kit (EDK 7.1i) is used to measure the software imple-
mentation costs. The target board (ML310) contains Virtex2-Pro XC2vP30FF896 FPGA

Fig. 3 CDFG for JPEG encoding system [41]

device that contains 13696 programmable logic slices, 2448 Kbits of memory, and
two embedded IBM Power PC (PPC) processor cores. In general, one slice approximately
represents two 4-input Look-Up Tables (LUTs) and two Flip-Flops [61]. The first column
in Table 1 shows the component name along with an ID for each component. The second
and third columns show the power consumption in mW for the hardware and software im-
plementations respectively. It is worth noting that the hardware power considered is the switching power only, not the static power of the FPGA or the total power of the JPEG design.
The fourth column shows the software cost in terms of memory usage percentage while
the fifth column shows the hardware cost in terms of slices percentage. The percentages
used in the fourth and fifth columns are calculated according to the available resources in the
used FPGA. The last two columns show the execution time of the hardware and software
implementations respectively.
Lee et al. [41] also provided detailed comparison of their methodology with four other
techniques. The main problem is that the target architecture in Lee et al. [41] has two proces-
sors and allows multi-processor partitioning while our target architecture is based on a sin-
gle processor. A slight modification of our cost function is performed to allow up to two processors to run the software part concurrently. The cost function is also modified to account for the needed SW memory. Equation (4) is used with several modifications. The first term of (4) is modified by adding a memory size term, as shown in (6).

Table 1 Measured data for JPEG system [41]

Component  Power SW (mW)  Power HW (mW)  Cost SW (10−3)  Cost HW (10−3)  Time SW (µs)  Time HW (ns)

a (Level Offset) 0.096 4 0.58 7.31 9.38 155.264
b (DCT) 45 274 2.88 378 20000 1844.822
c (DCT) 45 274 2.88 378 20000 1844.822
d (DCT) 45 274 2.88 378 20000 1844.822
e (Quant.) 0.26 3 1.93 11 34.7 3512.32
f (Quant.) 0.27 3 1.93 9.64 33.44 3512.32
g (Quant.) 0.27 3 1.93 9.64 33.44 3512.32
h (DPCM) 0.957 15 0.677 2.191 0.94 5.334
i (Zigzag) 0.069 61 0.911 35 13.12 399.104
j (DPCM) 0.957 15 0.677 2.191 0.94 5.334
k (Zigzag) 0.069 61 0.911 35 13.12 399.104
l (DPCM) 0.957 15 0.677 2.191 0.94 5.334
m (Zigzag) 0.069 61 0.911 35 13.12 399.104
n (VLC) 0.321 5 14.4 7.74 2.8 2054.748
o (RLE) 0.021 3 6.034 2.56 43.12 1148.538
p (VLC) 0.321 5 14.4 8.62 2.8 2197.632
q (RLE) 0.021 3 6.034 2.56 43.12 1148.538
r (VLC) 0.321 5 14.4 8.62 2.8 2197.632
s (RLE) 0.021 3 6.034 2.56 43.12 1148.538
t (VLC) 0.018 6 16.7 19.21 51.26 2668.288
u (VLC) 0.018 6 16.7 1.91 50 2668.288
v (VLC) 0.018 6 16.7 1.91 50 2668.288

 
$$\mathrm{Cost} = 100 \times \left( \alpha\,\frac{\mathrm{HWcost}}{\mathrm{allHWcost}} + \beta\,\frac{\mathrm{SWcost}}{\mathrm{allSWcost}} + \gamma\,\frac{\mathrm{POWERcost}}{\mathrm{allPOWERcost}} + \eta\,\frac{\mathrm{MEMcost}}{\mathrm{allMEMcost}} \right) \qquad (6)$$
The added memory cost term (MEMcost) and its weight factor (η) account for the memory size (in bits). allMEMcost is the maximum (upper-bound) memory size in bits, i.e., the memory size of the all-software solution. Second, a barrier violation term is added to account for the maximum number of parallel processors. It adds one to the cost value if there are more than two software nodes in the same control step, since the cited results target multi-processor systems with only two processors running. Finally, as more than one hardware component could run in parallel, the hardware delay is not additive. Hence, we calculate the hardware delay by accumulating the maximum delay of each control step, as shown in Fig. 3. In other words, we calculate the critical-path delay.
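The critical-path accumulation and the two-processor barrier check could be sketched as follows; the list-of-levels representation of the control steps of Fig. 3 is our own illustration.

```python
def hw_critical_path_delay(levels, mapping, hw_delay):
    """Hardware nodes in the same control step (level) run in parallel, so only the
    slowest hardware node of each level contributes to the accumulated delay."""
    total = 0.0
    for level in levels:                                   # levels: list of node-index lists
        hw_nodes = [n for n in level if mapping[n] == 1]   # nodes mapped to hardware
        if hw_nodes:
            total += max(hw_delay[n] for n in hw_nodes)
    return total

def violates_processor_limit(levels, mapping, max_procs=2):
    """Barrier check: more than max_procs software nodes in one control step."""
    return any(sum(1 for n in level if mapping[n] == 0) > max_procs for level in levels)
```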
In Lee et al. [41], the results of four different algorithms were presented. However, for
the sake of brevity, details of such algorithms are beyond the scope of this article. We used
these results and compared them with our algorithm in Table 2.
In our experiments, the parameters used for the PSO are as follows: the population size
is fixed to 50 particles, a round terminates after 50 unimproved iterations, but 100 iterations
must run at the beginning to avoid early trapping in local minimum. The number of re-
excited PSO rounds is selected by the user. The power consumption is constrained to be

Table 2 Comparison of partitioning results

Method  Results (Lev/DCT/Q/DPCM-Zig/VLC-RLE/VLC)  Execution time (µs)  Memory (KB)  Slice use (%)  Power (mW)

FBP [41] 1/001/111/101111/111101/111 20022.26 51.58 53.9 581.39
GHO [40] 1/010/111/111110/111111/111 20021.66 16.507 54.7 586.069
GA [44] 0/010/010/101110/110111/010 20111.26 146.509 47.1 499.121
HOP [39] 0/100/010/101110/110111/010 20066.64 129.68 56.6 599.67
PSO-del 1/010/111/111110/111111/111 20021.66 16.507 54.7 586.069
PSO-a 0/100/001/111010/110101/010 20111.26 181.6955 44.7 494.442
PSO-p 0/100/001/111010/110101/010 20111.26 181.6955 44.7 494.442
PSO-mem 1/010/111/111110/111111/111 20021.66 16.507 54.7 586.069
PSO-NoProc 0/000/111/000000/111111/111 20030.9 34.2328 8.6 189.174
PSO-Norm 0/010/111/101110/111111/111 20030.9 19.98 50.6 521.234

less than or equal to 600 mW; area and memory are constrained to the maximum available FPGA resources, i.e., 100%; and finally the maximum number of concurrent software tasks is two.
Different configurations of the cost function are tested for different optimization goals.
PSO-del, PSO-a, PSO-p, or PSO-mem represent the case where the cost function includes
only one term, i.e., delay, area, power, or memory, respectively. PSO-NoProc is the normal
PSO-based algorithm with the cost function shown in (4) but the number of processors is
unconstrained. Finally, PSO-Norm is the normal PSO with all constraints considered, i.e.,
the same as PSO-NoProc with maximum number of two processors.
The second column in Table 2 shows the resulting partition where ‘0’ represents software
and ‘1’ represents hardware. The vector is divided into sets; each set represents a level as
shown in Fig. 3.
Regarding PSO performance, all the PSO-based results are found within two or three
rounds of the re-excited PSO. Moreover, for each individual optimization objective, PSO
obtains the best result for that specific objective. As shown in the table, the bold results are
the best results obtained for each design metric. For example, PSO-del obtains the same
results as GHO. Moreover, it outperforms the other three algorithms in the execution time
and memory utilization and it produces good quality results that meet the constraints. Hence,
our cost function formulation enables us to easily select the optimization criterion that suits
our design goals. In addition, PSO-a and PSO-p give the same results as they try to move nodes to software while meeting the power and processor-count constraints. On the other hand, PSO-del and PSO-mem try to move nodes to hardware to reduce the memory
usage, so their results are similar. PSO-NoProc is used as a what-if analysis tool, as its results
answer the question of “what is the optimum number of parallel processors that could be
used to find the optimum design?” In our case, obtaining six processors would yield the
results shown in the table, even though three of them would be used for only one task, namely,
the DCT. Finally, PSO-Norm is used when all constraints are considered. As shown in the
table, the results are by no means the best per each cost metric, in other words, it does not
give the best area or best power. However, it provides excellent balance with respect to all
contradicting goals.

4 Algorithm extensions

4.1 Modeling hardware implementation alternatives

As shown previously, HW/SW partitioning depends on the HW area, delay, and power costs
of the individual nodes. Each node represents a grain (from an instruction up to a procedure),
and the grain level is selected by the designer. The initial design is usually mapped into a
sequencing graph that describes the flow dependencies of the individual nodes. These depen-
dencies limit the maximum degree of parallelism possible between these nodes. Whereas a
sequencing graph denotes the partial order of the operations to be performed, the scheduling
of a sequencing graph determines the detailed starting time for each operation. Hence, the
scheduling task sets the actual degree of concurrency of the operations, with the attendant
delay and area costs [19]. In short, delay and area costs needed for the HW/SW partitioning
task are only known accurately after the scheduling task. Obviously, this situation calls for
time-wasteful iterations.
The other solution is to prepare a library of many implementations for each node and
select one of them during the HW/SW partitioning task, as in the work of Kalavade and Lee [36]. Again, such an approach implies a high design-time cost. Our approach to this cou-
pling between the partitioning and scheduling tasks is as follows: represent the hardware
solution of each node by two limiting solutions, HW1 and HW2 , which are automatically
generated from the functional specifications. These two limiting solutions bound the range
of all other possible schedules. The partitioning algorithm is then called on to select the best
implementation for the individual nodes: SW, HW1 or HW2 . These two limiting solutions
are:
1. Minimum-Latency solution: where As-Soon-As-Possible (ASAP) scheduling algorithm
is applied to find the fastest implementation by allowing unconstrained concurrency. This
solution allows for two alternative implementations, the first where maximum resource-
sharing is allowed. In this implementation, similar operational units are assigned to the
same operation instance whenever data precedence constraints allow. The other solution,
the non-shared parallel solution, forbids resource-sharing altogether by instantiating a
new operation instance for each operational unit. Which of these two parallel solutions
yields a lower area is difficult to predict as the multiplexer cost of the shared parallel
solution, added to control the access to the shared instances, can offset the extra area cost
of the non-shared solution. Our modeling technique selects the solution with the lower
area. This solution is, henceforth, referred to as the parallel hardware solution.
2. Maximum Latency solution: where no concurrency is allowed, or all operations are sim-
ply serialized. This solution results in the maximum hardware latency and the instantia-
tion of only one operational instance for each operation unit. This solution is, henceforth,
referred to as the serial hardware solution.
To illustrate our idea, consider a node that represents the operation y = (a × b) + (c × d).
Figure 4(a) (4(b)) shows the parallel (serial) hardware implementations.
From Fig. 4, and assuming that each operation takes only one clock cycle, the first im-
plementation finishes in 2 clock cycles (computation cycle size is 2) but needs 2 multiplier
units and one adder unit. The second implementation ends in 3 clock cycles (computation
cycle size is 3) but needs only one unit for each operation (one adder unit and one multiplier
unit). The bold horizontal lines drawn in Fig. 4 represent the clock boundaries.
In general, in our methodology, the nodes could be classified into four main classes:

Fig. 4 Two extreme implementations of y = (a × b) + (c × d)

1. Normal nodes: such as the node shown in Fig. 4 where the two extreme implementations
give different results for both area and latency. In addition, as the latency increases, the
area decreases.
2. Constant nodes: where both latency and area are constants. This case implies that there
is no resource sharing and the graph is sequential with no parallelism. For example, let
y = ((a × b) + c)/d. The area is always constant (one adder, one multiplier, and one
divider unit) while the latency is also constant (3 cycles).
3. Constant Area nodes: where the two implementations differ only in the latency while
the area is constant. For example, let y = (a × b) + (c/d). The area is always constant
(one adder, one multiplier, and one divider unit) while the latency could be 2 or 3 cycles
according to the allowed concurrency. Our tool considers only the faster implementation
as the unique implementation of this node, i.e., no alternatives.
4. Constant Latency nodes: where the two implementations differ only in the area while
the latency is constant. For example, let y = ((a × b) + c) × d. The latency is always
constant (3 cycles) while the area depends on how many multipliers are instantiated (one
shared multiplier vs. two separate multipliers). Our tool considers only the lower area
implementation as the unique implementation of this node, i.e., no alternatives.
In conclusion, our methodology is modified in order to represent normal nodes using
their two extreme implementations, if applicable.
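Given the (area, latency) pairs of the two extreme implementations, this classification reduces to a few comparisons, as in the following sketch; the example pairs are taken from the discussion of Fig. 4.

```python
def classify_node(parallel_impl, serial_impl):
    """Classify a node from its two extreme implementations, each an (area, latency) pair."""
    (p_area, p_lat), (s_area, s_lat) = parallel_impl, serial_impl
    if p_area == s_area and p_lat == s_lat:
        return "constant"          # single implementation, no alternatives
    if p_area == s_area:
        return "constant-area"     # keep only the faster implementation
    if p_lat == s_lat:
        return "constant-latency"  # keep only the lower-area implementation
    return "normal"                # both alternatives are offered to the partitioner

# y = (a*b) + (c*d): parallel needs 3 units for 2 cycles, serial needs 2 units for 3 cycles.
print(classify_node((3, 2), (2, 3)))   # -> "normal"
```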

4.2 Communications cost modeling

The Communications cost term in the context of HW/SW partitioning represents the cost
incurred due to the data and control passing from one node to another in the graph represen-
tation of the design. The communication cost is relevant only if it is between two nodes on
different sides of the HW/SW partition (i.e., one hardware node and one software node) [33].
Therefore, as communications are based on physical channels, the nature of the channel de-
termines the communication type (class). In general, the HW/SW communications can be
classified into four classes [25]:
1. Point-to-point communications
2. Bus-based communications
3. Shared memory communications
4. Network-based communications
To model the communications cost, a communication class must be selected according to
the target architecture. In general, the model should include one or more from the following
cost terms [47]:

Table 3 Cost results of different hardware alternative schemes

Scheme  Area cost  Delay cost  Comm. cost  Serial HW nodes  Parallel HW nodes  SW nodes

Serial HW 34.9% 30.52% 1.43% 99 N/A 1
Parallel HW 57.8% 29.16% 32.88% N/A 69 31
Proposed method 50.64% 23.65% 18.7% 31 55 14

1. Hardware cost: The hardware needed to implement the HW/SW interface and associated
data transfer delay on the hardware side.
2. Software cost: The delay of the software interface driver on the software side.
3. Memory size: The size of the dedicated memory and registers for control and data trans-
fers as well as shared memory size if used.
The terms could be easily modeled within the overall delay, hardware area and memory
costs of the system, as shown in (3).

4.3 Experiments

As described in Sect. 3.1.1, the input to the algorithm is a graph that consists of a number of
nodes and number of edges. Each node (edge) is associated with cost parameters. The used
cost parameters are:
Serial hardware implementation cost: the cost of implementing the node in serialized hard-
ware (i.e., area) as well as its associated latency (in clock cycles).
Parallel hardware implementation cost: the cost of implementing the node in parallel hard-
ware (i.e., area) as well as its associated latency (in clock cycles).
Software implementation cost: the cost of implementing the node in software (i.e., execu-
tion clock cycles and the CPU area).
Communication cost: the cost of the edge if it crosses the boundary between the HW and
the SW sides, i.e., interface area and delay, SW driver delay and shared memory size.
For experimental purposes, these parameters are randomly generated after considering
the characteristics of each parameter, i.e. (Serial HW area ≤ Parallel HW area), and (Par-
allel HW delay ≤ Serial HW delay ≤ SW delay). The needed modification to the original
algorithm is to allow each node in the PSO solution vector to have three values: “0” for
software, “1” for serial hardware and “2” for parallel hardware. Extending the number of
HW implementations to more than two is straightforward, however, the search space will
increase and an exponential growth in partitioning complexity is expected. The parameters
used in the implementation are: no. of particles (Population size) n = 50, no. of design
nodes (design size) m = 100, and the number of re-excited PSO rounds is set to a predefined
value = 50. All other parameters are the same as in Sect. 3.1.1. The constraints are: Max-
imum hardware area is 65% of the allHWcost area, and the maximum delay is 25% of the
allSWcost solution delay.
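Evaluating the three-valued solution vector together with the boundary-crossing edges could be sketched as follows; only the area and communication terms are shown (delay and power are handled analogously), and all names and structures are our own assumptions.

```python
import numpy as np

def area_and_comm_cost(x, serial_hw_area, parallel_hw_area, edges, comm_cost):
    """x[i] = 0 (SW), 1 (serial HW) or 2 (parallel HW). edges is a list of (src, dst)
    node pairs and comm_cost[e] the cost charged when edge e crosses the boundary."""
    x = np.rint(x).astype(int)
    area = sum(serial_hw_area[i] if m == 1 else parallel_hw_area[i] if m == 2 else 0.0
               for i, m in enumerate(x))
    # an edge is charged only when exactly one of its endpoints is in software
    comm = sum(comm_cost[e] for e, (u, v) in enumerate(edges)
               if (x[u] == 0) != (x[v] == 0))
    return area, comm
```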
Table 3 shows the results for the PSO-based partitioning assuming either a serial or a
parallel implementation. The third row in Table 3 shows the results of the proposed scheme
where the hardware is represented by two alternatives; i.e., serial and parallel alternatives.
As shown in this table, the serial hardware solution pushes approximately all nodes to
hardware (99 out of 100) but fails to meet the deadline constraint due to the relatively large
delay of the serial HW implementations. On the other hand, the parallel HW solution fails

to meet the delay constraint due to the relatively large area of the parallel HW. Moreover, it has a large communication cost. Finally, the proposed scheme meets the constraints and results
in a relatively low communication cost.

5 CUPSHOP tool flow and target architecture

Field-Programmable Gate Arrays (FPGAs) are becoming increasingly popular as recent trends indicate a faster growth of their transistor density than even general-purpose processors. This high logic density plus the field programmability offers a very lucrative, inexpensive, customized VLSI implementation platform. In the last decade, FPGA vendors started
to include microprocessor cores inside their FPGAs. Such inclusion has made FPGAs at-
tractive platforms for HW/SW partitioning as they facilitate the design process, the inter-
face synthesis, co-debugging and co-verification. However, there is a serious lack of FPGA-
based automatic HW/SW partitioning tools. For these reasons, we have selected FPGAs as
our platform.
As a proof of concept, we selected Cyclone FPGA devices as our target FPGA. The main
reason for our selection is the maturity of such devices, whose datasheets are currently at
a stable state and are not going through frequent updates and modifications [10]. This fact
makes our models as accurate as possible. Our methodology can be easily modified to target
any other FPGA and processor technologies.
In Cyclone devices, the hardware components are implemented inside the programmable-
logic area. The programmable-logic area consists of arrays of Logic Elements (LEs) that represent the finest resource unit. Each LE contains a 4-input LUT that can implement any 4-input logic function. Each LE has its own programmable register. A register bypass
multiplexer allows the output from the LE to be combinatorial or registered or both. Finally,
each LE has a dedicated Carry logic to enhance the implementation of arithmetic opera-
tions [10]. On the other hand, the software components are implemented using the NIOS-II
processor core. The NIOS-II processor is a general-purpose RISC core and is available
in three versions ranging from approximately 700 to 1800 required LEs, depending on the
desired CPU features, including caching, pipeline stages, and custom instructions. The NIOS-II processor is a true 32-bit CPU with a customizable instruction set [9].
Finally, the communications between the software and hardware sides are implemented
using a shared-bus scheme. The shared bus used is the Avalon bus. The Avalon bus is an in-
terface that specifies the port connections between master and slave components, and spec-
ifies the timing by which these components communicate. The principal design goals of
the Avalon bus are simplicity, optimized resource utilization for bus logic, and synchronous
operation [8].
Figure 5 illustrates the target architecture used in our tool. As shown in this figure, many
HW components could run in parallel. Moreover, each HW component could have a dedi-
cated local memory to reduce shared memory bottlenecks.
In the following subsections, the integrated tool flow is presented followed by the tech-
niques used to estimate the implementation costs of HW, SW, and HW/SW communications.

5.1 CUPSHOP tool

CUPSHOP is built using the MATLAB environment. The front-end of the tool is a GUI that in-
teracts with the user. The flow chart of CUPSHOP tool is shown in Fig. 6. It is divided into
three parts: the lower part is the partitioning algorithm discussed in Sect. 3. The upper part

Fig. 5 Target architecture

is the front-end that accepts a behavioral, un-timed, algorithmic-level VHDL input description. This input represents a higher level of abstraction than traditional RTL, as it supports for and while loops, variable-length shift operations, division and power operations, etc.
Moreover, no clocks are defined. This input is transformed into an intermediate format suit-
able for the partitioning process, namely, Control Data Flow Graph (CDFG) [31]. CDFG
is a Directed Acyclic Graph (DAG) used as an input format to high-level synthesis tools
as a language-oriented intermediate format. It consists of nodes that represent operations,
conditions, loops, or modules for hierarchical designs, and edges for data or control flow
between nodes. We used this format as it contains all data-flow constructs needed to model
the mathematical operations as well as the necessary control constructs, i.e. conditions and
loop statements, needed to model the control flow.
The transformation is performed by integrating the CDFG toolkit [30] within our
MATLAB-based tool. This toolkit is publicly available at http://dal.snu.ac.kr/?mid=cdfg. This CDFG toolkit has the following advantages:
1. Allows flexible HW/SW partitioning under designer control.
2. Open source and publicly available.
3. It provides simple programming interfaces (APIs) that could be used to facilitate inte-
grating the toolkit within larger platforms such as MATLAB.
Our tool also accepts designs described directly using the CDFG format. The CDFG
toolkit includes a parser that is used to obtain the important design information, i.e., the
node operation, edge bit-width, operations over constant values, and so on. Afterwards, the
design is divided into the Basic Scheduling Blocks (BSBs) that determine the partitioning
granularity. As the current trend in system-level designs is to build the system using pre-
designed functional level components [11], we selected the functional-level granularity (FIR
filters, FFT, DCT, . . . etc.) as our granularity level. The BSBs are then scheduled using the
serial and parallel scheduling techniques discussed in Sect. 4.1.
The major disadvantage of the adopted CDFG toolkit is that it accepts only VHDL inputs
which are rarely used for high-level abstraction. However, to add commonly used high-level
abstraction languages (such as C, C++, MATLAB .m language, etc.) to the accepted input
formats, only a translator to the CDFG format supported by the toolkit is needed.
The middle part of Fig. 6 is for the modeling and estimation. It is divided into three parts:
hardware, software, and HW/SW interface modeling and estimation. This part is discussed
in the following subsections.

Fig. 6 Flowchart of the integrated HW/SW partitioning tool

5.1.1 Hardware modeling and estimation

To realize the datapath of the parallel hardware implementation, we followed the typical
steps of High-Level Synthesis, namely, Scheduling, Binding, and Allocation [19]. The la-
tency is determined using ASAP scheduling algorithm of the whole design. Each operation
obtains its processing time in clock cycles from a small database that contains the latency
of each operation. For simplicity, we assume that each operation takes only one clock cycle
(no mutilcycle or chained operations). The resources allocation is divided into three steps:
operations, registers, and multiplexers allocation.
For operations allocation, after obtaining the start and finish times (control step) of each
operation, we start to count the needed number of resources in two ways: the first way is
that each new operation is assigned to a new resource instance (no resource-sharing). The
second way is to allow resource sharing for the operations that are not overlapping (do not
execute in the same control steps), and inserting a multiplexer to control the access of the
shared resources. The two ways may give two different area estimates for the same latency.
The one with the lower area is selected as the parallel hardware implementation.
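The two counting strategies could be sketched as follows; the schedule and operation-type mappings are illustrative. The lower-area variant is then selected after the multiplexer cost of the shared instances is added.

```python
from collections import Counter, defaultdict

def count_units(schedule, op_type):
    """Operations-allocation sketch: schedule[op] is the control step of op and
    op_type[op] its operation type (e.g., '+', '*'). Returns the per-type unit
    counts without sharing and with sharing."""
    non_shared = Counter(op_type.values())            # one unit per operation instance
    per_step = defaultdict(Counter)
    for op, step in schedule.items():
        per_step[step][op_type[op]] += 1              # concurrent uses per control step
    shared = Counter()
    for step_counts in per_step.values():
        for t, n in step_counts.items():
            shared[t] = max(shared[t], n)             # peak concurrency = shared units
    return non_shared, shared

# ASAP schedule of y = (a*b) + (c*d): sharing cannot reduce the two multipliers
# because both products execute in the same control step.
print(count_units({"m1": 1, "m2": 1, "add": 2}, {"m1": "*", "m2": "*", "add": "+"}))
```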
For the conditional (case) statements, the latency of the node is the maximum latency of
all its branches (cases) (as obtained from the ASAP schedule) as we consider the worst-case
latency. For the loop statements, the loop node is assigned a delay equal to the latency of the code inside the loop multiplied by the loop bound; the loop bound must be statically defined before starting the estimation process. No loop-optimization techniques are applied.
On the other hand, for the maximum-latency (serial hardware) solution, a similar ap-
proach to the one discussed above is followed except that in any control step, only one
operation (from the set of ready operations) is scheduled. The operation selection follows a First Come First Served (FCFS) scheme, where the selected operation is the earliest ready one. This scheme results in the instantiation of one resource unit for each operation type in the design and groups (binds) all the operations of each type to the instantiated unit. Note that the bit-width of the instantiated unit must accommodate the maximum bit-width of the operations bound to it. Regarding the latency, we assume that the maximum latency is the summation of the latencies of all the operations, i.e., no pipelining is allowed.
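Under these assumptions, the two latency bounds of a BSB reduce to the following MATLAB sketch, which reuses the asap_schedule helper sketched above; the predecessor-list input is again a hypothetical simplification.

function [L_serial, L_parallel] = latency_bounds(preds)
% LATENCY_BOUNDS  Maximum (serial) and minimum (parallel) latency of a BSB,
% assuming single-cycle operations and no pipelining.
[~, L_parallel] = asap_schedule(preds);   % parallel bound: ASAP schedule length
L_serial = numel(preds);                  % serial bound: one operation per control step
end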
Regarding the control logic, it is assumed to be implemented as a Finite State Machine (FSM). In our approach, we take a state boundary to be a clock boundary, so that all computations within a state (i.e., assigned to the same control step) are performed concurrently.
After the control-data path synthesis, four hardware parameters are estimated for each
BSB, namely, latency, area, delay, and power consumption. For lack of space, we give only a quick preview in the following subsection of how we estimate latency, area, and delay, while power estimation is presented in detail in Sect. 5.1.3.

5.1.2 Hardware modeling of delay, area, and latency

We apply a common estimation methodology to delay, area, latency and power estimation.
This methodology goes as follows for any of these performance metrics: we consider small
to medium granularity modules (Adders, multipliers, registers, counters, etc., which are,
henceforth, named operational units). We carry out detailed design steps down to the Place And Route (PAR) phase for these operational units, where the "optimize for speed" synthesis option and the "best effort" router option are used wherever possible. We then record the values of the performance metric considered (say, area) for these modules as a function of their global parameters (bit-width, number of inputs, etc.). Next, we derive phenomenological (area) estimation models of these operational units as a function of their global parameters using a curve-fitting approach. In effect, we build a library of estimation models for these low-granularity modules. To estimate the performance-metric values of a real system of higher complexity, the system is partitioned in terms of these low-granularity operational units, and then the metrics of the overall system are estimated.
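The characterization-and-fitting flow can be sketched in MATLAB as follows; the data points below are placeholders rather than the measured Cyclone values, and the linear model order is only an example of the best-fit curves derived with the curve-fitting tool.

% Hypothetical characterization data for one operational unit (an adder):
% bit-widths swept during characterization and the recorded post-PAR areas.
bw   = [4 8 12 16 20 24 28 32];      % precision in bits (placeholder values)
area = [5 9 13 17 21 25 29 33];      % post-PAR area in LEs (placeholder values)

% Fit a low-order polynomial area model and keep its coefficients
% in the library of estimation models.
p = polyfit(bw, area, 1);            % area(bw) ~ p(1)*bw + p(2)

% Later, the area of a 24-bit instance of this unit is estimated as:
est_area = polyval(p, 24);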
We next present specific details relevant to each of latency, area, and delay metrics.
(a) Regarding the latency, it is defined as the schedule length in clock cycles. It is deter-
mined directly from the two proposed alternative schedules.
(b) The area is estimated in terms of logic elements (LEs). For a complete literature survey
on high level area estimation, the interested reader is referred to the work by Abdelhalim
et al. [2]. Our methodology is similar to the work of Chen et al. [16]. In general, given
the target FPGA architecture, the final area of a functional unit and/or a multiplexer
are largely determined by the total number of input operands and the precision (i.e.,
bit-width) of the operation performed by the functional unit. The differences between our method and that of Chen et al. [16] are that we obtain the area estimation starting from a VHDL description, we apply some transformations and optimizations to reduce the error (e.g., detecting constant multipliers, squarers, etc.), and we target Cyclone FPGAs. We characterize the basic operations (adders, MUXs, multipliers, registers, etc.) using a curve-fitting approach to model their area characteristics. We vary the precision of each basic unit (written in RTL code) up to 32 bits. The maximum number of inputs per operation is limited to two, except for the multiplexers, where we vary the number of inputs up to 32. We run each unit through the Altera Quartus-II (Ver. 7.1) RTL synthesis and physical design tool to obtain its actual area. For simplicity, we assumed that each operation takes only one clock cycle; however, the extension is straightforward. After all the data points are collected, we use the curve-fitting tool in MATLAB to derive the best-fit area curves.
Our approach estimates the total area with an average absolute error of 3.2%. For
more details, interested readers can refer to Abdelhalim et al. [2].
(c) The delay estimation is similar to the area estimation. We are concerned with register-
to-register delay. Our approach estimates the delay with an average absolute error of
4.2%. For a complete literature survey on high level delay estimation, the formulas used
for delay calculations, as well as the used test benches, the interested reader is referred
to the work by Abdelhalim et al. [1].

5.1.3 Hardware modeling of power

For the power estimation, we are interested in finding the upper bound of the power consumption rather than its average (typical) value. This choice is made due to the need for a fast estimation process; hence, neither simulation cycles nor complete knowledge of the internal structure of each component is needed. For a complete literature survey on high-level power estimation, the interested reader is referred to the work by Abdelhalim et al. [3].
In general, FPGA power consumption has two components: static power and dynamic
power. In FPGA technology, static power could be considered a constant (60 mW for Cy-
clone EP1C6 devices); hence, we focus on how to model and estimate the dynamic power. Our power estimation methodology is the same as that used for latency, area, and delay. Note, however, that the power estimation depends on the latency, area, and delay estimates. The characterization process is performed using the Cyclone FPGA PowerPlay Early Power Estimator spreadsheet [12], changing the maximum frequency from 10 to 200 MHz, the number of logic elements (LEs) from 10 to 1000 LEs, and the toggle-rate from 10% to 100% (no glitch effects are considered). A best-fit curve-fitting approach is applied to the collected data using the curve-fitting tool inside the MATLAB 7 environment.
Throughout the following equations, f is the clock frequency in MHz, n is the number of Flip-Flops, OPs is the number of output and bidirectional pins, tog is the toggle-rate, $C_L$ is the average capacitive load of each output and bidirectional pin measured in Pico Farad (pF), $a_{\text{normal}}$ and $a_{\text{arith}}$ are the numbers of LEs in normal and arithmetic modes respectively, $\mathit{fan\_out}$ is the average fan-out of the component, and S is the global interconnects scale factor. Equation (7) represents the clock power consumption:

$P^{\text{clk}}_{\text{mW}} = [49.8\,n + 1820] \times f \times 10^{-5}$   (7)

For I/Os, PowerPlay calculates the power consumption of the output and bidirectional pins only as the input pins do not consume power from inside FPGAs. Equation (8) represents the I/O power consumption in mW for 3.3 V Single Data Rate LVTTL pins:

$P^{\text{IO}}_{\text{mW}} = [10.1\,C_L + 50.5] \times f \times OPs \times tog \times 10^{-3}$   (8)

The LEs are divided into two main categories: LEs belonging to carry chains, such as adders and multipliers (arithmetic mode), and LEs in normal mode. For LEs in a carry chain, (9) represents the power consumption in mW, while for LEs in normal mode the power consumption is given by (10):

$P^{\text{LE\_C}}_{\text{mW}} = S\,[37\,a_{\text{arith}} + 0.97] \times f \times tog \times 10^{-5}$   (9)

$P^{\text{LE}}_{\text{mW}} = S\,[(2.17\,\mathit{fan\_out} + 4.64)\,a_{\text{normal}} - 5.4] \times f \times tog \times 10^{-4}$   (10)

It is worth mentioning that not all operations have LEs in arithmetic mode. For arithmetic operations, $a_{\text{arith}}$ in (9) approximately equals the logic-element area for adders and subtractors, and half the logic-element count for general-case multipliers and dividers.
Average fan-out formulas are derived for each basic operation, such as adders, multipliers, MUXs, logic gates, etc., through the synthesis of different configurations of each component. The MATLAB 7 curve-fitting tool is used to obtain these formulas. For the sake of brevity, we include only one example of these formulas, (11), for a parallel multiplier, where m is the highest bit-width of the inputs:

$\mathit{fan\_out} = 1.61\, m^{0.2211}$   (11)
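For reference, equations (7)-(10) translate directly into the following MATLAB sketch; the summation of the four terms into a single system-level total, and the argument layout, are our own packaging and not a prescription of the tool's internal code.

function P_dyn = hw_dynamic_power(f, n, OPs, tog, C_L, a_arith, a_normal, fan_out, S)
% HW_DYNAMIC_POWER  Upper-bound dynamic power in mW from (7)-(10).
%   f: clock frequency (MHz), n: flip-flop count, OPs: output/bidirectional pins,
%   tog: toggle-rate, C_L: average pin load (pF), a_arith/a_normal: LEs in
%   arithmetic/normal mode, fan_out: average fan-out, S: interconnect scale factor.
P_clk  = (49.8*n + 1820) * f * 1e-5;                                   % (7)
P_io   = (10.1*C_L + 50.5) * f * OPs * tog * 1e-3;                     % (8)
P_le_c = S * (37*a_arith + 0.97) * f * tog * 1e-5;                     % (9)
P_le   = S * ((2.17*fan_out + 4.64)*a_normal - 5.4) * f * tog * 1e-4;  % (10)
P_dyn  = P_clk + P_io + P_le_c + P_le;   % summed here as one total (our choice)
end

The fan_out argument would itself come from formulas such as (11).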
To estimate the power consumption of the hardware alternatives discussed in Sect. 4.1,
we use the estimated values of the relevant parameters: the area (LEs in normal and arith-
metic modes), latency (in clock cycles), and maximum frequency (in MHz). The clock and
I/O power consumptions are estimated for the whole system. On the other hand, the LE power estimation is performed in a component-based fashion, where the fan-out formulas capture the local routing information. The total LE power consumption is the summation of these individual powers.
For parallel designs with no control constructs, after obtaining the estimates of area, la-
tency, maximum frequency, and fan-out, the only remaining parameter to be determined is
the toggle-rate. As we mentioned before, we are interested in the upper-bound power con-
sumption. A common practice in pre-synthesis power estimation is to assume a unity toggle-rate, i.e., all inputs and internal signals toggle at the clock rate. Such an assumption leads to a loose upper-bound estimate. We propose instead to assign a toggle-rate to each component (operational unit) according to the schedule and allocation of that component. This assumption is reasonable at early design stages, as most of the components are treated as black boxes; hence, the probabilistic (statistical) and simulation-based methods used to obtain toggle-rates at the transistor or gate levels are not applicable, since they assume detailed knowledge of the internal structure of the components. As no resource sharing is applied, each component works once during each computation cycle of the module. Hence, it is reasonable to assume a toggle-rate equal to (L/Latency) for all components, and hence for all LEs. L here denotes the delay of each component in clock cycles. For simplicity, we assumed that each block produces its results in a single clock cycle; hence, L is assumed to be unity. For example, in Fig. 4(a), each component operates once every two clock cycles, so the toggle-rate can be assumed to be 50%.
The LE power consumption term needs to be corrected to account for the global inter-
connect power losses. The scale factor term S in (9) and (10) accounts for this effect. Our
results indicate that a value of S = 2 yields the best fit between the estimated and calculated
power consumptions of our test examples. This value is close to the value obtained in Li et
al. [42] for this factor (S = 2.46).
For designs with resource sharing and control constructs, the shared component and its associated multiplexer may run more than once in a single computation cycle. Therefore, the toggle-rate can be assumed to be Active_Clocks/Latency, where Active_Clocks is the number of active clock cycles of the shared component. For example, in Fig. 4(b), the multiplier is shared and has two active clock cycles; therefore its toggle-rate is 2/3. The controlling multiplexers also have a toggle-rate of 2/3, while the adder is not shared and its toggle-rate is 1/3.
To test our approach, our results need to be compared to the actual power, evaluated at the post-PAR stage. We use the PowerPlay tool of Altera Quartus-II, in statistical mode, to obtain the actual power. Unfortunately, PowerPlay in this mode accepts only a single value that represents the overall toggle-rate of the design components. When the resources are shared, such a unified value is not acceptable. We propose to use a unified Average Toggle-Rate (ATR) that is a weighted sum of the individual components' toggle-rates. The weights are the areas, in LEs, of the individual components inside the module. Equation (12) represents the mathematical formulation of the ATR:
$\mathrm{ATR} = 100 \times \dfrac{\sum_{i=1}^{C} a_i \times AC_i}{L \times a}$   (12)

where C represents the number of components in the design, L represents the maximum latency, a represents the total area of the design, $a_i$ represents the area of component i plus the area of its associated multiplexers, and $AC_i$ represents the number of active cycles of component i. For example, in Fig. 4(b), C = 2 (adder and multiplier), L = 3 clock cycles, a = 151 LEs, $a_1 = a_{\text{mult}} + 2 \times a_{\text{mux}} = 102 + 2 \times 16 = 134$ LEs, $AC_1$ = 2 clock cycles, $a_2 = a_{\text{add}} = 17$ LEs, and $AC_2$ = 1 clock cycle. Therefore, ATR = 100 × [(134 × 2) + (17 × 1)] / (3 × 151) = 62.65%.
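A direct MATLAB transcription of (12) is given below for convenience; the argument layout is ours, not the tool's interface.

function atr = average_toggle_rate(a_i, AC_i, L, a_total)
% AVERAGE_TOGGLE_RATE  Unified Average Toggle-Rate of (12), in percent.
%   a_i(i)  : area of component i plus its associated multiplexers (LEs)
%   AC_i(i) : number of active clock cycles of component i
%   L       : maximum latency of the module in clock cycles
%   a_total : total area of the module in LEs
atr = 100 * sum(a_i(:) .* AC_i(:)) / (L * a_total);
end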
After comparing our results with those obtained from the PowerPlay tool, our approach estimates the LE, clock, and I/O power consumption values with average absolute errors of 7.7%, 6.23%, and 0.25%, respectively. The relatively large LE power consumption error is acceptable, taking into account that there is an inherent error of 3.2% in the LE area estimation as well as a delay error of 4.2%. For more details regarding the accuracy of our algorithm and the used test benches, interested readers can refer to [3].
Even though we focus on estimating the power for a specific FPGA technology, our approach could be adapted to any other technology if the following information is given (or obtained):
1. The static power dissipation of the FPGA.
2. Characterization information for the I/O, clock, and logic elements.
3. The average fan-out of each operational unit.
4. The global interconnect effect.

5.1.4 Software modeling and estimation

SW components are implemented using NIOS-II soft processor core [9]. The NIOS-II
processor is a general-purpose RISC processor core. Currently, there are three NIOS-II
cores: the NIOS-II/f "fast" core, the NIOS-II/s "standard" core, and the NIOS-II/e "economy" core. In our work, we adopt the standard core (NIOS-II/s), as it represents a middle point between the two extremes represented by the fast and economy cores. However, the modeling could easily be adapted to the other cores, as well as to other microprocessors.

Fig. 7 Average current for different instructions of the NIOS-II/s core [18]

All instructions take one or more cycles to execute. Some instructions have other penalties associated with
their execution. For example, instructions that flush the pipeline cause up to three instruc-
tions after them to be cancelled and this creates a three-cycle penalty and an execution time
of four cycles. Instructions that write to the shared-bus (Avalon bus) are stalled until the
transfer completes, as shown in the next subsection. Multipliers and dividers are built from Logic Element (LE) resources. The NIOS-II/s core employs a 5-stage pipeline.
The processor area is constant (about 1000 LEs). The area of hardware dividers and
multipliers could be added if needed. The recommended clock frequency is 80 MHz. The
latency in clock cycles is calculated similarly to the hardware serial schedule. The latency of each operation is obtained in terms of processor cycles from Altera [9]. Due to the pipelined architecture, we assume that the Cycles-Per-Instruction (CPI) equals unity. Finally, 4 extra clock cycles are added to model the initialization of the pipeline. Regarding the power consumption, we use instruction-level estimation. The instruction-level
estimation model is based on physical measurements, where the instantaneous current used
by the target processor during the execution of each instruction is measured. This model,
initially proposed by Tiwari et al. [58], executes loops of hundreds of instances of the same
instruction while measuring the average current drawn by the processor during that time.
The result of this experiment is the so-called "base cost" of each instruction, which is defined as the power consumed during the execution of that instruction.
De Holanda et al. [18] measured the consumption profile for each instruction of the
NIOS-II/S standard core processor implemented in Altera Cyclone device. Figure 7 shows
the measured current for each type of instruction in the Standard configuration of the NIOS-
II at 80 MHz. All the instructions inside the infinite loop are executed with random operands
to obtain a more representative average current consumption. De Holanda et al. [18] did not include the power consumption of the multiply and divide instructions. Therefore, to model the power consumption of the multiply and divide commands, we use the power consumption of the NOP command, which models the consumption of the pipeline stages except the execution stage. The execution stage itself is implemented in hardware, i.e., LEs; therefore, we use the hardware power estimation techniques discussed in Sect. 5.1.3 for these two instructions.
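The resulting software cost model can be sketched as follows in MATLAB; the per-class aggregation of the measured base costs is our own reading of the instruction-level model, and the argument names are illustrative only.

function [cycles, P_sw] = sw_cost(instr_count, instr_cycles, instr_mA, Vdd)
% SW_COST  Software latency (clock cycles) and power (mW) of a BSB on NIOS-II/s.
%   instr_count(k)  : number of executed instructions of class k
%   instr_cycles(k) : cycles spent per instruction of class k (CPI = 1 plus
%                     any stall/flush penalties, per the Altera handbook [9])
%   instr_mA(k)     : measured average current of class k in mA (Fig. 7, [18])
%   Vdd             : core supply voltage in volts
cycles = sum(instr_count(:) .* instr_cycles(:)) + 4;   % +4 for pipeline fill
% weighted-average current over the executed instruction mix (our assumption)
I_avg = sum(instr_count(:) .* instr_cycles(:) .* instr_mA(:)) ...
        / sum(instr_count(:) .* instr_cycles(:));
P_sw  = I_avg * Vdd;   % base-cost power in mW (mA times V)
end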
5.1.5 Hardware/software interface modeling and estimation

The communications between the SW and HW sides are implemented using a shared-bus scheme; the shared bus used is the Avalon bus. Very little work is available in the literature that addresses the high-level or abstract modeling of the Avalon bus. To the best of our knowledge, only the work done by Lin et al. [43] addresses this topic, by building a communications library in which all the interface circuits are pre-designed and stored for the HW modules. The designer then selects the appropriate interface circuit when synthesizing his/her design. In contrast to the Lin et al. [43] approach, we fix the interface and model it appropriately. Our model is divided into two main parts: the first part is used to obtain estimation formulas for the extra area and latency in both the HW and SW components due to HW/SW communications; the second part deals with the area of the communication medium itself, i.e., the Avalon switch fabric.
For the first part, the first step is to determine the suitable transfer mode for master transactions. In our system, the processor is the master; therefore, slave-write transactions represent SW-to-HW communications. As dedicated registers are allocated in the HW components such that the processor can write to them immediately, no delay cycles are needed for master transfers. Therefore, the fundamental Slave-Write Transfer is the best mode, in terms of complexity and performance, to adopt in our system [8]. Equation (13) represents the SW→HW transfer latency in clock cycles:

$\mathit{Latency} = \left\lceil \dfrac{\mathit{I/P\_Width}}{32} \right\rceil$   (13)

where I/P_Width is the sum of bit-widths of the edges that cross the HW/SW boundary to
a specific hardware-mapped node (excluding its clock input). The system controller triggers
the hardware component to run after all its data have been written, i.e., at the end of transfer.
Regarding the constant in the denominator, it represents the Avalon data bus bit-width; it is
fixed to its maximum configuration to reduce the latency. Equation (14) represents the area of the SW→HW interface in LEs:

$\mathit{Area} = \mathit{I/P\_Width} + \log_2(\mathit{delay}) + \mathit{Comp\_Area}$   (14)

The first term in (14) represents the area of the register array used to store the transferred
data. The second term represents the area consumed by the system controller to generate the
trigger signal, as a counter is instantiated to count the number of cycles during the transfer. When the counter reaches the last cycle of the transfer, it generates the trigger signal through a comparator. Finally, the area of this comparator is represented by the third term in (14); it can be estimated using our area estimation approach discussed earlier [2].
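As a sketch, equations (13) and (14) map to the following MATLAB helper; interpreting the "delay" term of (14) as the transfer latency of (13), and the slave-read variant that drops the register-array term, follow our reading of the description in this subsection.

function [lat, area_write, area_read] = avalon_transfer_cost(ip_width, comp_area)
% AVALON_TRANSFER_COST  Slave-write/read transfer latency and interface area.
%   ip_width  : sum of bit-widths of the edges crossing into the HW node
%   comp_area : area of the trigger comparator in LEs (estimated as in [2])
lat = ceil(ip_width / 32);                 % (13): 32-bit Avalon data bus
% (14): register array + transfer counter + comparator (slave-write)
area_write = ip_width + log2(lat) + comp_area;
% slave-read: no register array is needed, so the first term is dropped
area_read  = log2(lat) + comp_area;
end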
Regarding the slave-read transactions, where the processor reads data from a certain HW
component through the Avalon bus, the HW component may not be ready at that time;
therefore, the processor is stalled until the HW component is ready. The most suitable mode
in this case is the Slave-Read Transfer with Peripheral-Controlled Wait States [8]. Equa-
tion (13) is again used to represent the HW→SW transfer latency. Equation (14) is used to represent the area after removing its first term, because there is no need to allocate a register array: our scheduling techniques ensure that the results of any HW component are stored in allocated register(s). For the second part, the only term to be considered is the area of the Avalon bus itself, which scales with the number of edges crossing the
HW/SW boundary. The example system shown in Fig. 8 represents the internal structure of the Avalon bus.

Fig. 8 Avalon bus module block diagram—an example system [8]

The bus modeling could be easily done by modeling the multiplexer, arbitrator, and slave chip-select signal generator.
The multiplexer was modeled previously (area in [2] and power consumption in [3]). Its delay is not considered, as the transaction clock speed is fixed to the processor clock speed.
As our target architecture includes a single master, which is the processor, there is no need to model the arbitrator, as it is associated only with multi-master accesses.
Regarding the Chip-Select signal generator, each slave has its own Chip-Select signal. The generation of this signal depends on the address of the slave. As the address width can be configured as 8, 16, or 32 bits, as shown in Altera [8], our tool selects the suitable configuration, i.e., the address-bus width, according to the number of slaves (HW components). For example, if the number of HW components is less than or equal to 256, the address bus width will be 8 bits, and so on. The area of each generator is then modeled as that of a comparator, and the overall generator area is calculated by multiplying the comparator area by the number of generators (HW components). Finally, this area is used to obtain the power consumption from (10).
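For illustration, the address-width selection and the resulting generator area can be sketched as follows in MATLAB; comparator_area is a placeholder for the comparator model of [2], not an actual function of the tool.

function [width, gen_area] = chip_select_cost(n_slaves, comparator_area)
% CHIP_SELECT_COST  Address-bus width and total Chip-Select generator area.
%   n_slaves        : number of slaves (HW components) on the Avalon bus
%   comparator_area : area in LEs of one address comparator matching the
%                     selected width (placeholder for the model of [2])
if n_slaves <= 2^8
    width = 8;          % an 8-bit address bus suffices for up to 256 slaves
elseif n_slaves <= 2^16
    width = 16;
else
    width = 32;
end
gen_area = n_slaves * comparator_area;  % one comparator per Chip-Select signal
end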
Fig. 9 Block diagram of the JPEG encoder

6 Case study

To test the complete CUPSHOP tool, we built a modified JPEG encoder system by utilizing the experience gained in Sect. 3.4 with such systems. The modified JPEG encoder is enhanced relative to the original one by the following features: communication costs are included, hardware implementation alternatives are investigated, and the area of the microprocessor core is taken into consideration. In our methodology, we rely on quick estimates rather than lengthy measurements as in the original design. In addition, we write a single high-level description for each component, in VHDL, as it is the language supported by the CDFG toolkit, instead of writing two implementations per component: one for SW (in C, for example) and one for HW (in Verilog or VHDL, for example).
The most important modification to the JPEG encoder is that the DCT block is now divided into many components using the work done by Chen et al. [17], as shown in Fig. 9. The reason behind this modification is that in the original system, as shown in Table 1,
the DCT blocks consume a large amount of power and area in their hardware implementations (around 75% of the power consumption and around 85% of the area of all hardware implementations). In addition, the DCT is the slowest component among all software implementations (it consumes 99.27% of the total software execution time). Such an imbalance could bias the partitioning results and lead to local-optimum solutions, limited partitioning capabilities, and reduced optimization opportunities.

Table 4 Summary of our results

Area constraint | Delay constraint | Results | Area (%) | Delay (%) | Power (%) | Slack (ns)
– | – | 0/1/022/111/200/220002/000002/011/002220/002000/111/000/022 | 29.5456 | 64.4932 | 36.6392 | –
20% | – | 0/1/201/202/220/020200/020000/000/002020/200000/000/000/000 | 19.9026 | 83.2882 | 33.2111 | –
– | 20% | 1/1/222/111/112/221211/222120/111/222222/221020/111/000/000 | 60.2637 | 18.8088 | 49.4822 | 859.4772
25% | 75% | 0/0/020/222/220/201200/100000/000/000000/000010/111/000/022 | 21.4686 | 73.8557 | 29.5658 | 825.635
70% | 30% | 1/1/000/111/220/221200/001002/111/112211/222222/111/000/002 | 55.7688 | 29.2584 | 56.5141 | 535.0809
75% | 25% | 2/1/022/222/222/222212/002202/111/222122/022222/111/000/220 | 58.3083 | 23.1499 | 50.8099 | 1334.8
50% | 50% | 2/1/012/022/002/200221/200220/011/000222/000220/111/000/020 | 39.3915 | 49.6172 | 46.3545 | 276.2051
For brevity, the details of the modified description are only summarized here. The design consists of 47 components. The maximum HW area is the summation of the parallel HW areas of the individual components, and it equals 24650 LEs; the maximum delay is the summation of the SW delays, and it equals 72150 ns. Finally, the maximum power is the average of the serial HW, parallel HW, and SW maximum powers, and it equals 2315 mW.
For all the test cases, the following PSO parameters are used: population size = 50, c1 = c2 = 2, w decreases linearly from 1 to 0 over 50 iterations, the minimum number of iterations before stopping is 100, and the number of re-excited PSO rounds is fixed to 5. Finally, the termination threshold (ε) is set to 2.2204 × 10^{-14}.
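These settings, together with one illustrative particle update, are sketched below in MATLAB; the rounding-and-clamping discretization of the position update and the placeholder best-so-far positions are our own simplifications, not the exact re-excited PSO implementation described earlier.

% PSO settings used for the case study
n_nodes     = 47;                  % components in the modified JPEG encoder
n_particles = 50;                  % population size
c1 = 2;  c2 = 2;                   % cognitive / social coefficients
w  = linspace(1, 0, 50);           % inertia weight: linear decay over 50 iterations
n_rounds    = 5;                   % re-excited PSO restart rounds
eps_stop    = 2.2204e-14;          % termination threshold

% each particle is a partition vector: 0 = SW, 1 = serial HW, 2 = parallel HW
x = randi([0 2], n_particles, n_nodes);
v = zeros(n_particles, n_nodes);
pbest = x;                         % placeholder personal bests
gbest = x(1, :);                   % placeholder global best

% one illustrative velocity/position update at iteration it
it = 1;
r1 = rand(n_particles, n_nodes);  r2 = rand(n_particles, n_nodes);
v  = w(it)*v + c1*r1.*(pbest - x) + c2*r2.*(repmat(gbest, n_particles, 1) - x);
x  = min(max(round(x + v), 0), 2);    % clamp back to the {0,1,2} alphabet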
Table 4 summarizes the results. The column entitled slack shows the time difference,
in nanoseconds, between the achieved partitioned design and the deadline constraint. This
value can be used to relax the timing requirements of parallel implementations by serializing
some of their operations and applying resource-sharing to reduce the hardware area further.
The column entitled Results shows the vector representation of the components list, where 0 denotes a software mapping, 1 a serial hardware mapping, and 2 a parallel hardware mapping, in the following order (groups separated by a slash): HEADER_STRIPPER / RGB_GEN / RGB_Y,U,V / IN_BUFF_CONT / IN_BUFF / CHEN_ADD_SUB / DCT / MEM_TRANSPOSE / CHEN_ADD_SUB / DCT / QUANT / ZIGZAG / DPCM.
From the table, our tool finds feasible solutions in all cases. In addition, it is clear that the partitioning results of some nodes are unaffected by the constraints. For example, the zigzag nodes are software nodes, while the quantization nodes are serial hardware nodes. Those nodes could be removed later from the search space to reduce the problem size without affecting the solution quality; hence, an exact algorithm could easily find the optimum solution of the reduced problem. Finally, all the solutions contain serial as well as parallel hardware implementations. These results show that our approach of two bounding hardware implementations strikes a reasonable balance and avoids the need to consider all possible hardware implementations for each node.

7 Conclusions and future work

In this article, we built a high-level HW/SW partitioning tool (code named CUPSHOP)
based on an integrated high-level methodology. First, the partitioning algorithm adopted
within the methodology is the Particle Swarm Optimization technique (PSO). The PSO tech-
nique is modified such that an efficient restarting approach can be utilized to escape local-optimum solutions. We name this modification the re-excited PSO.
Efficient cost-function formulation is of paramount importance for an efficient optimization algorithm. To this end, each component in the design must have hardware as well as software implementation costs that guide the optimization algorithm. The hardware cost in our platform is modeled using two extreme implementations that bound all other schedule-dependent implementations. The communication cost between the hardware and software domains is carefully considered, in contrast to other approaches that completely ignore this term.
To handle real-life case studies, fast, accurate, pre-partitioning estimation techniques must be used to calculate the implementation costs of each design component. We proposed a novel approach to estimate the hardware implementation costs, namely, area, latency, delay, and power consumption. Our target platform consists of a single processor that communicates with the hardware component(s) via a shared bus. The processor used is the NIOS-II core, which is a soft-core processor implemented within Cyclone FPGAs. Regarding the software implementation costs, the area and clock frequency are considered constants, while the latency is obtained assuming unity Cycles Per Instruction (CPI). A modified version of a published software power estimator is used to obtain the software power consumption. From the above costs, a combined model is used to estimate the extra cost due to the extra hardware (software) resources needed to implement the communication infrastructure.
Two case studies were performed; both of them are built around the well-known JPEG en-
coder system. The first case study compares our results with published results of a HW/SW
co-designed JPEG encoder. The comparison focuses on the re-excited PSO technique only.
The results prove that our algorithm compares favorably with the published results. The
second case study addresses a modified version of the aforementioned JPEG encoder. The
main focus of this case study is to demonstrate the flexibility of our CUPSHOP tool, to stress the need for including the communications cost, and to delineate the effectiveness of using the two bounding hardware alternatives.
Our future plans are directed to overcoming two important limitations of the presented
partitioning methodology. The input description language is the first of these two limitations. Currently, the presented methodology accepts only a behavioral VHDL description of the design. Our future plans call for support of C-based input descriptions. The second important limitation is the lack of support for pipelining at the hardware level. Pipelining is a widely used, low-cost approach for speeding up hardware performance. We also plan to support pipelining in future versions of CUPSHOP.

References

1. Abdelhalim MB, Habib SE-D (2007) Fast FPGA-based delay estimation for a novel hardware/software
partitioning scheme. In: Proceedings of the 2nd international design and test workshop, Cairo, Egypt, pp
175–181
2. Abdelhalim MB, Habib SE-D (2008) Fast FPGA-based area and latency estimation for a novel hard-
ware/software partitioning scheme. In: Proceedings of the 21st Canadian conference on electrical and
computer engineering, Niagara Falls, Ontario, Canada, pp 775–779
3. Abdelhalim MB, Habib SE-D (2008) Fast hardware upper-bound power estimation for a novel FPGA-
based HW/SW partitioning scheme. In: Proceedings of the IEEE computer society annual symposium
on VLSI, Montpellier, France, pp 393–398
4. Abdelhalim MB, Salama AE, Habib SE-D (2006) Hardware software partitioning using particle swarm
optimization technique. In: Proceedings of the 6th intl workshop on SOC for real-time applications,
IWSOC’06, Cairo, Egypt, 2006, pp 189–194
5. Abdelhalim MB, Salama AE, Habib SE-D (2007) Constrained and unconstrained hardware software par-
titioning using particle swarm optimization technique. In: Proceedings of the 2nd international embedded
system symposium, Irvine, CA, USA, pp 207–220
6. Adhipathi P (2004) Model based approach to hardware/software partitioning of SOC designs. MSc The-
sis, Virginia Polytechnic Institute and State University, USA
7. Altera (2003). Apex 20 K device family architecture. Handbook
8. Altera (2005) Avalon interface specification, Version 3.1
9. Altera (2006) NIOS-II processor reference handbook
10. Altera (2007) Cyclone device handbook
11. Altera (2007) Quartus II Version 7.1 handbook—volume 4: SOPC builder
12. Altera (2008) Cyclone PowerPlay early power estimator
13. Armstrong JR, Adhipathi P, Baker JM Jr (2002) Model and synthesis directed task assignment for sys-
tems on a chip. In: 15th international conference on parallel and distributed computing systems, Cam-
bridge, USA
14. Binh NN, Imai M, Shiomi A, Hikichi N (1996) A hardware/software partitioning algorithm for designing
pipelined ASIPs with least gate counts. In: Proceedings of 33rd design automation conference, Las Vegas,
Nevada, USA, pp 527–532
15. Chatha KS, Vemuri R (2001) MAGELLAN: multiway hardware-software partitioning and scheduling for
latency minimization of hierarchical control-dataflow task graphs. In: Proceedings of the 9th international
symposium on hardware/software codesign, Copenhagen, Denmark, pp 42–47
16. Chen D, Cong J, Fan Y, Zhang Z (2007) High-level power estimation and low-power design space ex-
ploration for FPGAs. In: ASPDAC’07, Yokohama, Japan, pp 529–534
17. Chen W, Smith CH, Fralick S (1977) A fast computation algorithm for the DCT. IEEE Trans Commun
25:1004–1009
18. De Holanda JA, Assumpcao J, Wolf DF, Marques E, Cardoso JM (2007) On adapting power estimation
models for embedded soft-core processors. In: International symposium on industrial embedded systems,
Costa da Caparica, Portugal, pp 345–348
19. De Micheli G (1994) Synthesis and optimization of digital circuits. McGraw Hill, New York
20. De Souza DC, De Barros MA, Naviner LAB, Neto BGA (2003) On relevant quality criteria for optimized
partitioning methods. In: Proceedings of 45th midwest symposium on circuits and systems, Cairo, Egypt,
pp 1502–1505
21. Ditzel M (2004) Power-aware architecting for data-dominated applications. PhD thesis, Delft University
of Technology, The Netherlands
22. Eberhart RC, Kennedy J (1995) A new optimizer using particle swarm theory. In: Proceedings of the 6th
international symposium on micro-machine and human science, Nagoya, Japan, pp 39–43
23. Eberhart RC, Shi Y (2001) Particle swarm optimization: developments, applications and resources. In:
Proceedings of 2001 congress on evolutionary computation, Seoul, Korea, pp 81–86
24. Eles P, Peng Z, Kuchcinski K, Doboli A (1997) System level HW/SW partitioning based on simulated
annealing and tabu search. Des Autom Embed Syst 2(1):5–32
25. Ernest RL (1997) Target architectures. In: Staunstrup J, Wolf W (eds) Hardware/software co-design:
principles and practice. Kluwer Academic, Dordrecht, pp 113–148
26. Hanselman D, Littlefield B (2001) Mastering MATLAB 6. Prentice Hall, New York
27. Hassan R, Cohanim B, de Weck O, Venter G (2005) A comparison of particle swarm optimization and
the genetic algorithm. In: 1st AIAA multidisciplinary design optimization specialist conference, Austin,
Texas
28. Haupt RL, Haupt SE (2004) Practical genetic algorithms, 2nd edn. Wiley Interscience, New York
29. Henkel J, Ernst R (2001) An approach to automated hardware/software partitioning using a flexible
granularity that is driven by high-level estimation techniques. IEEE Trans Very Large Scale Integr (VLSI)
Syst 9(2):273–289
30. Jeon J, Ahn Y, Choi K (2002) CDFG toolkit user’s guide. Technical Report No. SNU-EE-TR-2002-8.
School of Electronic Engineering, Seoul National University, South Korea
31. Jerraya AA, Romdhani M, Valderama C, Le Marrec AP, Hessel F, Marchioro GF, Daveau JM (1997)
Languages for system-level specification and design. In: Staunstrup J, Wolf W (eds) Hardware/software
co-design: principles and practice. Kluwer Academic, Dordrecht, pp 113–148
32. Jha NK, Dick RP (1998) MOGAC: a multiobjective genetic algorithm for hardware-software co-
synthesis of distributed embedded systems. IEEE Trans Comput-Aided Des Integr Circuits Syst
17(10):920–935
33. Jigang W, Srikanthan T, Chaen G (2010) Algorithmic aspects of hardware/software partitioning: 1D
search algorithms. IEEE Trans Comput 59(4):532–544
34. Jonsson B (2005) A JPEG encoder in SystemC. MSc thesis, Lulea University of Technology, Sweden
35. Kalavade A, Lee EA (1994) A global criticality/local phase driven algorithm for the constrained hard-
ware/software partitioning problem. In: Proceedings of 3rd international workshop on hardware/software
codesign, Grenoble, France, pp 42–48
36. Kalavade A, Lee EA (2002) The extended partitioning problem: hardware-software mapping and
implementation-bin selection. In: De Micheli G, Ernest RL, Wolf W (eds) Readings in hardware/software
co-design. Morgan Kaufmann, San Mateo, pp 293–312
37. Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of IEEE international
conference on neural networks, Perth, Australia, pp 1942–1948
38. Knudsen PV, Madsen J (1996) PACE: a dynamic programming algorithm for hardware/software parti-
tioning. In: 4th international workshop on hardware/software co-design, Pittsburgh, Pennsylvania, USA,
pp 85–92
39. Lee TY, Fan YH, Cheng YM, Tsai CC, Hsiao RS (2007) Hardware-oriented partition for embedded
multiprocessor FPGA systems. In: Proceedings of the second international conference on innovative
computing, information and control, Kumamoto, Japan, pp 65–68
40. Lee TY, Fan YH, Cheng YM, Tsai CC, Hsiao RS (2007) An efficiently hardware-software partitioning
for embedded multiprocessor FPGA system. In: Proceedings of international multiconference of engi-
neers and computer scientists, Hong Kong, pp 346–351
41. Lee TY, Fan YH, Cheng YM, Tsai CC, Hsiao RS (2007) Enhancement of hardware-software partition
for embedded multiprocessor FPGA systems. In: Proceedings of the 3rd international conference on
international information hiding and multimedia signal processing, Kaohsiung, Taiwan, pp 19–22
42. Li F, Lin Y, He L, Chen D, Cong J (2005) Power modeling and characteristics of field programmable
gate arrays. IEEE Trans Comput-Aided Integr Circuits Syst 24(11):1712–1724
43. Lin F, Wang H, Bian J (2005) HW/SW interface synthesis based on Avalon bus specification for NIOS-
oriented SoC design. In: Proceedings of the international conference on field-programmable technology,
Kent Ridge Guild House, Singapore, pp 305–306
44. Lin TY, Hung YT, Chang RG (2006) Efficient hardware/software partitioning approach for embedded
multiprocessor systems. In: Proceedings of international symposium on VLSI design, automation and
test, Hsinchu, Taiwan, pp 231–234
45. Lopez-Vallejo M, Lopez JC (2003) On the hardware-software partitioning problem: system modeling
and partitioning techniques. ACM Trans Des Autom Electron Syst 8(3):269–297
46. Luenberger DG (1984) Linear and non-linear programming. Addison-Wesley, Reading
47. Luthra M, Gupta S, Dutt N, Gupta R, Nicolau A (2003) Interface synthesis using memory mapping for
an FPGA platform. In: Proceedings of the 21st international conference on computer design, San Jose,
CA, USA, pp 140–145
48. Madsen J, Gorde J, Knudsen PV, Petersen ME, Haxthausen A (1997) LYCOS: the lyngby co-synthesis
system. Des Autom Embed Syst 2(2):195–236
49. Mann ZA (2004) Partitioning algorithms for hardware/software co-design. PhD thesis, Budapest Uni-
versity of Technology and Economics, Hungary
50. Marrec PL, Valderrama CA, Hessel F, Jerraya AA, Attia M, Cayrol O (1998) Hardware, software and
mechanical cosimulation for automotive applications. In: Proceedings of 9th international workshop on
rapid system prototyping, Leuven, Belgium, pp 202–206
51. Mei B, Schaumont P, Vernalde S (2000) A hardware/software partitioning and scheduling algorithm for
dynamically reconfigurable embedded systems. In: Proceedings of 11th ProRISC, Veldhoven, Nether-
lands
52. Nieman R (1998) Hardware/software co-design for data flow dominated embedded systems. Kluwer
Academic, Dordrecht
53. Poli R (2008) Analysis of the publications on the applications of particle swarm optimization. J Artif
Evol Appl 2008:685175
54. Shi Y, Eberhart RC (1998) Parameter selection in particle swarm optimization. In: Proceedings of 7th
annual conference on evolutionary computation, New York, USA, pp 591–601
55. Shi Y, Eberhart RC (1999) Empirical study of particle swarm optimization. In: Proceedings of the 1999
congress on evolutionary computation, Washington DC, USA, pp 1945–1950
56. Stitt G (2008) Hardware/software partitioning with multi-version implementation exploration. In: Pro-
ceedings of great lakes symposium in VLSI, Orlando, Florida, USA, pp 143–146
57. Stitt G, Vahid F, McGregor G, Einloth B (2005) Hardware/software partitioning of software binaries: a
case study of H.264 decoder. In: IEEE/ACM CODES+ISSS’05, New York, USA, pp 285–290
58. Tiwari V, Malik S, Wolfe A (1994) Power analysis of embedded software: a first step towards software
power minimization. IEEE Trans Very Large Scale Integr (VLSI) Syst 2(4):437–445
59. Tong Q, Zou X, Tong H, Gao F, Zhang Q (2008) Hardware/software partitioning in embedded system
based on novel united evolutionary algorithm scheme. In: 2nd international conference on computer and
electrical engineering, Phuket Island, Thailand, pp 141–144
60. Vahid F (2002) Partitioning sequential programs for CAD using a three-step approach. ACM Trans Des
Autom Electron Syst 7(3):413–429
61. Xilinx Inc (2007) Virtex-II Pro and Virtex-II Pro X platform FPGAs: complete data sheet
62. Zheng YL, Ma LH, Zhang LY, Qian JX (2003) On the convergence analysis and parameter selection in
particle swarm optimization. In: Proceedings of the 2nd international conference on machine learning
and cybernetics, Xi-an, China, pp 1802–1807
63. Zou Y, Zhuang Z, Cheng H (2004) HW-SW partitioning based on genetic algorithm. In: Proceedings of
congress on evolutionary computation, Anhui, China, pp 628–633
