Professional Documents
Culture Documents
General Terms
Algorithms, Measurement, Performance, Design.
Keywords
RT-level power estimation, Data path optimization, FPGA power
reduction.
1. INTRODUCTION
Power optimization has attracted increased attention due to the
rapid growth of personal wireless communications, batterypowered devices and portable digital applications. Compared to
ASIC chips, FPGA chips are generally perceived as not power
efficient because they use a larger amount of transistors to
provide programmability. Large power consumption of FPGA
chips becomes a constraining factor for FPGA designs to enter
main-stream low-power applications. Our goal is to reduce the
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
ISLPED03 August 25-27, 2003, Seoul Korea.
Copyright 2003 ACM 1-58113-682-X/03/0008$5.00
power evaluation flow for the FPGA. Section 3 presents our RTlevel power estimator. Section 4 first shows the functional unit
library we build, and then it presents our simulated annealing
algorithm and MUX optimization algorithm for power reduction.
Section 5 presents the experimental data and Section 6 concludes
this paper.
BL E
#1
N
N
O u tp u ts
I
I
In p u t s
BL E
#N
C lo ck
Tri-state buffer
routing switch
Logic block
T = kN P
where T is the number of external pins of a logic network; N is
number of gates contained in the network; k is the average
number of pins per gate in the network, and p is the Rents
parameter. A series of works followed starting with Landman and
Russo in 1971 [15]. The classical work [16, 17] gives good
estimates for post-layout interconnect wire length. More recent
work improves the estimation by considering occupying
probability [18] or recursively applying Rents rule throughout an
1 l <
Region I:
i (l ) =
Routing wire
k
2
Region II:
i (l ) =
k
6
w h ere
su c h th a t
l
2 N l 2 + 2 N l )l 2 p 4
3
N l < 2 N
(2
N l )3 l 2 p4
f .o .
f .o . + 1
I (a < l < b) =
b
a
i (l ) d l
TCin (Oi , Oi +1 ) = DH ( IN i j , IN i +1 j )
(1)
j =1
K 1
(2)
j =1
TPin =
TC
i =1
in
Bit _ width ( N K 1)
2
dd
C Wire . In
equation (4), the static power of all the idle LUTs and local and
global buffers are counted in. The total power is the sum of
PDynamic and PStatic.
(3)
(4)
4. POWER OPTIMIZATION
In this section, we will first introduce our RT-level library
characterization, and then we present a simulated annealing
procedure and a MUX optimization algorithm for power
reduction.
FU
add24bit_bk
add24bit_cla
ash24bit
cmp24bit
mul18bit_nbw
mul18bit_wall
Mul18bit_wall_s2
Mul18bit_wall_s4
mux24bit_2to1
mux24bit_4to1
mux24bit_8to1
mux24bit_16to1
mux24bit_32to1
Implementation
Brent-Kung
Carry look-ahead
Arithmetic shifter
Comparator
Non-Booth-recoded
Booth-recoded Wallace
Wallace tree 2 stage
Wallace tree 4 stage
Synopsys synthesis
Synopsys synthesis
Synopsys synthesis
Synopsys synthesis
Synopsys synthesis
Area
Delay
(clb)
(ns)
(w)
42
26
67
14
288
280
286
262
6
18
66
135
276
6.09
11.78
5.33
4.17
14.55
14.78
11.43
7.14
0.57
2.34
4.60
6.91
10.93
0.016
0.010
0.023
0.003
0.246
0.308
0.164
0.114
0.002
0.005
0.023
0.083
0.240
Power
fpgaEva_LP
Area, Delay, Power
Merge: merges two bindings into one, i.e., the operations bound
to the two FUs are combined into one FU.
Split: splits one binding into two. Reverse of Merge.
Mix: selects two bindings, merge them, sort the merged
operations according to their slack, and then split the operations.
Each of these moves has its own attributes. For example, Reselect
may pick a smaller FU (possibly larger delay) for operations that
are not on critical path (slack > 0) of the CDFG without violating
latency constraint, and Mix may lead to rebinding the operations
that have larger slacks into a pipe-lined function unit such as
Mul18bit_wall_s4. Split will be disabled when the temperature is
low so the binding solution will not be dramatically changed.
After each move, a list scheduling is called to verify the total
latency. Then, the left edge algorithm is used for register binding
followed by MUX generation. The total amount of CLBs is
estimated through the FU and MUX characterization library, and
the routing wires are estimated as shown in Section 3.1. Finally,
the cost is calculated for the current binding and scheduling
solution. The annealing process exits when the percentage of
accepted moves are low enough.
the MUX cost in the final data path, especially when scheduling
and functional unit binding are fixed. A register allocation
algorithm based on weighted bipartite matching was proposed in
[22] trying to optimize the MUX cost before functional unit
binding. We design a new cost function so the register binding
can be carried out after the functional unit binding and reduce the
total amount of MUX ports directly. Meanwhile, we allow the
register number to be relaxed by a small percentage, which will
introduce more flexibility to reduce MUX cost.
First, the algorithm calls the left edge algorithm to get the
minimum number of registers required. We then relax the register
number by a certain ratio. After that, we get a register set R.
The variables will be assigned to R iteratively. In an iteration,
according to the ascending order of the left edges of the variables,
we select a mutually incompatible set of unassigned variables VIC,
where |VIC| = |R| (We may also relax the size of VIC to include
more variables in order to catch a more global picture). We then
construct a weighted bipartite graph G = (VICR, E), where E =
{(v, r) | vVIC and r R such that v is compatible with the
variables allocated in r}. Each edge will be attached a weight,
which will be discussed later. After solving the minimum weight
bipartite matching, we allocate the variables to R according to the
matching. The process is repeated until all the variables are
allocated.
Left-edge
LOPASS
Comparison
Bench- Reg No. Mux Port Reg No. Mux Port Reg No. Mux Port
marks
dir
44
320
48
237
9.1%
-25.9%
honda
34
214
37
154
8.8%
-28.0%
mcm
54
173
59
146
9.3%
-15.6%
pr
31
82
34
68
9.7%
-17.1%
wang
29
101
32
74
10.3% -26.7%
Ave.
9.4% -22.7%
MUX
(a)
Power
(w)
1.79
1.33
0.88
0.47
0.46
Actual
Wire
Length
188891
100804
97292
43662
41167
Power
(w)
1.68
1.04
1.08
0.58
0.52
Ave.
Estimation Error
Wire
Length
-16.3%
18.2%
-21.3%
-9.3%
-2.9%
Power
6.0%
27.5%
-18.8%
-18.8%
-10.1%
13.6%
16.2%
Functional
Unit
MUX
Wire
Length
158187
119140
76569
39617
39978
Functional
Unit
(b)
Bench Node
MultiCycle
Adder
plier
marks No.
dir
152
9
16
32
honda 101
9
14
29
mcm
98
23
6
36
pr
46
13
8
24
wang 52
5
8
29
5. EXPERIMENTAL RESULTS
Our LOw Power Architectural Synthesis System (LOPASS)
consists of the simultaneous binding and scheduling followed by
MUX optimization. We will show our MUX optimization results
separately in Section 5.1 before we show the power reduction
LOPASS
Reg
MultiReg
Adder
Cycle
plier
No.
No.
75
5
7
32
55
55
3
6
28
40
118
4
3
35
59
33
2
2
23
34
29
2
2
28
30
S-BC usually uses multipliers of different sizes for constant handling and
timing optimization. Although S-BC uses more multipliers than
LOPASS, the sizes of their multipliers can be smaller than those used in
LOPASS. LOPASS only uses multipliers of the same size. We set high
effort option for S-BC.
S-BC
Benchm LUT
arks
No.
dir
18658
LOPASS
2.55
11281
Comparison
Delay Power
(ns)
(w) LUT No. Delay
53.5
1.68
Power
honda
16426
43.4
1.85
7584
40
1.04
mcm
15991
46.8
1.97
7396
50.8
1.08
-53.7%
pr
7663
30.2
0.72
3873
34.4
0.58
wang
9057
35.7
0.83
3925
35.4
0.52
Ave.
8.5%
-44.8%
7. ACKNOWLEDGMENTS
This work is partially supported by the NSF Grant CCR-0096383
and Altera Corporation under the California MICRO program.
8. REFERENCES
[1] A. Raghunathan and N.K. Jha, Behavioral synthesis for low-power,
International Conference on Computer Design, Oct 1994.
[2] P. Kollig and B.M. Al-Hashimi, A new approach to simultaneous
scheduling, allocation and binding in high level synthesis, IEE
Electronics Letters, vol. 33, Aug 1997.