Professional Documents
Culture Documents
2. BACKGROUND
This section provides background on clock networks and
describes previous work related to either clock-aware
placement or FPGA clock networks in general.
3.1 Architecture
The parameterized framework supports FPGA clock networks
with multiple local and global clock regions and varying
amounts of flexibility at various levels of the clock network. In
this framework, clock networks have three stages (see Fig. 2).
The first stage programmably connects some number of clock
sources (clock pads, PLL/DLL outputs, or internal signals)
from the periphery of the FPGA to the center of the clock
regions. These connections can be made using either local or
global clock networks. The local clock networks connect the
periphery to the center of the regions directly, whereas the
global clock networks are routed to the center of the FPGA
before spanning out to the clock regions. The second stage
programmably connects the clock networks from the center of
the clock regions to the logic blocks within those regions.
Finally, the third stage connects the clock networks from the
inputs of each logic block to the flip-flops within that logic
block.
(a) Stage 1
(b) Stage 2
(c) Stage 3
Description
Baseline
Value
Range
in this
Paper
1-4
1-4
128
128
54
0-54
54
54
10
6-14
1-3
4. CLOCK-AWARE PLACEMENT
The previous section presented a parameterized framework to
describe FPGA clock networks. This section describes new
clock-aware CAD techniques that are needed to implement user
circuits on FPGAs that have clock networks that can be
described with this framework. Specifically, it describes new
placement techniques that ensure that placement solutions are
legal and minimize the impact of the constraints imposed by the
clock network.
Placement can have a significant impact on power, delay,
and routability. Placing logic blocks of a critical-path close
together minimizes delay and placing clusters connected by
many wires close together improves power and routability.
Power can be further minimized by assigning more weight to
connections with high toggle rates, as described in [10]. For
applications with many clock domains, however, constraints
that limit the number of clock nets within ribs (Wrib) and within
a region (Wreg) can interfere with these optimizations.
To investigate how clock networks constraints affect the
power, delay, and routability during placement, the T-VPlace
algorithm from [11] was enhanced to make it clock-aware. T-
Timing Cost
Wiring Cost
+ (1 )
Previous Timing Cost
Previous Wiring Cost
Timing Cost
Wiring Cost
+ (1 )
+
Previous Timing Cost
Previous Wiring Cost
Clock Cost
Previous Clock Cost
assign_global_resources_statically () {
num_global_min = max( 0, num_clock_nets num_local_regions * num_local_clocks );
num_global_safe = min( num_clock_nets, num_global_clocks,
num_global_min + num_clock_nets *
Safety_Margin );
/* biggest to smallest by number of pins*/
sorted_clock_nets = sort( clock_nets, num_pins );
allow the placer to adapt to the change. The following pseudocode is based on T-VPlace, which is fully described in [11].
The clock nets are reassigned using the routine described in
Fig. 5. The routine begins by calculating how spread out each
clock net is by calculating the locality distance, which counts
how many clock pins would need to move in order to make the
clock net local. After sorting each net by locality distance, it
then assigns global clock network resources to half of the
remaining local clock nets that have the highest locality
distance.
4.3 Legalization
The final enhancement needed to make the placer clock-aware
is to produce a placement that is legal. Legalization ensures
that the number of different clock nets used in every region is
less than or equal to the number of clock resources that are
available.
We consider two approaches. The first approach finds a
legal solution before placement. A legal solution is found using
simulated annealing with the timing and wiring cost
components turned off (leaving only the clock cost component
turned on). If a legal placement is found, the actual placement
is then performed with all three cost components turned on but
only allowing legal swaps.
The second approach involves legalizing during the actual
placement. The algorithm starts with a random placement and
then uses simulated annealing with the timing, wiring, and
clock costs turned on. To gradually legalize the placement, the
clock cost component is modified to severely penalize swaps
that make the placement either illegal or more illegal. In other
words, illegal swaps are allowed but they have a higher cost.
Explicitly, we multiplied the cost of using a rib, spine, or global
routing resource that is not available by a constant value, called
Illegal_Factor. In our experiments, we found 10 to be a
suitable value for Illegal_Factor.
Cost
standard
standard
standard
standard
gradual
gradual
gradual
gradual
Assignment
statically
statically
dynamic
dynamic
statically
statically
dynamic
dynamic
Legalization
pre-placement
during placement
pre-placement
during placement
pre-placement
during placement
pre-placement
during placement
# LUTs
# FFs
# Clocks
15778
5710
15879
2361
10
15
1lrg40sml
9452
3232
41
30seq20comb
16398
2984
50
3lrg50sml
10574
4433
53
10lrg
15med
40mixedsize
12966
4784
40
4lrg4med16sml
11139
5458
24
50mixedsize
16477
5681
50
5lrg35sml
11397
3646
40
60mixedsize
15246
5217
60
70s1423
13632
7308
70
70sml
4994
2465
70
lotsaffs
7712
4216
19
lotsaffs2
9309
4567
32
P1
P2
P3
P4
P5
P6
P7
P8 Orig
2.99
4.60
6.43
5.50
8.16
6.59
10.4
6.69
10.4
5.23
3.52
2.30
0.00
3.06
4.65
6.52
5.46
9.77
6.71
10.2
6.96
10.1
5.40
3.60
2.28
8.03
3.07
4.66
6.45
5.17
7.90
6.69
10.0
6.31
10.0
10.8
4.92
3.46
2.27
6.45
3.79
4.52
6.37
5.11
8.02
6.41
10.3
6.43
9.49
4.86
3.33
2.14
2.89
4.48
6.85
5.15
7.63
6.33
9.59
6.10
9.09
0.0
5.11
3.47
2.24
6.34
2.89
4.49
6.85
5.14
7.84
6.60
9.77
6.76
9.29
4.89
3.73
2.28
6.42
2.86
4.41
6.22
4.96
7.46
6.20
9.56
6.15
10.0
10.1
4.79
3.39
2.31
6.34
2.84
4.69
6.02
4.84
7.58
6.37
9.59
5.93
8.85
10.7
4.77
3.30
2.14
6.63
2.91
4.50
6.48
5.42
7.79
6.34
9.98
6.25
9.56
11.0
5.07
3.73
2.32
the other placers (P3, P4, P7, and P8), which assign global
clock network resource dynamically. Intuitively, assigning the
global clock dynamically works better since more information
is available when the resources are being assigned.
Specifically, the placer can determine which nets are most
spread out.
Another notable observation is that the gradual clock cost
function produced placements that were more energy efficient
than the standard cost function. On average, placers P5 to P8
were between 3.5% and 5.4% more energy efficient than
placers P1 to P4, respectively.
Table 5 compares the average energy of the clock and
general purpose routing resource and the average critical-path
delay of each clock-aware placer to that of the original nonclock-aware placer from VPR. Note that the original placer
ignores clock nets and any of the associated placement
constraints. Therefore, the placement solutions produced by the
original placer are not legal. This table serves to highlight the
impact that the clock constraints have on each of the placers.
Table 5. Impact of clock constraints for each placer.
Placer
P1
P2
P3
P4
P5
P6
P7
P8
Orig
Clock
Energy
(nJ)
1.79
1.80
1.69
1.68
2.00
1.98
1.84
1.83
2.14
% Diff
-16.3
-15.9
-20.7
-21.2
-6.6
-7.2
-13.7
-14.2
-
Routing
Energy
(nJ)
2.40
2.43
2.38
2.38
1.96
1.97
2.04
1.96
1.92
% Diff
Tcrit
(ns)
% Diff
24.9
26.5
23.7
23.7
1.7
2.5
6.0
1.8
-
57.9
58.4
58.6
57.8
57.8
57.4
58.8
57.4
57.9
0.0
0.9
1.1
-0.3
-0.1
-1.0
1.5
-0.9
-
Benchmark
10lrg
15med
1lrg40sml
30sq20cm
3lrg50sml
40mixed
4lg4m16s
50mixed
5lrg35sl
60mixed
70s1423
70sml
lotsaffs2
lotsaffs
Geomean
% Difference
Tcrit (ns)
Energy per Cycle (nJ)
Wlb=1 Wlb=2 Wlb=3 Wlb=1 Wlb=2 Wlb=3
65.1
66.8
70.7
68.7
34.6
65.0
71.1
64.7
66.0
40.5
28.8
64.7
35.5
35.1
54.7
-5.0
69.0
71.2
71.7
71.3
35.0
72.0
76.1
68.6
77.4
39.4
30.3
70.7
36.0
35.9
57.6
-
66.5
70.1
70.8
71.3
35.1
69.3
77.7
68.2
75.4
39.1
30.3
68.3
36.0
36.0
57.0
-1.0
6.33
2.67
4.20
5.83
4.61
7.03
5.95
8.89
5.67
8.49
8.93
4.53
3.17
2.11
5.06
-7.2
6.48
2.86
4.46
6.24
4.95
7.58
7.28
9.71
6.16
8.78
10.7
4.88
3.40
2.20
5.45
-
7.50
2.88
4.76
6.24
5.18
7.83
6.48
9.73
6.25
9.08
10.7
4.83
3.40
2.26
5.58
2.3
Benchmark
10lrg
15med
1lrg40sml
30sq20cm
3lrg50sml
40mixed
4lg4m16s
50mixed
5lrg35sl
60mixed
70s1423
70sml
lotsaffs2
lotsaffs
Geomean
% Diff
6.80
3.03
5.83
6.45
4.89
8.59
6.37
9.68
5.99
8.86
5.04
3.32
2.41
5.51
6.0
6.44
2.83
4.43
6.87
5.02
7.54
6.40
9.52
5.96
9.01
4.84
3.30
2.21
5.27
1.4
6.37
2.84
4.45
7.33
4.89
7.56
6.42
9.54
6.62
8.96
11.4
4.75
3.36
2.17
5.32
2.4
6.34
2.84
4.69
6.02
4.84
7.58
6.37
9.59
5.93
8.85
10.7
4.77
3.30
2.14
5.20
-
6.33
2.89
4.25
5.99
4.85
7.94
6.25
9.50
6.26
8.91
10.4
4.98
3.34
2.22
5.23
0.6
6.34
2.86
4.83
5.99
4.85
7.98
6.17
10.6
6.01
8.94
10.3
4.67
3.36
2.22
5.28
1.6
13
14
6.35
2.83
4.39
6.10
4.86
7.47
6.23
9.57
5.97
8.91
10.3
4.69
3.39
2.23
5.19
-0.2
6.34
2.84
4.31
5.97
4.75
7.53
6.35
9.60
6.10
9.09
10.3
4.88
3.34
2.22
5.20
0.0
Clock
%
Energy (nJ) Diff
1.82
1.83
1.84
1.83
1.83
1.84
1.83
1.84
1.84
-0.6
0.0
0.3
0.2
0.5
0.1
0.4
0.3
Routing
Energy (nJ)
%
Diff
Tcrit
(ns)
%
Diff
2.25
2.00
2.02
1.98
1.98
2.01
2.05
1.96
1.98
13.8
1.3
2.0
0.1
1.4
3.9
-1.1
0.1
57.2
56.8
57.3
57.3
57.1
56.9
57.4
57.1
57.0
0.3
-0.4
0.4
0.3
-0.3
0.6
0.1
-0.1
%
Routing
Diff Energy (nJ)
0.0
2.01
-0.2
2.01
-0.5
1.99
-0.1
2.01
-0.2
2.01
0.0
2.01
-0.1
2.01
-0.1
2.01
-0.1
2.01
-0.1
2.01
%
Diff
0.0
0.2
-0.9
0.0
-0.1
0.0
0.0
0.0
0.0
0.0
Tcrit
(ns)
58.1
58.6
58.1
58.0
58.0
58.1
58.1
58.1
58.1
58.1
%
Diff
0.8
-0.1
-0.1
-0.2
0.0
0.0
0.0
0.0
0.0
11
8.21
3.51
5.66
8.46
9.41
8.02
12.60
7.69
4.57
3.03
22
6.34
2.84
4.69
6.02
4.84
7.58
6.37
9.59
5.93
8.85
10.7
4.77
3.30
2.14
33
5.98
2.66
3.98
5.61
4.38
6.84
5.78
8.59
5.62
8.02
9.5
4.36
2.97
1.96
44
5.63
2.58
3.91
5.39
5.24
6.77
5.58
8.29
5.24
7.82
9.30
4.30
2.79
1.87
Geomean
% Diff
6.5
30.3
5.0
-
4.6
-8.8
4.4
-12.5
7. CONCLUSION
This paper described new clock-aware placement techniques
for FPGAs and then examined how the architecture of FPGA