IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 7, JULY 2008
I. INTRODUCTION
FIELD-PROGRAMMABLE gate arrays (FPGAs) are consistently improving in capacity and performance, and are
now among the most popular devices in the market. With their
regular structure, they also scale easily to future technologies.
However, the large overheads of their programmable interconnect are severely limiting their growth. In an SRAM-based
FPGA, the programmable interconnect resources take almost
70% of the die area and consume the major part of FPGA
power. Furthermore, for most designs, they also constitute
more than 50% of the critical path delay. Therefore, a reduction
in the interconnect resources will greatly benefit FPGAs.
Three-dimensional integration is a promising technique for
reducing wire-lengths. It involves the stacking of multiple silicon wafers interconnected with vias. If every layer in a 3-D
chip implements a normal (2-D) FPGA, stacking reduces the
average Manhattan distance between logic blocks, which leads
to fewer interconnect resources. Consequently, 3-D integration
of FPGAs (which we refer to as 3-D FPGA) is an attractive technique to improve the performance of FPGAs. Other gains, such
as reduced design footprint and the ability to integrate different
technologies, further favor 3-D FPGAs.
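As a rough illustration of why stacking shortens wires, the following Python sketch compares the average Manhattan distance between all pairs of blocks in a flat grid with that of the same number of blocks folded into several layers. The grid sizes, the function name, and the assumption that one vertical hop costs the same as one horizontal hop are illustrative choices, not values taken from this work.

```python
from itertools import product


def average_manhattan(dims):
    """Mean Manhattan distance over all ordered pairs of grid points.

    dims is (columns, rows, layers). A vertical hop is assumed to cost
    the same as a horizontal hop, which is a simplification.
    """
    points = list(product(*(range(n) for n in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(p, q))
        for p in points
        for q in points
    )
    return total / (len(points) ** 2)


flat = average_manhattan((16, 16, 1))   # 256 blocks on a single layer
stacked = average_manhattan((8, 8, 4))  # the same 256 blocks on four layers
print(flat, stacked)
```

Under these assumptions the stacked arrangement gives a noticeably smaller average distance, which is the effect the paragraph above relies on.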
Manuscript received November 10, 2006; revised March 26, 2007. This work
was supported in part by the National Science Foundation under NSF CAREER
0093085, NSF CCF 0702617, and by a grant from MARCO/GSRC.
A. Gayasen is with R&D Department, Synopsys, Sunnyvale, CA 94043 USA
(e-mail: gayasen@cse.psu.edu).
N. Vijaykrishnan is with the Departments of Computer Science and Engineering and Electrical Engineering, Pennsylvania State University, University
Park, PA 16802 USA.
M. Kandemir is with the Computer Science and Engineering Department,
Pennsylvania State University, University Park, PA 16802 USA.
A. Rahman is with Xilinx Research Laboratories, San Jose, CA 95124 USA.
Digital Object Identifier 10.1109/TVLSI.2008.2000456
GAYASEN et al.: DESIGNING A 3-D FPGA: SWITCH BOX ARCHITECTURE AND THERMAL ISSUES
the design. They typically provide the user with the thermal
resistance ($\theta_{JA}$) of the package, which is used to estimate the
junction temperature ($T_J$) using

$$T_J = T_A + \theta_{JA} \cdot P \qquad (1)$$

where $T_A$ is the ambient temperature, and $P$ refers to the
total power consumed by the chip.
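The relation above (junction temperature equals ambient temperature plus package thermal resistance times total power) can be sketched in a few lines of Python. The function name and the example numbers (25 °C ambient, 20 W) are assumptions chosen for illustration; only the 0.5 °C/W thermal resistance appears elsewhere in this paper.

```python
def junction_temperature(ambient_c, theta_ja_c_per_w, power_w):
    """Estimate junction temperature in degrees Celsius.

    ambient_c: ambient temperature (deg C)
    theta_ja_c_per_w: package thermal resistance (deg C per watt)
    power_w: total power consumed by the chip (watts)
    """
    return ambient_c + theta_ja_c_per_w * power_w


# Hypothetical operating point: 25 C ambient, 0.5 C/W package, 20 W chip.
single = junction_temperature(25.0, 0.5, 20.0)
# Doubling the power doubles the temperature rise above ambient.
doubled = junction_temperature(25.0, 0.5, 40.0)
print(single, doubled)
```

This lumped estimate gives a single temperature for the whole die, so it cannot capture hotspots; that is why a simulator is needed for the thermal profiles discussed later.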
As designing the package for the worst case junction temperature started becoming too expensive, researchers started
looking at design level solutions to reduce the temperature. A
common example is dynamic thermal management (DTM),
where the design is run at a reduced power (and performance)
if the chip temperature increases beyond a previously set
threshold. Thermal sensors measure the temperature, and
power is reduced by lowering the clock frequency or the supply
voltage, or by clock gating [18].
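A threshold-based DTM policy of the kind described above might look like the following Python sketch. It is a hypothetical controller, not one taken from [18]: the function name, the frequency limits, and the 10% scaling step are all assumptions made for illustration.

```python
def dtm_step(temperature_c, threshold_c, frequency_mhz,
             min_frequency_mhz=100.0, max_frequency_mhz=500.0, scale=0.9):
    """One control step of a simple threshold-based DTM policy.

    If the sensed temperature exceeds the threshold, the clock frequency
    is scaled down (never below the minimum). Otherwise it is allowed to
    recover toward the maximum.
    """
    if temperature_c > threshold_c:
        return max(min_frequency_mhz, frequency_mhz * scale)
    return min(max_frequency_mhz, frequency_mhz / scale)


throttled = dtm_step(90.0, 85.0, 500.0)   # too hot: slow down
recovered = dtm_step(70.0, 85.0, 500.0)   # cool: stay at the maximum
print(throttled, recovered)
```

The sketch makes the cost of DTM visible: every throttling step directly reduces performance, which is the motivation for the design-level techniques discussed next.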
Design level techniques can also aid in removing the heat generated by the design. For example, thermal-aware floorplanning
tries to reduce the hotspots on the die by distributing the temperature uniformly [19], [20]. These works have mostly focused
on microprocessors. Thermal placement is a similar technique applied at the placement stage. Chen and Sapatnekar [21] proposed a partition-driven algorithm for standard
cell thermal placement. Thermal floorplanning and placement
are particularly attractive because they impact the performance
less than DTM.
On the modeling front, several researchers have developed
tools for estimating the die temperature. Among them, HotSpot
[22] is an architecture-level thermal simulator, which can perform transient as well as steady-state temperature estimation.
HS3d [23] is another architecture-level tool that performs only
steady-state temperature estimation, but is orders of magnitude faster than HotSpot. Both HS3d and HotSpot provide
the flexibility to set several package and die parameters, such
as the spreader thickness, package-to-air thermal resistance,
and substrate thickness. Since, in this work, we
look at only steady-state temperatures, we use HS3d.
Recently, some researchers have proposed solutions for
thermal issues in 3-D ICs too. Cong et al. [24] suggested a
thermal-driven floorplanning for 3-D. Goplen and Sapatnekar
[25] also proposed a temperature-driven placement algorithm
for 3-D standard cell application-specific integrated circuits
(ASICs). Studies have also indicated that careful insertion of
thermal vias can reduce the peak temperature [26], [27].
Thermal issues in FPGAs are relatively unexplored. Some
researchers have proposed the use of distributed sensors for
monitoring temperatures in FPGAs [28], [29]. They, however,
considered only configurable logic blocks (CLBs) in the fabric,
and consequently, observed very little temperature variation
across the die. In contrast, we focus on platform FPGAs,
containing embedded circuit blocks including high-speed transceivers, multipliers, delay-locked loops (DLLs), and memories
[6], [30]. Here, we first characterize the temperature distribution in a modern 2-D FPGA, and then observe how it changes
when we stack multiple such layers. Next, we propose changes
in the placement of hard blocks in the 3-D FPGA to reduce the
die temperature.
TABLE I
VIA PROPERTIES
III. BACKGROUND
Since the wafer-bonding 3-D technology is still being perfected, several methods are being explored. These methods result in different via dimensions and wafer thicknesses. For this
study, we explore three different methods, which result in the
via dimensions shown in Table I. Via 1 reflects the process from
Tezzaron [1], which uses a wafer thickness of 10 µm. Depending
on the process steps, we may need handle wafers to support the
thin wafers. For a two-layer f2f stack, we may be able to avoid
the handle wafers if we bond first and then thin the wafers. At
the other extreme is via 3 that uses 50 µm wafers, which reflects
the process in [32]. A larger wafer thickness imparts mechanical strength to the wafers, and eliminates the need for handle
wafers. Via 2 reflects an intermediate process that we use to illustrate the trends due to via dimensions. Note that via length
is important only for f2b integration technology. An integration
technology from MIT uses silicon-on-insulator (SOI) wafers to
reduce the device layer thickness to less than a micrometer [33].
We do not model this technology in this work.
B. 2-D Switch Boxes
Our study will focus on island-style SRAM-based FPGAs.
FPGAs from Xilinx and Altera belong to this category. The
CLB consists of lookup tables (LUTs) and flip-flops (FFs).
Routing wires (tracks) and programmable switches constitute
the routing channel. Channel width refers to the number of
tracks in a channel. The CLBs connect to the channel through
connection boxes. The routing wires connect among themselves
through switch boxes.
Switch box topology refers to the connectivity provided by
the switch box. Researchers have explored several topologies
[34]–[38] (see Fig. 2). The subset (also called disjoint) topology,
used in Xilinx XC4000 FPGAs, connects tracks of the same
number in all four directions. This divides the channel into disjoint sets of tracks and a net uses the same track number for
its route. Universal topology provides more flexibility than disjoint. It facilitates connectivity for all possible global routes of
two-terminal nets.
Research has shown that the universal switch box results in
fewer tracks in the channel [39]. Hyper-universal switch boxes
provide even greater flexibility, and facilitate the connectivity
Fig. 4 shows the SBs we created for this study for $H = 4$ and $V = 2$. Normally, the 3-D SB is visualized as a cube, where
each face of the cube represents one of the directions. However,
for ease of illustration, we have flattened the SB and shown it as
a hexagon, where each side represents a direction: North, East, West, South, top, or bottom.
Furthermore, we show only the connections to the vertical faces
(top and bottom). For all SBs, the horizontal wires (CHANX and
CHANY) use either the subset or universal connections among
themselves. These connections were described in Section II-B
and illustrated in Fig. 2. For clarity, we do not show the horizontal connections in Fig. 4. The first four SBs use subset connections among the horizontal wires, and the last two use universal. Fig. 4 also tabulates the connections from the vertical
faces, identifying each terminal by its face of the SB and its index on that face.
The Appendix formally describes the six SBs.
The first SB [subset, see Fig. 4(a)] is an extension of the 2-D
subset SB. This SB connects the same track number on all sides.
Consequently, the entire routing fabric gets divided into disjoint
subsets, and a net uses the same track number for its entire route.
Note that only the first $V$ of the horizontal wires connect to
the vias. While these wires have a flexibility of 5 (three connections to the other horizontal directions and two to the vertical
ones), the other wires connect to only horizontal tracks (flexibility 3).
Apart from decreasing the routing flexibility, this
results in a difference in the capacitive loads of the horizontal
wires: large for the first $V$ wires and small for the rest.
The second SB [subset-split, see Fig. 4(b)] modifies the
subset SB by allowing the first $V$ horizontal tracks to connect
to the vias going above, and the last $V$ to those going below.
This implies that now there are twice as many horizontal wires
that connect to the vertical wires. Therefore, if nets do not
fan out at the SB, then this SB provides greater flexibility to
the vertical directions. A limitation, however, is that the first
$V$ tracks can only go above, and the last $V$, only below. Consequently, if
a net needs to fan out to both top and bottom, then it needs to
use two horizontal tracks (compared to one for subset). This SB
distributes the capacitive loads on the horizontal tracks more
evenly than the subset SB.
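The difference between the subset and subset-split SBs can be made concrete with a small Python sketch that records, for each horizontal track, which vertical directions it can reach and the resulting flexibility. This is an illustrative model, not the formal definition from the Appendix: the function names are hypothetical, track indices are assumed to start at 0, and `h` and `v` stand for the number of horizontal tracks and the number of tracks with vertical access.

```python
def vertical_access(kind, h, v):
    """For each horizontal track index, return the reachable vertical directions."""
    access = []
    for track in range(h):
        directions = set()
        if kind == "subset":
            # Only the first v tracks reach the vias, in both directions.
            if track < v:
                directions = {"top", "bottom"}
        elif kind == "subset-split":
            # First v tracks go up, last v tracks go down.
            if track < v:
                directions.add("top")
            if track >= h - v:
                directions.add("bottom")
        else:
            raise ValueError(f"unknown switch box kind: {kind}")
        access.append(directions)
    return access


def flexibility(directions):
    """Three connections to the other horizontal sides plus vertical ones."""
    return 3 + len(directions)


subset_flex = [flexibility(d) for d in vertical_access("subset", 4, 2)]
split_flex = [flexibility(d) for d in vertical_access("subset-split", 4, 2)]
print(subset_flex, split_flex)
```

With four horizontal tracks and two vertical ones, the subset SB concentrates all vertical access (and the associated capacitive load) on two tracks, while subset-split spreads it over all four.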
The subset-split SB, although more flexible than subset,
suffers from the disjoint property of subset SBs: the entire
routing fabric is divided into disjoint subsets and a net can
use only one of those subsets. Each disjoint subset consists of
one vertical track and two horizontal tracks. In order to improve upon this, we
modified the connections to the vertical faces as shown in
Fig. 4(d). Now, a vertical terminal connects to track 1 on one side
but to track 0 on another side. This allows the net to switch tracks
at the SBs. We call this SB subset-twist.
The main objective of the subset-twist SB is to improve the
flexibility in the vertical direction. Another way to achieve this
is by adding more switches to the vertical faces, which is the approach
used by the next SB, subset-more [see Fig. 4(c)]. Here, each vertical
terminal connects to two terminals, rather than one, on
the horizontal faces. The extra
switches have a twofold effect. On the one hand, they improve
the flexibility in the vertical direction, and on the other, they increase the area of the SB and the capacitive loads on the wires.
Fig. 4. Switch boxes for H = 4, V = 2. (a) Subset. (b) Subset-split. (c) Subset-more. (d) Subset-twist. (e) Universal-twist. (f) Universal-more.
where the sums run over the nets in the design and over the sink pins of each net,
and the delay term is the estimated
delay from the source of a net to one of its sinks. For each net,
the wiring term uses the x and y spans of its bounding
box. A correction factor compensates for the fact that
the bounding box wire-length model underestimates the wiring
necessary to connect nets with more than three terminals. Its
value depends on the number of terminals of the net.
The cost also uses the average channel capacities in the x- and
y-directions, over the bounding box of the net. A further
parameter adjusts the weight given to congestion in the cost
function. The larger its value, the more wiring in narrow
channels is penalized relative to wiring in wider channels. A
value of 1 has been previously found to work best, and is used
in this work.
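The wiring part of such a cost can be sketched as follows. The sketch follows the bounding-box congestion cost commonly used in VPR, in which each net contributes its correction factor times the sum of each span divided by the corresponding average channel capacity raised to the congestion exponent. The exact form and symbols of this paper's equation are not reproduced here, so the class, field names, and example values below are assumptions.

```python
from dataclasses import dataclass


@dataclass
class NetBox:
    q: float      # correction factor for nets with more than three terminals
    bb_x: float   # x span of the net's bounding box
    bb_y: float   # y span of the net's bounding box
    cap_x: float  # average channel capacity in x over the bounding box
    cap_y: float  # average channel capacity in y over the bounding box


def wiring_cost(nets, beta=1.0):
    """Bounding-box wiring cost with a congestion exponent beta."""
    return sum(
        n.q * (n.bb_x / n.cap_x ** beta + n.bb_y / n.cap_y ** beta)
        for n in nets
    )


# Two hypothetical nets; all values are made up for illustration.
example_nets = [
    NetBox(q=1.0, bb_x=4, bb_y=2, cap_x=10, cap_y=10),
    NetBox(q=1.2, bb_x=6, bb_y=3, cap_x=12, cap_y=12),
]
cost = wiring_cost(example_nets, beta=1.0)
print(cost)
```

Raising `beta` increases the penalty for spans that cross low-capacity channels, which matches the role of the congestion weight described above.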
To the 2-D cost function, we add a term to reduce
the vertical span of the nets. This is similar to what was proposed
in [3], except that, similar to the congestion cost terms for the x- and
y-directions, we incorporate congestion in
Both numbers are estimated by
VPR modified for 3-D. The via area is calculated by counting
the number of vias and multiplying it by the area taken by a via.
While comparing the area of two architectures, we estimate the
total FPGA area and divide it by the number of CLBs in the
fabric to estimate the area per CLB. Thus, the area numbers in
Section IV include the area for one logic block (CLB) and the
routing resources (horizontal wires, switches, and vias) associated with it.
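The area-per-CLB calculation described above can be sketched as below. The function and parameter names are hypothetical, the units are arbitrary, and the example values are invented for illustration; the sketch simply adds the via area (via count times area per via) to the rest of the fabric area and divides by the number of CLBs.

```python
def area_per_clb(fabric_area, num_vias, area_per_via, num_clbs):
    """Total FPGA area (fabric plus vias) divided by the number of CLBs.

    fabric_area: area of logic and horizontal routing, in arbitrary units
    num_vias: number of vias in the fabric
    area_per_via: area taken by one via, in the same units
    num_clbs: number of CLBs in the fabric
    """
    via_area = num_vias * area_per_via
    return (fabric_area + via_area) / num_clbs


# Hypothetical fabric: 100 CLBs, 50 vias of 4 area units each.
example_area = area_per_clb(fabric_area=1000.0, num_vias=50,
                            area_per_via=4.0, num_clbs=100)
print(example_area)
```

Normalizing by the number of CLBs makes architectures with different layer counts directly comparable.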
C. Results and Analysis
Here, we show the results for two extremes of 3-D integration: first, a simple stack of two layers and, second, a more aggressive stack of five layers. Together they capture the trends
seen by varying the number of layers in a 3-D FPGA. While
the two-layer FPGA can be fabricated using f2f or f2b wafer
bonding, the five-layer FPGA must be fabricated using f2b. For
all these technology points, we evaluate the effects of different
via dimensions shown in Table I. The metric we primarily look
at to evaluate an architecture is the area-delay product (ADP),
TABLE II
EXPERIMENTAL RESULTS FOR UNIVERSAL-TWIST (65 nm, 3-µm PITCH VIAS)
because it is inversely proportional to the throughput of the device [43]. In all the figures in this section, we plot the geometric
means over 20 MCNC benchmarks.
The first step towards evaluating 3-D FPGAs is comparing
them with 2-D FPGAs. Fig. 6 shows the average area (per CLB),
delay, and ADP for 1, 2, and 5 layers in 65-nm technology. For
both 2 and 5 layers, it shows the results for the three via technologies of Table I. The key 2-layers-f2f-3µm in Fig. 6 refers
to the use of two device layers, stacked using f2f bonding with
vias at 3-µm pitch (via 1 in Table I). Fig. 6 uses the same switch
box (universal-twist) for all cases.
The area is estimated as explained in Section III-B3. Note that
area reduces as we increase the number of layers or reduce the
pitch of the vias. The smallest area is obtained when five layers
are used with 3-µm-pitch vias, in which case, the CLB area is
only 84% of the single-layer case. Furthermore, we observe that
the area of the two-layer FPGA using f2f bonding remains constant with increasing via pitches. This happens because the vias
in this case are accommodated within the transistors' footprint,
and the CLB area is determined by the transistors.
The critical path delay also reduces with increasing number
of layers (second set of bars in Fig. 6). The five-layer FPGA
with 5-µm-pitch vias (best case) reduces the delay by 24.7%
compared with the single layer case and by 14% compared with
the two-layer case. This happens because interconnect lengths
(and hence delays) reduce as we increase the number of layers.
F2f and f2b technologies do not have any significant impact on
the delay.
The reduction of area and delay in 3-D combine to significantly reduce the area-delay product of the FPGA (third set
of bars in Fig. 6). The five-layer FPGA reduces the area-delay
product by 36% (for 3-µm-pitch vias), while a two-layer
FPGA does so by about 20%, when compared to a single-layer
FPGA. These results justify the interest in 3-D FPGAs and
demonstrate that we can obtain significant improvements even
by the relatively simple integration of two FPGA layers. The
results also indicate that even by using the moderately aggressive 5-µm-pitch vias, we can significantly improve upon 2-D
FPGAs. Table II tabulates some of the results for five-layer
FPGA using universal-twist switch box.
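To make the metric concrete, the following Python sketch computes the area-delay product and its geometric mean over a set of benchmarks. The benchmark values are invented for illustration, with each value normalized to a single-layer baseline; the helper names are assumptions.

```python
import math


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def area_delay_product(area, delay):
    """Area-delay product of one design point."""
    return area * delay


# Hypothetical normalized results for three benchmarks (1.0 = 2-D baseline).
areas = [0.80, 0.90, 0.85]
delays = [0.70, 0.75, 0.80]
adps = [area_delay_product(a, d) for a, d in zip(areas, delays)]

mean_adp = geomean(adps)
# The geometric mean of products equals the product of geometric means.
product_of_means = geomean(areas) * geomean(delays)
print(mean_adp, product_of_means)
```

Because the geometric mean is multiplicative, the mean area and mean delay ratios can be multiplied as a rough consistency check: 0.84 times (1 - 0.247) is about 0.63, close to the reported 36% ADP reduction. The two quoted figures are for different via pitches, so this check is only approximate.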
Now, we explore the different switch boxes to find which one
gives the best values for area, delay, and area-delay product.
Fig. 7 shows the results for five layers, using 65-nm process and
3-µm-pitch vias (via 1 in Table I). The results for two layers
follow a similar trend. The first set of bars in Fig. 7 compares the
flexibility in the vertical direction of the various SBs by looking
at the minimum number of vias they take for the designs to route.
Observe that the universal-more type of SB provides the greatest
flexibility (minimum number of vias). In fact, it uses only 49%
of the vias needed by the subset SB. It also results in the minimum channel width among all the SBs. However, the total area
is determined by both the vias and the number of transistors in
the fabric. Since universal-more uses extra switches to increase
flexibility, we observe that the total area taken by the FPGA
using universal-more SB is larger than that of the one with universal-twist SB. This indicates that the universal-twist SB provides greater flexibility per switch than the universal-more SB.
While the area reduces to 88% by using the universal-twist SB instead of the subset SB, the critical path delay does
not show such a strong variation. This happens because the
timing-driven router of 3-D VPR gives less weight to congestion
for timing-critical nets, which implies that they almost always
take the shortest possible route. The smallest delay is obtained
for the subset-split case. Note that adding more switches to the
SB increases the delay, which is explained by the larger parasitic capacitances due to these switches. Because the variation
Fig. 8. Comparing the switch boxes for different via technologies for five-layer FPGA.
Fig. 9. Comparing the switch boxes for different process nodes for five-layer FPGA.
Fig. 10. Thermal profile of 4VFX100.
device. Circuits run slower when they are hot, and their lifetime
reduces exponentially with increasing temperature. Besides,
plastic packages can only withstand relatively low temperatures. Furthermore, leakage power increases exponentially with
temperature, which can cause a thermal runaway.
All these factors have forced chip manufacturers to employ
techniques to control the die temperature. Section I described
some of these techniques.
Thermal issues in FPGAs are relatively unexplored. Some researchers have proposed the use of distributed sensors for monitoring temperatures in FPGAs [28], [29]. They, however, considered only CLBs in the fabric, and consequently, observed
very little temperature variation across the die. In contrast, we
focus on platform FPGAs, containing embedded circuit blocks
including high-speed transceivers, multipliers, DLLs, and memories [6], [30]. Here, we first characterize the temperature distribution in a modern 2-D FPGA, and then observe how it changes
when we stack multiple such layers. Next, we propose changes
in the placement of hard blocks in the 3-D FPGA to reduce the
die temperature.
A. Thermal-Characterization of FPGAs: 2-D to 3-D
Most modern FPGAs incorporate hard blocks in the fabric.
These blocks exhibit different power characteristics, leading to
variations in power densities within the chip. We calculated the
power numbers for blocks in a Virtex-4 FX100 device by using
Xilinx power spreadsheets and observed that the power densities
vary from 0.78 for the DSP blocks to 11.46 for the DCMs (see
Table III). Such a vast range results in large temperature variations within the FPGA die (see Fig. 10). The hotspots occur near
the MGTs and DCMs, which are about 14 °C above the coolest
portions.
Table IV shows the temperatures for 3-D FPGAs consisting
of identical FPGA layers of 4VFX100. The temperatures were
estimated using HS3d [23] with the parameters listed in Table V.
The thermal resistance value of 0.5 °C/W reflects
a high-end package with a moderate heat sink. Note that, for
both 2-D and 3-D FPGAs, we used the same power numbers
for individual blocks (listed in Table III). This doubles the total
FPGA power for a two-layer FPGA compared to a single-layer
FPGA. This is a pessimistic estimate, because power consumption in the routing fabric is expected to reduce when we stack
TABLE III
POWER DENSITIES IN 4VFX100 (FREQ: 500 MHz)
TABLE IV
EFFECT OF STACKING ON TEMPERATURE
TABLE V
PARAMETERS FOR TEMPERATURE ESTIMATION IN HS3D
for a four-layer FPGA. This large variation in temperature indicates that the peak temperature could be reduced by distributing
the hot blocks more evenly across the fabric. Interestingly, 3-D
technology parameters change the temperatures only minutely.
For a four-layer FPGA, layer thickness changes the peak temperature by up to 4.4 °C, while thermal vias could decrease the
peak temperature by up to 3.4 °C. Fig. 11 shows the effect of
stacking on temperature, as well as the possible variations because of 3-D technology parameters.
B. Thermal-Aware 3-D FPGA Organization
Fig. 12. 3-D FPGA organizations. (a) Two-layer stacked. (b) Two-layer thermal.
TABLE VI
THERMAL-AWARE 3-D FPGA DESIGN
drops from 25.7 °C for the stacked design to only 2.6 °C for the
thermal-aware design.
In the previous experiments, the heat sink is attached closest
to the layer consuming the maximum power. Previous studies
have suggested that this should be preferred. In fact, researchers
have proposed thermal-aware 3-D floorplanning that tries to
place the hot blocks closer to the sink [24]. In order to see the
effect of sink placement, we attached it to the layer containing
only CLBs in the two-layer thermal organization. Table VI also
shows the temperature for this case (two-layer thermal inverted).
We observe that the temperature increases only very slightly because of this change. This happens because the vertical distances
are small compared to the horizontal dimensions of the FPGA.
VI. CONCLUSION
We demonstrated that 3-D FPGAs can provide significant advantages over 2-D by reducing the interconnect area and the
total area-delay product. The 3-D FPGA with five layers and
3-µm-pitch vias reduces the area-delay product of a 2-D FPGA
by 36%. This number may increase even further with improvements in 3-D technology.
We designed and evaluated several switch boxes for 3-D
FPGAs and showed that the area-delay product depends
heavily on the switch box topology. In 65-nm technology, the
area-delay product for our universal-twist switch box is 15%
lower than that of the subset switch box for 5-µm-pitch vias.
We further showed that the universal switch boxes become even
better with scaling process technology, as well as with larger
vias. However, adding more switches to the universal SB does
not provide any benefit.
Three-dimensional integration, however, increases the die
temperature. Our experiments indicate that the peak temperature for a four-layer FPGA could be 2.4 times that of a
single-layer FPGA. However, the large variation in temperature
within the 3-D package allows us to reorganize the 3-D FPGA
to reduce the peak temperature. For a two-layer FPGA, the peak
temperature was reduced by 16 °C when the design was altered to
create a more uniform temperature profile.
[Appendix pseudocode defining the switch box connections. Only the labels SUBSET_TWIST, UNIVERSAL_TWIST, SUBSET_MORE, and UNIVERSAL_MORE are recoverable.]
REFERENCES
Mahmut Kandemir (M'03) received the Ph.D. degree in electrical engineering and computer science
from Syracuse University, Syracuse, NY, in 1999.
He is an Associate Professor with the Computer
Science and Engineering Department, Pennsylvania
State University, University Park. His main research
interests include optimizing compilers, I/O intensive
applications, and power-aware computing.
Dr. Kandemir is a member of the ACM. His
research is supported by NSF, DARPA, SRC, and
Microsoft.