Pipelined FPGA Floating-Point Multiplier
Malte Baesler, Sven-Ole Voigt, Thomas Teufel
Institute for Reliable Computing
Hamburg University of Technology
September 1st, 2010
Agenda
1. Introduction
a) Why Decimal Floating-Point Arithmetic?
b) What are the Requirements on the Decimal Multiplier?
2. Decimal Fixed-Point Multiplier
3. Decimal Floating-Point Multiplier
4. Post Place & Route Results
a) Fixed-Point Multiplier
b) Floating-Point Multiplier
Introduction
Why decimal floating-point arithmetic?
● avoid conversion errors
● human-centric applications
● required for commercial applications, e.g. interest calculation
IEEE Standard 754-2008 for Floating-Point Arithmetic
● published in August 2008
● replaces IEEE 754-1985 and IEEE 854-1987
● binary and decimal floating-point arithmetic
IEEE 754-2008 Floating-Point Arithmetic
decimal64 data format
● radix b = 10
● significand precision p = 16
● exponent range q_min = −398, q_max = 369
Requirements on the multiplier
● fast
● low resource usage
● IEEE 754-2008 compliant
● pipelined due to reuse in accurate scalar product
→ fully combinational
● optimized for FPGA architecture (Virtex-5)
– internal fast carry chain
– DSP48E slices
Decimal Fixed-Point Multiplier
Fixed-Point Multiplier
How does multiplication work?
school method:
● partial product generation
● accumulation of partial products
1234 ⋅ 5678 = 5000 ⋅ 1234
            +  600 ⋅ 1234
            +   70 ⋅ 1234
            +    8 ⋅ 1234
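The two steps of the school method can be sketched in software as follows. This is only an illustration of partial product generation and accumulation on integers, not the hardware design; the function name `school_multiply` is made up for this example.

```python
def school_multiply(a, b):
    """Multiply non-negative integers a and b with the school method.

    Returns the list of partial products (one per decimal digit of b,
    least significant digit first) and their accumulated sum.
    """
    partial_products = []
    position = 0
    while b > 0:
        digit = b % 10                      # current multiplier digit
        partial_products.append(digit * a * 10**position)
        b //= 10
        position += 1
    return partial_products, sum(partial_products)


partials, product = school_multiply(1234, 5678)
print(partials)  # 8*1234, 70*1234, 600*1234, 5000*1234
print(product)
```

A parallel hardware multiplier generates all partial products at once and accumulates them in an adder tree instead of a loop.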
Fixed-Point Multiplier
● based on concepts of A. Vazquez, E. Antelo, P. Montuschi [1]
● fully combinational
● BCD recoding schemes
● fast partial product generation
● fast BCD-4221 carry save adder reduction tree
[1] “A New Family of High-Performance Parallel Decimal Multipliers”, 18th IEEE Symposium on Computer Arithmetic, June 2007
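Why a BCD-4221 coding suits carry save addition can be illustrated with a small sketch. It assumes the usual definition of BCD-4221 (bit weights 4, 2, 2, 1, so that every 4-bit pattern is a valid decimal digit) and applies a plain bitwise 3:2 carry save step to single digits. The helper names are hypothetical, and the decimal doubling of the carry word that a real tree needs is not shown.

```python
from itertools import product

WEIGHTS = (4, 2, 2, 1)  # bit weights of the BCD-4221 code (assumed definition)


def value_4221(bits):
    """Decimal value of a 4-bit BCD-4221 digit given as a tuple of bits."""
    return sum(w * b for w, b in zip(WEIGHTS, bits))


def csa_digit(a, b, c):
    """Bitwise 3:2 carry save step on three BCD-4221 digits.

    Returns (s, h) with value(a) + value(b) + value(c) == value(s) + 2 * value(h).
    """
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h


def check_all_digits():
    """Exhaustively check the carry save identity for all 4-bit patterns."""
    codes = list(product((0, 1), repeat=4))
    for a, b, c in product(codes, repeat=3):
        s, h = csa_digit(a, b, c)
        assert value_4221(a) + value_4221(b) + value_4221(c) == value_4221(s) + 2 * value_4221(h)
        assert 0 <= value_4221(s) <= 9 and 0 <= value_4221(h) <= 9
    return True


print(check_all_digits())
```

Because no bit pattern is invalid in this coding, the sum and carry digits need no decimal correction; only the factor 2 on the carry word has to be handled decimally.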
M. Baesler, S. Voigt, T. Teufel Decimal FloatingPoint Multiplier 09/01/2010
6/30
Introduction Decimal FixedPoint Multiplier Decimal FloatingPoint Multiplier Post Place & Route Results
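Fixed-Point Multiplier
[Figure: block diagram of the fixed-point multiplier. The p-digit operands A and B (BCD-8421) enter PPGen, B via DRec. PPGen outputs the partial products P0, P1, ..., Pp+1 (BCD-4221). CSAT reduces them to the two 2p-digit words S_s and S_w (BCD-4221), and CPA produces the 2p-digit result S (BCD-8421).]
PPGen: Partial Product Generator    CSAT: Carry Save Adder Tree
DRec: Decimal Recoding Unit         CPA: Carry Propagation Adder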
Decimal Recoding
● transforms the multiplier's digit set from {0, ..., 9} into {−5, ..., 5}
● reduces number of multiplicand multiples
A×1, A×2, A×3, A×4, A×5
● very fast operation, no ripple carry
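A minimal sketch of such a recoding, assuming the common signed-digit scheme in which each digit is reduced by 10 when it is 5 or larger and increased by 1 when its lower neighbour is 5 or larger. The function names are illustrative, and the slides do not spell out the exact scheme used in the design.

```python
def recode_sd_radix10(digits):
    """Recode decimal digits (least significant first) into the set {-5, ..., 5}.

    Each output digit depends only on its own input digit and on whether the
    next lower digit is >= 5, so no carry ripples through the word.
    The result has one digit more than the input.
    """
    flags = [1 if d >= 5 else 0 for d in digits]
    recoded = []
    for i, d in enumerate(digits):
        carry_in = flags[i - 1] if i > 0 else 0
        recoded.append(d + carry_in - 10 * flags[i])
    recoded.append(flags[-1])  # extra most significant digit (0 or 1)
    return recoded


def digits_value(digits):
    """Integer value of a (possibly signed) digit list, least significant first."""
    return sum(d * 10**i for i, d in enumerate(digits))


recoded = recode_sd_radix10([8, 7, 6, 5])  # the multiplier 5678
print(recoded)
print(digits_value(recoded))
```

With digits in {−5, ..., 5}, only the multiples A×1 to A×5 are needed; negative digits reuse the same multiples with a sign.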
Partial Product Generator
BCD-4221 Carry Save Adder Tree
Carry Save Adder Tree
carry save adder tree sums up p+1 partial products
[Figure: the partial products P1, P2, P3, ..., Pp+1 arranged for accumulation]
Carry Save Adder Tree
CSA tree with respect to decimal recoding
[Figure: the rows P1, P2, P3, ..., Pp+1; P1, P2 and P3 are each preceded by a sign extension, and C1, C2, ..., Cp are attached to the rows P2, P3, ..., Pp+1]
Carry Save Adder Tree
improved CSA tree with respect to decimal recoding
[Figure: the rows P1, P2, P3, ..., Pp+1 with C1, C2, ..., Cp attached to P2, P3, ..., Pp+1, plus one row labeled "improved sign extension"]
Improved Sign Extension
● adding several words composed of leading nines followed by zeros always yields a word composed of the digits 0, 8, and 9. For example:
     999999990000
   + 999900000000
   + 990000000000
   = x989899990000
● the positions of the digits 0, 8, and 9 can be calculated very fast by means of the FPGA's fast carry chain
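X_k^{NegDC} = \begin{cases} 9 & \text{for } c_k^{in} = 0 \wedge sign_k = 1 \\ 8 & \text{for } c_k^{in} = 1 \wedge sign_k = 1 \\ 0 & \text{else} \end{cases}

c_k^{out} = \begin{cases} 1 & \text{for } sign_k = 1 \\ c_k^{in} & \text{else} \end{cases}

where c_k^{out} is passed on as the carry input of the neighbouring digit position.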
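The observation that such sums contain only the digits 0, 8, and 9 can be checked numerically. The sketch below is an illustration under two assumptions: every word starts its run of nines at a different digit position, and carries out of the top digit are discarded (the "x" in the example). It computes the sum directly and does not model the carry chain logic; the function names are hypothetical.

```python
from itertools import combinations


def sign_extension_word(start, width):
    """Word of `width` digits with nines from position `start` upwards, zeros below."""
    return 10**width - 10**start


def sum_sign_extensions(starts, width):
    """Sum of sign extension words, truncated to `width` digits."""
    total = sum(sign_extension_word(s, width) for s in starts)
    return total % 10**width  # the carry out of the top digit is dropped


def only_0_8_9(width):
    """Check all choices of distinct start positions for a given word width."""
    for count in range(width + 1):
        for starts in combinations(range(width), count):
            word = str(sum_sign_extensions(starts, width)).zfill(width)
            if not set(word) <= set("089"):
                return False
    return True


example = sum_sign_extensions([4, 8, 10], 12)
print(example)        # the slide example without the leading x
print(only_0_8_9(8))
```

If two words started at the same position, other digits could appear, so the distinct-position assumption matters for this check.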
Decimal Floating-Point Multiplier
● additional units for rounding, exponent computation and data format encoding/decoding
● based on M. Erle, B. Hickmann, M. Schulte [2]
● early estimation of shift left amount
● fully IEEE 754-2008 compliant
● support for gradual underflow and all rounding modes
● adapted to FPGA technology
[2] “Decimal Floating-Point Multiplication”, IEEE Transactions on Computers, vol. 58, no. 7, July 2009
[Figure: block diagram of the decimal floating-point multiplier with inputs X and Y. Blocks from top to bottom: Densely Packed Decimal (DPD) Decoder; Leading Zeros Count / Shift Left Amount Computation and Decimal Fixed-Point Multiplier; Left Shift Register; Exponent Computation and Carry Propagate Adder; Round-Up Detection and Overflow / Underflow Correction; Rounding Unit; Exception Unit and DPD Encoder. Outputs: exception signals and X•Y.]
Example:
X = 0x03C80000534B9C1E = +0000001234567890 EXP −156
Y = 0x0250000277CB0D10 = +0000009876543210 EXP −250
X•Y = +12193263111263526900 EXP −406
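The exponents and the exact product in this example can be reproduced with a short sketch. It assumes the decimal64 field layout of IEEE 754-2008 (1 sign bit, 5-bit combination field, 8-bit exponent continuation, exponent bias 398) and only handles the case where the leading significand digit is 0 to 7. Decoding the DPD-coded significand digits is left out, and the function name is made up for this illustration.

```python
BIAS = 398  # decimal64 exponent bias


def decode_exponent(word):
    """Return (unbiased exponent q, leading significand digit) of a decimal64 word.

    Only covers combination fields whose two top bits are not 11
    (leading digit 0..7); other encodings raise ValueError.
    """
    combination = (word >> 58) & 0x1F
    if combination >> 3 == 0b11:
        raise ValueError("leading digit 8/9, infinity or NaN not handled here")
    exponent_msbs = combination >> 3        # two most significant exponent bits
    leading_digit = combination & 0x7
    continuation = (word >> 50) & 0xFF      # remaining eight exponent bits
    return ((exponent_msbs << 8) | continuation) - BIAS, leading_digit


qx, _ = decode_exponent(0x03C80000534B9C1E)
qy, _ = decode_exponent(0x0250000277CB0D10)
coefficient = 1234567890 * 9876543210       # exact 20-digit product
print(qx, qy, qx + qy)
print(coefficient)
```

The exact product has 20 digits, more than the p = 16 digits of decimal64, which is why the shifting, rounding and exponent correction stages follow the fixed-point multiplier.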
[Figure: two variants of the datapath behind the decimal fixed-point multiplier outputs Ps and Pc. In one variant, Ps and Pc each pass through a shift register, giving Qsu, Qsl, Qcu and Qcl. In the other, a CPA (2·p) follows the multiplier.]
[Figure: circuit with two DSP48E slices; labels: ADD, Y(31:16), Y(15:0)]
● saves LUTs
Post Place & Route Results
Decimal Fixed-Point Multiplier with CPA output
● Xilinx Virtex-5, speed grade -2
● up to 13 pipeline registers, configurable via VHDL generics
● 5350 – 6500 LUTs, 0 – 4900 FFs
● 5500 – 7600 combined LUTs and FFs
Decimal Floating-Point Multiplier
              Type1                 Type2                 Type3
              mul-based shifting,   mux-based shifting,   mux-based shifting,
              delayed CPA           delayed CPA           no delayed CPA
#LUTs         6300 – 8400           7900 – 9400           7500 – 9400
#FFs          0 – 4100              0 – 4500              0 – 4400
#(LUT + FFs)  6500 – 8400           8300 – 9300           7600 – 9600
#DSP48E       17                    0                     0
● approx. 70% of the LUTs are used by the fixed-point multiplier (for Type2 and Type3)
● medium-sized Virtex-5 XC5VLX110T: 8000 – 9000 LUTs ~ 11.5% – 13%
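The utilization figures can be reproduced with a short calculation. The device size of 69,120 LUTs for the XC5VLX110T is not stated on the slide; it is taken here from the Virtex-5 family data (17,280 slices with four LUTs each) and should be treated as an assumption of this sketch.

```python
DEVICE_LUTS = 69_120  # XC5VLX110T: 17,280 slices x 4 LUTs (assumed, not from the slide)


def lut_utilization(luts, device_luts=DEVICE_LUTS):
    """Share of the device's LUTs occupied by a design, in percent."""
    return 100.0 * luts / device_luts


low = lut_utilization(8000)
high = lut_utilization(9000)
print(f"{low:.2f}% to {high:.2f}%")
```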
Comparison to binary floating-point multiplier
● 64-bit binary floating-point multiplier generated with CoreGen
● no DSP48E
● Type2 decimal vs. CoreGen binary multiplier
[Figure: two plots comparing the decimal and the binary multiplier over the number of pipeline registers (0 to 9). One shows the number of LUTs (axis 0 to 10000), the other the max. frequency in MHz (axis 0 to 400).]
● decimal multiplier: 3.2 – 3.5 times more LUTs
● binary multiplier: 1.6 – 2.2 times faster
● decimal fixed-point multiplier
– parallel, fully combinational
– configurable number of pipeline stages
● decimal floating-point multiplier
– configurable number of pipeline stages
– three different implementations
– trade-off: area vs. speed
● future work: fully IEEE 754-2008 compliant coprocessor