
Improved SVD systolic array and implementation on FPGA

A. Ahmedsaid, A. Amira and A. Bouridane

School of Computer Science
The Queen's University of Belfast, United Kingdom
a.ahmedsaid@qub.ac.uk
Abstract

This paper presents an efficient systolic array for the computation of the Singular Value Decomposition (SVD). The proposed architecture is three times more efficient and faster than the Brent, Luk, Van Loan (BLV) SVD systolic array. The architecture has been implemented efficiently on FPGA using a high level language for hardware design, "Handel-C".

1. Introduction

The SVD is one of the most important matrix factorisations in linear algebra. Its applications vary from beamforming and source localization to spectral analysis, digital image processing, Principal Component Analysis (PCA), Latent Semantic Analysis (LSA), etc. [3].

The SVD of a matrix $A \in \mathbb{R}^{m \times n}$ is a factorisation of the form:

$$A = U D V^T \qquad (1)$$

where $U \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal ($U^T U = I$ and $V^T V = I$) and $D \in \mathbb{R}^{n \times n}$ is diagonal with nonnegative diagonal elements $\sigma_i$. The numbers $\sigma_i$ are the 'singular values' of A, $U = \{u_1, \ldots, u_n\}$ are the 'left singular vectors' and $V = \{v_1, \ldots, v_n\}$ are the 'right singular vectors'.

There are many numerically stable algorithms for computing the SVD, such as the Jacobi algorithm, the QR method and the one sided Hestenes method [3],[7]. For parallel implementations, the Jacobi method is far superior in terms of simplicity, regularity and local communications. Brent, Luk and Van Loan [2] have shown how the parallel Jacobi algorithm can compute the SVD of a square N x N matrix in O(N log N) time. This is to be compared with the best serial algorithms, which have an O(N³) time complexity.

The rest of the paper is organized as follows. A description of the Jacobi algorithm is given in section 2. The Brent, Luk, Van Loan (BLV) systolic array is presented in section 3. Section 4 shows the main idea behind the improvement of the BLV array and section 5 the adaptation of the array for the computation of the singular vectors. The implementation details and analysis are given in sections 6 and 7. Conclusion and future work are given in section 8.

2. The Jacobi SVD algorithm

Equation (1) can be rewritten in the following form:

$$A = U D V^T \iff U^T A V = D \qquad (2)$$

The Jacobi method exploits (2) to generate the matrices U and V by performing a sequence of orthogonal two sided plane rotations (2D rotations) on the input matrix:

$$A_{i+1} = (J_i^l)^T A_i \, J_i^r \qquad (3)$$

with the property that each new matrix $A_{i+1}$ is 'more diagonal' than its predecessor. After n iterations, the input matrix A is transformed into the diagonal matrix $A_n$:

$$A_n = (J_n^l)^T (J_{n-1}^l)^T \cdots (J_1^l)^T \, A \, J_1^r \cdots J_{n-1}^r J_n^r \qquad (4)$$

which leads to:

$$D = A_n; \qquad V = \prod_i J_i^r; \qquad U = \prod_i J_i^l \qquad (5)$$

The matrices $J_i^l$ and $J_i^r$ are 'Jacobi rotation' matrices $J(p,q,\theta)$: the identity matrix in which the four entries at the intersections of rows and columns p and q are replaced by a 2D rotation,

$$J_{pp} = c, \quad J_{pq} = -s, \quad J_{qp} = s, \quad J_{qq} = c$$

(where $p < q$, $c = \cos\theta$ and $s = \sin\theta$).

The angles $\theta$ are chosen in order to solve a 2 x 2 SVD problem:

$$\begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix}
= \begin{bmatrix} d_{pp} & 0 \\ 0 & d_{qq} \end{bmatrix} \qquad (6)$$

By repeating this for all possible pairs (p,q), A can be effectively diagonalised:

Jacobi SVD algorithm
For s = 1, ..., S
  For p = 1, ..., N-1
    For q = p+1, ..., N
    Begin
      Determine θl, θr (or cos/sin);
      A := J(p,q,θl)^T · A · J(p,q,θr)
    End
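To make the procedure concrete, here is a minimal NumPy sketch of the serial algorithm (our code, not the paper's implementation; the closed-form angle expressions correspond to the TPR relations of section 6.2):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def angles_2x2(a):
    # Closed-form solution of the 2x2 SVD problem (6): theta_r - theta_l
    # and theta_r + theta_l diagonalise the rotation-like and
    # reflection-like parts of the sub-matrix (cf. section 6.2).
    diff = -np.arctan2((a[1, 0] - a[0, 1]) / 2, (a[0, 0] + a[1, 1]) / 2)
    summ = np.arctan2((a[1, 0] + a[0, 1]) / 2, (a[0, 0] - a[1, 1]) / 2)
    return (summ - diff) / 2, (summ + diff) / 2    # theta_l, theta_r

def jacobi_svd(A, sweeps=10):
    N = A.shape[0]
    U, V = np.eye(N), np.eye(N)
    for _ in range(sweeps):                        # loop s
        for p in range(N - 1):                     # loop p
            for q in range(p + 1, N):              # loop q
                tl, tr = angles_2x2(A[np.ix_([p, q], [p, q])])
                Jl, Jr = np.eye(N), np.eye(N)
                Jl[np.ix_([p, q], [p, q])] = rot(tl)
                Jr[np.ix_([p, q], [p, q])] = rot(tr)
                A = Jl.T @ A @ Jr                  # eq. (3)
                U, V = U @ Jl, V @ Jr              # accumulate eq. (5)
    return U, A, V

A = np.random.randn(4, 4)
U, D, V = jacobi_svd(A.copy())
print(np.allclose(U @ D @ V.T, A))                 # True: A = U D V^T
print(np.sort(np.abs(np.diag(D))))                 # singular values (up to sign)
```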

The Jacobi algorithm performs a 2 x 2 SVD for all possible pairs (p,q), which is called a sweep, and then repeats that for as many sweeps as necessary for the convergence (loop s).

As mentioned earlier, the multiplication by $J^r$ affects only the two columns p and q, and the multiplication by $(J^l)^T$ affects only the two rows p and q. Therefore if N is the matrix size and if N is even, then N/2 sub-problems (2 x 2 SVDs) can be processed in parallel. To illustrate this, suppose N = 4 and group the six possible sub-problems into three sets:

Set 1: {(1,2),(3,4)}
Set 2: {(1,3),(2,4)}
Set 3: {(1,4),(2,3)}

All (p,q) pairs within each set are non-conflicting: sub-problems (1,2) and (3,4) can be carried out in parallel, likewise sub-problems (1,3) and (2,4), as can sub-problems (1,4) and (2,3). This way of ordering sub-problems is called parallel ordering; it allows the execution of N/2 sub-problems in parallel. The number of sets is N-1, and a sweep consists of N-1 steps of N/2 rotations executed in parallel.
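A convenient way to generate such a schedule is round-robin (tournament) ordering; the sketch below is our construction and is not necessarily the exact ordering realised by the BLV interchange scheme:

```python
def parallel_ordering(N):
    """Yield N-1 sets of N/2 non-conflicting (p, q) pairs covering all pairs."""
    idx = list(range(1, N + 1))              # 1-based indices, N even
    for _ in range(N - 1):
        yield sorted(tuple(sorted((idx[i], idx[N - 1 - i])))
                     for i in range(N // 2))
        idx.insert(1, idx.pop())             # rotate all indices but the first

for s in parallel_ordering(4):
    print(s)
# [(1, 4), (2, 3)]
# [(1, 3), (2, 4)]
# [(1, 2), (3, 4)]   -- the same three sets as above, in a different order
```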

Figure 1. BLV array, N = 8. (Horizontal arrows: transmission of left rotation parameters; vertical arrows: transmission of right rotation parameters; diagonal arrows: data interchange)

3. The BLV array

The BLV array is a square systolic array that implements a parallel Jacobi SVD algorithm using (N/2)² processors. It performs N/2 sub-problems in parallel, and a sweep in N-1 steps. As the number of sweeps needed for the SVD of an N x N matrix is O(log N) [2], this systolic array is capable of processing the SVD of a square matrix in O(N log N) time. Each processor holds a 2 x 2 sub-matrix of A to which it applies two sided plane rotations. The plane rotations are computed by the diagonal processors, and sent to the off-diagonal processors in the same row and in the same column (figure 1).

Initially a processor Pij holds a 2 x 2 sub-matrix of A. Each step consists of the solution of a 2 x 2 SVD problem (6), which is done by the diagonal processors annihilating their two off-diagonal elements, followed by the application of the computed rotations to the matrix A (3), which is executed by each processor applying two sided rotations to its 2 x 2 sub-matrix (9). After the application of the rotations, the local matrices are interchanged between processors (figures 1, 2) for the execution of a new step. The interchange algorithm ensures the implementation of the parallel ordering by putting data in the appropriate processor according to the new set of (p,q) sub-problem indexes.

Figure 2. Processors data input and output lines


To avoid broadcasting the rotation parameters (θl, θr or cos/sin) in constant time along processor rows and columns, the rotation parameters are transmitted at constant speed between adjacent processors. Let Δij = |i - j| denote the distance of processor Pij from the diagonal. The operation of processor Pij will be delayed by Δij time units relative to the operation of the diagonal processors, in order to allow time for the rotation parameters to be propagated at unit speed along each row and column of the processor array. A processor cannot commence a rotation until data from earlier rotations are available on all its lines. Thus, processor Pij needs data from its four neighbours P(i±1, j±1) (1 ≤ i, j ≤ n/2). Since |Δij - Δ(i±1, j±1)| ≤ 2, it is sufficient for Pij to be idle for two time steps while waiting for the processors P(i±1, j±1) to complete their (possibly delayed) steps. Thus the price paid to avoid broadcasting is that each processor is active for only one third of the total computation (figure 3) [2]. In the next section we show a new method of synchronizing processor operations to increase the efficiency of this systolic array.

4. Improvement of the efficiency of the BLV array

The authors in [2] suggested that the efficiency could be increased to almost unity if multiple problems were interleaved or if the rotation parameters were propagated at greater than unit speed. What is proposed here is to synchronize the processors according to the availability of new data instead of a time step synchronization strategy (algorithm processor [2]):

Algorithm processor
If T ≥ Δ and T - Δ ≡ 0 (mod 3) then
Begin
  If T ≠ Δ then read new 2 x 2 matrix
  If Δ = 0 then {diagonal processor}
    Solve 2 x 2 SVD
    Apply rotations to sub-matrix
  Else
    Read new rotation parameters
    Apply rotations to sub-matrix
    Output rotation parameters
  If i > j then set out_β
  If i < j then set out_γ
End
Else if T ≥ Δ and T - Δ ≡ 1 (mod 3) then
Begin
  If i = 1 or j = 1 then set out_α
  If i = n/2 or j = n/2 then set out_δ
End
Else if T ≥ Δ and T - Δ ≡ 2 (mod 3) then
Begin
  If i > 1 and j > 1 then set out_α
  If i < n/2 or j < n/2 then set out_δ
  If i ≤ j then set out_β
  If i ≥ j then set out_γ
End

In this new algorithm a processor starts working as soon as new data appears on its input lines (figures 4, 5). The operations are data driven rather than step synchronized [2] (figure 3):

New algorithm processor
If i = j then {diagonal processor}
  Solve 2 x 2 SVD
  Output rotation parameters
  Apply rotations to sub-matrix
  Output data (2 x 2 sub-matrix)
  Wait for new data
Else {off-diagonal processor}
  Wait for new rotation parameters
  Output rotation parameters
  Apply rotations to sub-matrix
  Output data (2 x 2 sub-matrix)
  Wait for new data
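The one-third duty cycle of the step synchronized schedule can be checked numerically from the mod-3 condition above; a small sketch (our code, with n/2 = 4, i.e. N = 8):

```python
N2 = 4          # n/2: the processor grid is N2 x N2 (N = 8)
STEPS = 30      # simulated time steps

busy = 0
for T in range(STEPS):
    for i in range(1, N2 + 1):
        for j in range(1, N2 + 1):
            delta = abs(i - j)
            # A processor performs a rotation only in the first of every
            # three time steps once its data has arrived (T >= delta).
            if T >= delta and (T - delta) % 3 == 0:
                busy += 1

print(busy / (STEPS * N2 * N2))   # ~0.33: each processor works 1/3 of the time
```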

Figure 3. How the computations are staggered to avoid broadcasting. The value inside each box indicates the iteration number.

Table 1. Comparison between the BLV array and the proposed array:

                                        BLV array                       Proposed array                Improvement
Computation time of one step            (illegible)                     (illegible)                   About the same
Efficiency                              1/3                             ~1                            3 times better
Computation time of the decomposition   CPT = (3S(N-1) + (N/2-1) + 3)T  CPT = S(N-1)T + T_d(N/2-1)    3 times faster
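A quick numerical check of the decomposition-time rows of Table 1, with T and T_d set to hypothetical unit values:

```python
def cpt_blv(S, N, T=1.0):
    # Table 1: CPT = (3S(N-1) + (N/2-1) + 3) * T
    return (3 * S * (N - 1) + (N / 2 - 1) + 3) * T

def cpt_proposed(S, N, T=1.0, Td=1.0):
    # Table 1: CPT = S(N-1)T + Td(N/2-1)
    return S * (N - 1) * T + Td * (N / 2 - 1)

for N in (8, 16, 32, 64):
    S = 10                       # number of sweeps, O(log N)
    ratio = cpt_blv(S, N) / cpt_proposed(S, N)
    print(N, round(ratio, 2))    # tends to 3 as N grows: '3 times faster'
```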

5. Computation of the singular vectors

Equation (7) suggests that in order to implement the algorithm correctly, the interchange algorithm must be applied to V and U^T. For this reason, it is the matrix U^T which is accumulated rather than U:

$$U^T \leftarrow J(p,q,\theta_l)^T \, U^T \qquad (8)$$

The application of the rotations to the housed sub-matrices is described below. Each processor applies the two sided rotation to its 2 x 2 sub-matrix of A:

$$\begin{bmatrix} a'_{pp} & a'_{pq} \\ a'_{qp} & a'_{qq} \end{bmatrix}
= \begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix} \qquad (9)$$

and the corresponding one sided rotations to its sub-matrices of U^T and V:

$$\begin{bmatrix} u'_{pp} & u'_{pq} \\ u'_{qp} & u'_{qq} \end{bmatrix}
= \begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} u_{pp} & u_{pq} \\ u_{qp} & u_{qq} \end{bmatrix} \qquad (10)$$

$$\begin{bmatrix} v'_{pp} & v'_{pq} \\ v'_{qp} & v'_{qq} \end{bmatrix}
= \begin{bmatrix} v_{pp} & v_{pq} \\ v_{qp} & v_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix} \qquad (11)$$
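A compact NumPy rendering of (9), (10) and (11) for one processor's locally housed 2 x 2 blocks (the function and variable names are ours):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def update_blocks(a, u_t, v, theta_l, theta_r):
    """Apply eq. (9)-(11) to one processor's 2x2 sub-matrices."""
    a_new  = rot(theta_l).T @ a @ rot(theta_r)   # (9)  two sided rotation of A
    ut_new = rot(theta_l).T @ u_t                # (10) left rotation of U^T
    v_new  = v @ rot(theta_r)                    # (11) right rotation of V
    return a_new, ut_new, v_new
```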

6. Hardware implementation of plane rotations

6.1. The CORDIC algorithm

Research advances have encouraged the use of special-purpose arithmetic techniques, which map the algorithm more effectively into hardware. One of the most popular is the Coordinate Rotation Digital Computer (CORDIC) invented by J. Volder in 1959 [6]. Initially Volder developed the CORDIC algorithm for computing the rotation of a vector in a Cartesian coordinate system and evaluating the length and angle of a vector. The CORDIC method was later expanded for multiplication, division, logarithm, exponential and hyperbolic functions. The various function computations were summarized into a unified technique in [7].

CORDIC iteration
For i = 0, ..., m-1
Begin
  x_{i+1} = x_i - d_i · y_i · 2^{-i}
  y_{i+1} = y_i + d_i · x_i · 2^{-i}
  z_{i+1} = z_i - d_i · tan^{-1}(2^{-i})
End

d_i represents the rotation direction and is chosen according to the CORDIC mode. In the 'rotation mode', the input vector is rotated by an angle z_0 = θ and d_i = -1 if z_i < 0, +1 otherwise. In the 'vectoring mode', the input vector is rotated in order to minimize its y component by choosing d_i = +1 if y_i < 0, -1 otherwise, which leads to z_m = z_0 + θ = z_0 + tan^{-1}(y_0/x_0).

The resulting x_m and y_m are scaled by a factor

$$A_m = \prod_{i=0}^{m-1} \sqrt{1 + 2^{-2i}}$$

which must be corrected if necessary. In our implementation we used the method described in [3], with the additional constraint of keeping the Volder sequence (i = 0, ..., m-1) for simplicity and parameterisation reasons. This method consists of correcting the scale factor A_m by a sequence of n shift-add operations:

$$A_m \prod_{j=1}^{n} \left(1 + s(j)\, 2^{-T(j)}\right) = 1 + \Delta A_m \qquad (12)$$

where the T(j) are integers, s(j) = ±1 and ΔA_m is the remaining error. These scaling iterations are parameterised by the set of signed integers {s(1)T(1), s(2)T(2), ..., s(n)T(n)}. Using a computer program we generated the sequence of shift-add operations for a given precision: {1, +2, -5, +9, +10, +16, -23, +27, -28, +31, -34, +35, -39, +41, -45}. For example, if m = 20 the sequence is {1, +2, -5, +9, +10, +16}; if m = 24 the sequence is {1, +2, -5, +9, +10, +16, -23} and ΔA_m < 2^{-21}.
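A floating point model of both CORDIC modes together with the shift-add scale factor correction (our code; we read the leading '1' of the sequence as the factor (1 - 2^-1), which reproduces 1/A_m to the stated precision):

```python
import math

def cordic(x, y, z, m, vectoring=False):
    """Volder CORDIC; returns (x_m, y_m, z_m), with x_m, y_m scaled by A_m."""
    for i in range(m):
        if vectoring:
            d = +1 if y < 0 else -1        # drive y towards 0
        else:
            d = -1 if z < 0 else +1        # drive z towards 0
        x, y, z = (x - d * y * 2.0**-i,
                   y + d * x * 2.0**-i,
                   z - d * math.atan(2.0**-i))
    return x, y, z

M = 24
SEQ = [-1, +2, -5, +9, +10, +16, -23]      # s(j)T(j) for m = 24 (section 6.1)

def correct_gain(v):
    # Multiply by ~1/A_m with the factors (1 + s(j) * 2^-T(j)); in hardware
    # v * (1 +/- 2^-T) costs one shift and one add.
    for t in SEQ:
        v *= 1.0 + math.copysign(2.0**-abs(t), t)
    return v

# Rotation mode: rotate (1, 0) by 30 degrees.
x, y, _ = cordic(1.0, 0.0, math.radians(30.0), M)
print(correct_gain(x), correct_gain(y))    # ~cos(30deg), ~sin(30deg)

# Vectoring mode: length and angle of the vector (3, 4).
x, y, z = cordic(3.0, 4.0, 0.0, M, vectoring=True)
print(correct_gain(x), z)                  # ~5.0, ~atan(4/3)
```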

6.2. The TPR method

The diagonalization of a 2 x 2 matrix can be done by different methods, as shown in [2], but in [3] B. Yang and J.F. Böhme have presented a method called TPR (Two Plane Rotations) that solves the 2 x 2 SVD problem and rotates the input matrix at the same time with a reduced number of computations.

Any 2 x 2 matrix $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ can be decomposed into the sum of two matrices such that:

$$A = A_1 + A_2 = \begin{bmatrix} p_1 & -q_1 \\ q_1 & p_1 \end{bmatrix} + \begin{bmatrix} p_2 & q_2 \\ q_2 & -p_2 \end{bmatrix} \qquad (13)$$

where $p_1 = (a_{11}+a_{22})/2$, $q_1 = (a_{21}-a_{12})/2$, $p_2 = (a_{11}-a_{22})/2$ and $q_2 = (a_{21}+a_{12})/2$. The matrix A is then rotated into B as follows:

$$B = R(\theta_l)^T A \, R(\theta_r) = \begin{bmatrix} r_1 & -t_1 \\ t_1 & r_1 \end{bmatrix} + \begin{bmatrix} r_2 & t_2 \\ t_2 & -r_2 \end{bmatrix} \qquad (14)$$

where $(r_1, t_1)$ is obtained from $(p_1, q_1)$ by a plane rotation through $\theta_r - \theta_l$, and $(r_2, t_2)$ from $(p_2, q_2)$ by a plane rotation through $\theta_r + \theta_l$.

This means a two sided rotation can be performed by only two plane rotations (one computes $r_1$ and $t_1$ and another one computes $r_2$ and $t_2$), and the solution of a 2 x 2 SVD problem together with the computation of the diagonalised matrix can be done by only two plane rotations (one zeros $t_1$ and another one zeros $t_2$).
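A numerical rendering of the TPR solution of the 2 x 2 SVD problem (our variable names; the sign conventions follow the reconstruction of (13)-(14) above):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def tpr_angles(A):
    """Solve the 2x2 SVD problem with Two Plane Rotations [3]."""
    p1 = (A[0, 0] + A[1, 1]) / 2; q1 = (A[1, 0] - A[0, 1]) / 2
    p2 = (A[0, 0] - A[1, 1]) / 2; q2 = (A[1, 0] + A[0, 1]) / 2
    diff = -np.arctan2(q1, p1)     # theta_r - theta_l: zeros t1
    summ = np.arctan2(q2, p2)      # theta_r + theta_l: zeros t2
    return (summ - diff) / 2, (summ + diff) / 2   # theta_l, theta_r

A = np.array([[1.0, 2.0], [3.0, 4.0]])
tl, tr = tpr_angles(A)
print(np.round(rot(tl).T @ A @ rot(tr), 6))
# diagonal matrix; |entries| = singular values of A (~5.46499, ~0.36597)
```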


7. Implementation of the improved array

The implemented systolic array computes both the singular values and the singular vectors. A high level language for hardware design, 'Handel-C' [9], has been used for the implementation. The design is completely parametric: the matrix size and the word length can be easily changed, and if the singular vectors are not wanted, or only the right or the left singular vectors are wanted, the design can be modified with minor effort.

7.1 Design environment

The development platform used is the Celoxica DK1, a Handel-C compiler-debugger, with its accompanying RC1000 board. The RC1000 board is a PCI card equipped with a Xilinx XCV2000E-6 Virtex-E FPGA chip. It has 8 MBytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks, all accessible by the FPGA and any device on the PCI bus. The RC1000 board is supported with a library that simplifies the FPGA configuration files and the data transfer between the host PC and the RC1000 board [9].

7.2 SVD chip

The design comprises a "Control Unit", an "input interface", an "output interface" and the "SVD array" (figure 7). The control unit generates commands to the processors and handles all the IO operations between the FPGA and the off-chip memory, and eventually the host PC. The data matrices are read/written from/to the off-chip memory element by element, which makes the IO requirement independent from the matrix size.


Figure 7. FPGA implementation of the SVD array (N=6)

Processors structure. The diagonalisation is performed by CORDIC modules working in the vectoring mode, and the matrix rotations are performed by CORDIC modules working in the rotation mode (details of CORDIC arithmetic can be found in [3],[4],[5]). Because of area limitations we did not take advantage of the full parallelism; the rotation operations are therefore performed using only one CORDIC rotator in each processor. In addition, the diagonal processors contain two CORDIC modules used for a parallel computation of the angles θl and θr using the TPR method (14). The parallel computation of the rotation angles improves the efficiency of the architecture, as the off-diagonal processors, which represent a large percentage of the area, stay idle during this step. In order to reduce the area, the CORDIC rotation sequences (d_i) corresponding to θl and θr are extracted inside the diagonal processors and sent serially to the off-diagonal processors. By doing this, all lookup tables are removed from the CORDIC rotator modules, as well as the registers and adders/subtractors used for angle accumulation. For each processor, the scale factor correction is performed by a single pipelined module. The processors' operation sequence is summarized in figure 8.
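This d_i forwarding scheme can be mimicked in software: the diagonal processor runs a vectoring-mode CORDIC and records the direction bits, and an off-diagonal rotator simply replays them, needing neither an arctangent table nor an angle register (a sketch with invented names):

```python
M = 24

def vectoring_directions(x, y):
    """Diagonal processor: compute the d_i sequence that zeros y."""
    ds = []
    for i in range(M):
        d = +1 if y < 0 else -1
        ds.append(d)
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
    return ds

def replay_rotation(x, y, ds):
    """Off-diagonal rotator: apply a received d_i sequence.
    No lookup table and no z accumulator are needed."""
    for i, d in enumerate(ds):
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
    return x, y

ds = vectoring_directions(3.0, 4.0)     # encodes -atan(4/3), one bit per step
print(replay_rotation(1.0, 0.0, ds))    # ~A_m*(cos a, sin a), a = -atan(4/3)
```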


Figure 6. IO phase (N=6)


The commands can be an instruction for an IO phase, the processing of a new iteration, or other commands that can be used in future modifications. The input and output interfaces are basically shift buffers used to transfer data and commands serially from the control unit to the systolic array, and to output data from the systolic array to the control unit. The data is shifted column by column from the input interface, through the array of processors, to the output interface (figure 6).


Figure 8. Processor operations


To implement the proposed algorithm, a Handel-C mechanism for inter-process communication called "channel" has been used (figure 9). When a process writes to a channel [9],[10], it waits until the destination process reads from it. Likewise, when a process reads from a channel, it waits until the source process writes to it. Using these channels, the implementation of the algorithm is very easy: each processor has a 2 x 2 array of channels from which it reads data written by other processors. For the singular vector sub-matrices, two additional 2 x 2 arrays can be used.
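The rendezvous behaviour of a Handel-C channel can be approximated in software; here is a minimal Python sketch for a single writer and a single reader (the class and its semantics are our emulation, not a Celoxica API):

```python
import threading, queue

class Channel:
    """Blocking rendezvous channel: write() returns only after the value
    has been taken by read(), mimicking Handel-C channel semantics.
    Assumes one writer and one reader."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._taken = threading.Event()

    def write(self, value):
        self._taken.clear()
        self._slot.put(value)
        self._taken.wait()            # block until the reader has the value

    def read(self):
        value = self._slot.get()      # block until a value is written
        self._taken.set()
        return value

ch = Channel()
threading.Thread(target=lambda: print("got", ch.read())).start()
ch.write([[1, 2], [3, 4]])            # a 2x2 sub-matrix, as in the array
```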

Figure 9. Channels communication

The computation time of one step is:

$$T = 23W + T_{sc} + T_i + 111 \qquad (15)$$

where W is the word length and T_i is the time for the matrix interchange between off-diagonal processors, equal to one if the matrix size is 4 and to two otherwise. T_sc is the latency of the scale factor correction module:

W (bits)        < 16    16-22    23-26    27
T_sc (cycles)     10       11       12    13

A diagonal processor stays idle for 8W + 40 clock cycles, whereas an off-diagonal processor stays idle for 3W + 3 clock cycles. If we define the efficiency of a processor as the fraction of a step during which it actually works, then for an off-diagonal processor:

$$\rho_{op} = \frac{T - (3W + 3)}{T} = \frac{20W + T_{sc} + T_i + 108}{23W + T_{sc} + T_i + 111} \qquad (16)$$

and similarly for a diagonal processor, $\rho_{dp} = (T - (8W + 40))/T$, where $\rho_{dp}$ and $\rho_{op}$ are respectively the efficiencies of the diagonal and off-diagonal processors. Figure 10 shows that the efficiency of the processors is almost constant, with a minimal value of 65.6 % for a diagonal processor and 88.5 % for an off-diagonal processor.

Figure 10. Efficiency of the processors

The average efficiency for the entire array is obtained by weighting ρ_dp and ρ_op by the processor areas (17), where A_dp is the area consumed by a diagonal processor and A_op is the area consumed by an off-diagonal processor. A_dp = 3·A_op, because the diagonal processors contain three CORDIC modules while the rest of the processors have just one.

Figure 11. Efficiency of the SVD array (W=24)

Figure 11 shows that the efficiency increases with the matrix size from 75 % towards its limit ρ_op. Note that if such a rigorous computation were applied to the original BLV array, its efficiency would be less than 33 %, as a processor may stay idle for some time during a working step.
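A quick evaluation of (15) and (16) (our script; T_i = 2 assumed for matrices larger than 4, T_sc taken from the legible table entries above):

```python
def tsc(W):
    # Scale factor correction latency (table in section 7.2)
    for hi, lat in ((15, 10), (22, 11), (26, 12), (27, 13)):
        if W <= hi:
            return lat
    raise ValueError("latency not tabulated")

def step_time(W, Ti=2):
    return 23 * W + tsc(W) + Ti + 111          # eq. (15)

for W in (16, 24):
    T = step_time(W)
    rho_op = (T - (3 * W + 3)) / T             # eq. (16)
    rho_dp = (T - (8 * W + 40)) / T            # analogous, 8W + 40 idle cycles
    print(W, T, round(rho_op, 3), round(rho_dp, 3))
# W=24: rho_op ~ 0.889 and rho_dp ~ 0.657, consistent with the minima above
```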

7.3 Implementation results

Tables 2 and 3 show the implementation results for an XCV2000E-6 FPGA target:

Table 2. Design statistics for a matrix size of 6 x 6

The maximum frequency fluctuates around 88 MHz, and the area varies linearly with the word length. This is due to the fact that multipliers (which have long delays and an O(W²) area complexity) have not been used (figure 12).

Figure 12. Area vs word length

Table 3. Influence of the matrix size

The area increases quickly with the matrix size (figure 13); that is because the array uses (N/2)² processors.

Figure 13. Area vs matrix size

8. Conclusion and Future work

This paper presented an improved SVD systolic array based on the Brent, Luk, Van Loan SVD array. The improved array is three times more efficient and faster than the BLV array. The adaptation of the systolic array for the computation of the singular vectors, and an efficient FPGA implementation using a high level language, have been shown. The obtained design is parametric and flexible, and can be adapted to any application. We did not find FPGA implementations of the SVD in the literature to compare with, so we believe this is the first one. Finally, we point out that the obtained results for the BLV SVD array benefit directly the symmetric eigenvalue decomposition systolic array, which is almost identical to the SVD array [1]. This will be investigated in future work.

9. References

[1] R.P. Brent and F.T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., vol. 6, no. 1, pp. 69-84, January 1985.
[2] R.P. Brent, F.T. Luk and C. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Comput. Syst., vol. 1, no. 3, pp. 242-270, 1985.
[3] B. Yang and J.F. Böhme, "Reducing the computations of the singular value decomposition array given by Brent and Luk", SIAM J. Matrix Anal. Appl., vol. 12, no. 4, pp. 713-725, October 1991.
[4] J.R. Cavallaro and F.T. Luk, "CORDIC arithmetic for an SVD processor", Journal of Parallel and Distributed Computing, vol. 5, pp. 271-290, 1988.
[5] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers", in Proc. ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, Monterey, California, United States, 1998, pp. 191-200.
[6] J. Volder, "The CORDIC computing technique", IRE Trans. Comput., Sept. 1959, pp. 330-334.
[7] J.S. Walther, "A unified algorithm for elementary functions", Proc. AFIPS Spring Joint Computer Conference, 1971, pp. 379-385.
[8] G.H. Golub and C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, London, 1996.
[9] www.Celoxica.com.
[10] Celoxica application note, "The technology behind DK1", AN 18 v1.0, 2001.
