
Improved SVD systolic array and implementation on FPGA

A. Ahmedsaid, A. Amira and A. Bouridane

School of Computer Science
The Queen's University of Belfast, United Kingdom
a.ahmedsaid@qub.ac.uk
Abstract

This paper presents an efficient systolic array for the computation of the Singular Value Decomposition (SVD). The proposed architecture is three times more efficient and faster than the Brent, Luk, Van Loan (BLV) SVD systolic array. The architecture has been implemented efficiently on FPGA using a high level language for hardware design, "Handel-C".

1. Introduction

The SVD is one of the most important matrix factorisations in linear algebra. Its applications vary from beamforming and source localization to spectral analysis, digital image processing, Principal Component Analysis (PCA), Latent Semantic Analysis (LSA), etc. [3].

The SVD of a matrix $A \in \mathbb{R}^{m \times n}$ is a factorisation of the form:

$$A = U D V^T \qquad (1)$$

where $U \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal ($U^T U = I$ and $V^T V = I$) and $D \in \mathbb{R}^{n \times n}$ is diagonal with nonnegative diagonal elements $\sigma_i$. The numbers $\sigma_i$ are the 'singular values' of A, $U = \{u_1, \ldots, u_n\}$ are the 'left singular vectors' and $V = \{v_1, \ldots, v_n\}$ are the 'right singular vectors'.

There are many numerically stable algorithms for computing the SVD, such as the Jacobi algorithm, the QR method and the one sided Hestenes method [3],[7]. For parallel implementations, the Jacobi method is far superior in terms of simplicity, regularity and local communications. Brent, Luk and Van Loan [2] have shown how the parallel Jacobi algorithm can compute the SVD of a square N x N matrix in O(N log N) time. This is to be compared with the best serial algorithms, which have an O(N³) time complexity.

The rest of the paper is organized as follows. A description of the Jacobi algorithm is given in section 2. The Brent, Luk, Van Loan (BLV) systolic array is presented in section 3. Section 4 shows the main idea behind the improvement of the BLV array and section 5 the adaptation of the array for the computation of the singular vectors. The implementation details and analysis are given in sections 6 and 7. Conclusion and future work are given in section 8.

2. The Jacobi SVD algorithm

Equation (1) can be rewritten in the following form:

$$A = U D V^T \iff U^T A V = D \qquad (2)$$

The Jacobi method exploits (2) to generate the matrices U and V by performing a sequence of orthogonal two sided plane rotations (2D rotations) on the input matrix:

$$A_{i+1} = (J_i^l)^T A_i \, J_i^r \qquad (3)$$

with the property that each new matrix $A_{i+1}$ is 'more diagonal' than its predecessor. After n iterations, the input matrix A is transformed into the diagonal matrix $A_n$:

$$A_n = (J_n^l)^T (J_{n-1}^l)^T \cdots (J_1^l)^T \, A \, J_1^r \cdots J_{n-1}^r J_n^r \qquad (4)$$

which leads to:

$$D = A_n; \qquad V = \prod_i J_i^r; \qquad U = \prod_i J_i^l \qquad (5)$$

The matrices $J_i^l$ and $J_i^r$ are 'Jacobi rotation' matrices $J(p,q,\theta)$: the identity matrix in which the four entries at the intersections of rows and columns p and q are replaced by a 2D rotation,

$$J_{pp} = c, \quad J_{pq} = -s, \quad J_{qp} = s, \quad J_{qq} = c$$

(where $p < q$, $c = \cos\theta$ and $s = \sin\theta$).

The angles $\theta$ are chosen in order to solve a 2 x 2 SVD problem:

$$\begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix}
= \begin{bmatrix} d_{pp} & 0 \\ 0 & d_{qq} \end{bmatrix} \qquad (6)$$

By repeating this for all possible pairs (p,q), A can be effectively diagonalised:

Jacobi SVD algorithm
For s = 1, ..., S
  For p = 1, ..., N-1
    For q = p+1, ..., N
    Begin
      Determine θl, θr (or cos/sin);
      A := J(p,q,θl)^T · A · J(p,q,θr)
    End
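To make the procedure concrete, here is a minimal NumPy sketch of the serial algorithm (our code, not the paper's implementation; the closed-form angle expressions correspond to the TPR relations of section 6.2):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def angles_2x2(a):
    # Closed-form solution of the 2x2 SVD problem (6): theta_r - theta_l
    # and theta_r + theta_l diagonalise the rotation-like and
    # reflection-like parts of the sub-matrix (cf. section 6.2).
    diff = -np.arctan2((a[1, 0] - a[0, 1]) / 2, (a[0, 0] + a[1, 1]) / 2)
    summ = np.arctan2((a[1, 0] + a[0, 1]) / 2, (a[0, 0] - a[1, 1]) / 2)
    return (summ - diff) / 2, (summ + diff) / 2    # theta_l, theta_r

def jacobi_svd(A, sweeps=10):
    N = A.shape[0]
    U, V = np.eye(N), np.eye(N)
    for _ in range(sweeps):                        # loop s
        for p in range(N - 1):                     # loop p
            for q in range(p + 1, N):              # loop q
                tl, tr = angles_2x2(A[np.ix_([p, q], [p, q])])
                Jl, Jr = np.eye(N), np.eye(N)
                Jl[np.ix_([p, q], [p, q])] = rot(tl)
                Jr[np.ix_([p, q], [p, q])] = rot(tr)
                A = Jl.T @ A @ Jr                  # eq. (3)
                U, V = U @ Jl, V @ Jr              # accumulate eq. (5)
    return U, A, V

A = np.random.randn(4, 4)
U, D, V = jacobi_svd(A.copy())
print(np.allclose(U @ D @ V.T, A))                 # True: A = U D V^T
print(np.sort(np.abs(np.diag(D))))                 # singular values (up to sign)
```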

The Jacobi algorithm performs a 2 x 2 SVD for all possible pairs (p,q), which is called a sweep, and then repeats that for as many sweeps as necessary for the convergence (loop s).

As mentioned earlier, the multiplication by $J^r$ affects only the two columns p and q, and the multiplication by $(J^l)^T$ affects only the two rows p and q. Therefore if N is the matrix size and if N is even, then N/2 sub-problems (2 x 2 SVDs) can be processed in parallel. To illustrate this, suppose N = 4 and group the six possible sub-problems into three sets:

Set 1: {(1,2),(3,4)}
Set 2: {(1,3),(2,4)}
Set 3: {(1,4),(2,3)}

All (p,q) pairs within each set are non-conflicting: sub-problems (1,2) and (3,4) can be carried out in parallel, likewise sub-problems (1,3) and (2,4), as can sub-problems (1,4) and (2,3). This way of ordering sub-problems is called parallel ordering; it allows the execution of N/2 sub-problems in parallel. The number of sets is N-1, and a sweep consists of N-1 steps of N/2 rotations executed in parallel.
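A convenient way to generate such a schedule is round-robin (tournament) ordering; the sketch below is our construction and is not necessarily the exact ordering realised by the BLV interchange scheme:

```python
def parallel_ordering(N):
    """Yield N-1 sets of N/2 non-conflicting (p, q) pairs covering all pairs."""
    idx = list(range(1, N + 1))              # 1-based indices, N even
    for _ in range(N - 1):
        yield sorted(tuple(sorted((idx[i], idx[N - 1 - i])))
                     for i in range(N // 2))
        idx.insert(1, idx.pop())             # rotate all indices but the first

for s in parallel_ordering(4):
    print(s)
# [(1, 4), (2, 3)]
# [(1, 3), (2, 4)]
# [(1, 2), (3, 4)]   -- the same three sets as above, in a different order
```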

Figure 1. BLV array, N = 8. (Horizontal arrows: transmission of left rotation parameters; vertical arrows: transmission of right rotation parameters; diagonal arrows: data interchange)

3. The BLV array

The BLV array is a square systolic array that implements a parallel Jacobi SVD algorithm using (N/2)² processors. It performs N/2 sub-problems in parallel, and a sweep in N-1 steps. As the number of sweeps needed for the SVD of an N x N matrix is O(log N) [2], this systolic array is capable of processing the SVD of a square matrix in O(N log N) time. Each processor holds a 2 x 2 sub-matrix of A to which it applies two sided plane rotations. The plane rotations are computed by the diagonal processors, and sent to the off-diagonal processors in the same row and in the same column (figure 1).

Initially a processor Pij holds a 2 x 2 sub-matrix of A. Each step consists of the solution of a 2 x 2 SVD problem (6), which is done by the diagonal processors annihilating their two off-diagonal elements, followed by the application of the computed rotations to the matrix A (3), which is executed by each processor applying two sided rotations to its 2 x 2 sub-matrix (9). After the application of the rotations, the local matrices are interchanged between processors (figures 1, 2) for the execution of a new step. The interchange algorithm ensures the implementation of the parallel ordering by putting data in the appropriate processor according to the new set of (p,q) sub-problem indexes.

Figure 2. Processors data input and output lines


To avoid broadcasting the rotation parameters (θl, θr or cos/sin) in constant time along processor rows and columns, the rotation parameters are transmitted at constant speed between adjacent processors. Let Δij = |i - j| denote the distance of processor Pij from the diagonal. The operation of processor Pij will be delayed by Δij time units relative to the operation of the diagonal processors, in order to allow time for the rotation parameters to be propagated at unit speed along each row and column of the processor array. A processor cannot commence a rotation until data from earlier rotations are available on all its lines. Thus, processor Pij needs data from its four neighbours P(i±1, j±1) (1 ≤ i, j ≤ n/2). Since |Δij - Δ(i±1, j±1)| ≤ 2, it is sufficient for Pij to be idle for two time steps while waiting for the processors P(i±1, j±1) to complete their (possibly delayed) steps. Thus the price paid to avoid broadcasting is that each processor is active for only one third of the total computation (figure 3) [2]. In the next section we show a new method of synchronizing processor operations to increase the efficiency of this systolic array.

4. Improvement of the efficiency of the BLV array

The authors in [2] suggested that the efficiency could be increased to almost unity if multiple problems were interleaved or if the rotation parameters were propagated at greater than unit speed. What is proposed here is to synchronize the processors according to the availability of new data instead of a time step synchronization strategy (algorithm processor [2]):

Algorithm processor
If T ≥ Δ and T - Δ ≡ 0 (mod 3) then
Begin
  If T ≠ Δ then read new 2 x 2 matrix
  If Δ = 0 then {diagonal processor}
    Solve 2 x 2 SVD
    Apply rotations to sub-matrix
  Else
    Read new rotation parameters
    Apply rotations to sub-matrix
    Output rotation parameters
  If i > j then set out_β
  If i < j then set out_γ
End
Else if T ≥ Δ and T - Δ ≡ 1 (mod 3) then
Begin
  If i = 1 or j = 1 then set out_α
  If i = n/2 or j = n/2 then set out_δ
End
Else if T ≥ Δ and T - Δ ≡ 2 (mod 3) then
Begin
  If i > 1 and j > 1 then set out_α
  If i < n/2 or j < n/2 then set out_δ
  If i ≤ j then set out_β
  If i ≥ j then set out_γ
End

In this new algorithm a processor starts working as soon as new data appears on its input lines (figures 4, 5). The operations are data driven rather than step synchronized [2] (figure 3):

New algorithm processor
If i = j then {diagonal processor}
  Solve 2 x 2 SVD
  Output rotation parameters
  Apply rotations to sub-matrix
  Output data (2 x 2 sub-matrix)
  Wait for new data
Else {off-diagonal processor}
  Wait for new rotation parameters
  Output rotation parameters
  Apply rotations to sub-matrix
  Output data (2 x 2 sub-matrix)
  Wait for new data
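The one-third duty cycle of the step synchronized schedule can be checked numerically from the mod-3 condition above; a small sketch (our code, with n/2 = 4, i.e. N = 8):

```python
N2 = 4          # n/2: the processor grid is N2 x N2 (N = 8)
STEPS = 30      # simulated time steps

busy = 0
for T in range(STEPS):
    for i in range(1, N2 + 1):
        for j in range(1, N2 + 1):
            delta = abs(i - j)
            # A processor performs a rotation only in the first of every
            # three time steps once its data has arrived (T >= delta).
            if T >= delta and (T - delta) % 3 == 0:
                busy += 1

print(busy / (STEPS * N2 * N2))   # ~0.33: each processor works 1/3 of the time
```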

Figure 3. How the computations are staggered to avoid broadcasting. The value inside each box indicates the iteration number.

Table 1. Comparison between the BLV array and the proposed array:

                                        BLV array                       Proposed array                Improvement
Computation time of one step            (illegible)                     (illegible)                   About the same
Efficiency                              1/3                             ~1                            3 times better
Computation time of the decomposition   CPT = (3S(N-1) + (N/2-1) + 3)T  CPT = S(N-1)T + T_d(N/2-1)    3 times faster
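A quick numerical check of the decomposition-time rows of Table 1, with T and T_d set to hypothetical unit values:

```python
def cpt_blv(S, N, T=1.0):
    # Table 1: CPT = (3S(N-1) + (N/2-1) + 3) * T
    return (3 * S * (N - 1) + (N / 2 - 1) + 3) * T

def cpt_proposed(S, N, T=1.0, Td=1.0):
    # Table 1: CPT = S(N-1)T + Td(N/2-1)
    return S * (N - 1) * T + Td * (N / 2 - 1)

for N in (8, 16, 32, 64):
    S = 10                       # number of sweeps, O(log N)
    ratio = cpt_blv(S, N) / cpt_proposed(S, N)
    print(N, round(ratio, 2))    # tends to 3 as N grows: '3 times faster'
```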

5. Computation of the singular vectors

Equation (7) suggests that in order to implement the algorithm correctly, the interchange algorithm must be applied to V and U^T. For this reason, it is the matrix U^T which is accumulated rather than U:

$$U^T \leftarrow J(p,q,\theta_l)^T \, U^T \qquad (8)$$

The application of the rotations to the housed sub-matrices is described below. Each processor applies the two sided rotation to its 2 x 2 sub-matrix of A:

$$\begin{bmatrix} a'_{pp} & a'_{pq} \\ a'_{qp} & a'_{qq} \end{bmatrix}
= \begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix} \qquad (9)$$

and the corresponding one sided rotations to its sub-matrices of U^T and V:

$$\begin{bmatrix} u'_{pp} & u'_{pq} \\ u'_{qp} & u'_{qq} \end{bmatrix}
= \begin{bmatrix} \cos\theta_l & -\sin\theta_l \\ \sin\theta_l & \cos\theta_l \end{bmatrix}^T
\begin{bmatrix} u_{pp} & u_{pq} \\ u_{qp} & u_{qq} \end{bmatrix} \qquad (10)$$

$$\begin{bmatrix} v'_{pp} & v'_{pq} \\ v'_{qp} & v'_{qq} \end{bmatrix}
= \begin{bmatrix} v_{pp} & v_{pq} \\ v_{qp} & v_{qq} \end{bmatrix}
\begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix} \qquad (11)$$
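A compact NumPy rendering of (9), (10) and (11) for one processor's locally housed 2 x 2 blocks (the function and variable names are ours):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def update_blocks(a, u_t, v, theta_l, theta_r):
    """Apply eq. (9)-(11) to one processor's 2x2 sub-matrices."""
    a_new  = rot(theta_l).T @ a @ rot(theta_r)   # (9)  two sided rotation of A
    ut_new = rot(theta_l).T @ u_t                # (10) left rotation of U^T
    v_new  = v @ rot(theta_r)                    # (11) right rotation of V
    return a_new, ut_new, v_new
```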

6. Hardware implementation of plane rotations

6.1. The CORDIC algorithm

Research advances have encouraged the use of special-purpose arithmetic techniques, which map the algorithm more effectively into hardware. One of the most popular is the Coordinate Rotation Digital Computer (CORDIC) invented by J. Volder in 1959 [6]. Initially Volder developed the CORDIC algorithm for computing the rotation of a vector in a Cartesian coordinate system and evaluating the length and angle of a vector. The CORDIC method was later expanded for multiplication, division, logarithm, exponential and hyperbolic functions. The various function computations were summarized into a unified technique in [7].

CORDIC iteration
For i = 0, ..., m-1
Begin
  x_{i+1} = x_i - d_i · y_i · 2^{-i}
  y_{i+1} = y_i + d_i · x_i · 2^{-i}
  z_{i+1} = z_i - d_i · tan^{-1}(2^{-i})
End

d_i represents the rotation direction and is chosen according to the CORDIC mode. In the 'rotation mode', the input vector is rotated by an angle z_0 = θ and d_i = -1 if z_i < 0, +1 otherwise. In the 'vectoring mode', the input vector is rotated in order to minimize its y component by choosing d_i = +1 if y_i < 0, -1 otherwise, which leads to z_m = z_0 + θ = z_0 + tan^{-1}(y_0/x_0).

The resulting x_m and y_m are scaled by a factor

$$A_m = \prod_{i=0}^{m-1} \sqrt{1 + 2^{-2i}}$$

which must be corrected if necessary. In our implementation we used the method described in [3], with the additional constraint of keeping the Volder sequence (i = 0, ..., m-1) for simplicity and parameterisation reasons. This method consists of correcting the scale factor A_m by a sequence of n shift-add operations:

$$A_m \prod_{j=1}^{n} \left(1 + s(j)\, 2^{-T(j)}\right) = 1 + \Delta A_m \qquad (12)$$

where the T(j) are integers, s(j) = ±1 and ΔA_m is the remaining error. These scaling iterations are parameterised by the set of signed integers {s(1)T(1), s(2)T(2), ..., s(n)T(n)}. Using a computer program we generated the sequence of shift-add operations for a given precision: {1, +2, -5, +9, +10, +16, -23, +27, -28, +31, -34, +35, -39, +41, -45}. For example, if m = 20 the sequence is {1, +2, -5, +9, +10, +16}; if m = 24 the sequence is {1, +2, -5, +9, +10, +16, -23} and ΔA_m < 2^{-21}.
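A floating point model of both CORDIC modes together with the shift-add scale factor correction (our code; we read the leading '1' of the sequence as the factor (1 - 2^-1), which reproduces 1/A_m to the stated precision):

```python
import math

def cordic(x, y, z, m, vectoring=False):
    """Volder CORDIC; returns (x_m, y_m, z_m), with x_m, y_m scaled by A_m."""
    for i in range(m):
        if vectoring:
            d = +1 if y < 0 else -1        # drive y towards 0
        else:
            d = -1 if z < 0 else +1        # drive z towards 0
        x, y, z = (x - d * y * 2.0**-i,
                   y + d * x * 2.0**-i,
                   z - d * math.atan(2.0**-i))
    return x, y, z

M = 24
SEQ = [-1, +2, -5, +9, +10, +16, -23]      # s(j)T(j) for m = 24 (section 6.1)

def correct_gain(v):
    # Multiply by ~1/A_m with the factors (1 + s(j) * 2^-T(j)); in hardware
    # v * (1 +/- 2^-T) costs one shift and one add.
    for t in SEQ:
        v *= 1.0 + math.copysign(2.0**-abs(t), t)
    return v

# Rotation mode: rotate (1, 0) by 30 degrees.
x, y, _ = cordic(1.0, 0.0, math.radians(30.0), M)
print(correct_gain(x), correct_gain(y))    # ~cos(30deg), ~sin(30deg)

# Vectoring mode: length and angle of the vector (3, 4).
x, y, z = cordic(3.0, 4.0, 0.0, M, vectoring=True)
print(correct_gain(x), z)                  # ~5.0, ~atan(4/3)
```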

6.2. The TPR method

The diagonalization of a 2 x 2 matrix can be done by different methods, as shown in [2], but in [3] B. Yang and J.F. Böhme have presented a method called TPR (Two Plane Rotations) that solves the 2 x 2 SVD problem and rotates the input matrix at the same time with a reduced number of computations.

Any 2 x 2 matrix $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ can be decomposed into the sum of two matrices such that:

$$A = A_1 + A_2 = \begin{bmatrix} p_1 & -q_1 \\ q_1 & p_1 \end{bmatrix} + \begin{bmatrix} p_2 & q_2 \\ q_2 & -p_2 \end{bmatrix} \qquad (13)$$

where $p_1 = (a_{11}+a_{22})/2$, $q_1 = (a_{21}-a_{12})/2$, $p_2 = (a_{11}-a_{22})/2$ and $q_2 = (a_{21}+a_{12})/2$. The matrix A is then rotated into B as follows:

$$B = R(\theta_l)^T A \, R(\theta_r) = \begin{bmatrix} r_1 & -t_1 \\ t_1 & r_1 \end{bmatrix} + \begin{bmatrix} r_2 & t_2 \\ t_2 & -r_2 \end{bmatrix} \qquad (14)$$

where $(r_1, t_1)$ is obtained from $(p_1, q_1)$ by a plane rotation through $\theta_r - \theta_l$, and $(r_2, t_2)$ from $(p_2, q_2)$ by a plane rotation through $\theta_r + \theta_l$.

This means a two sided rotation can be performed by only two plane rotations (one computes $r_1$ and $t_1$ and another one computes $r_2$ and $t_2$), and the solution of a 2 x 2 SVD problem together with the computation of the diagonalised matrix can be done by only two plane rotations (one zeros $t_1$ and another one zeros $t_2$).
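A numerical rendering of the TPR solution of the 2 x 2 SVD problem (our variable names; the sign conventions follow the reconstruction of (13)-(14) above):

```python
import numpy as np

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def tpr_angles(A):
    """Solve the 2x2 SVD problem with Two Plane Rotations [3]."""
    p1 = (A[0, 0] + A[1, 1]) / 2; q1 = (A[1, 0] - A[0, 1]) / 2
    p2 = (A[0, 0] - A[1, 1]) / 2; q2 = (A[1, 0] + A[0, 1]) / 2
    diff = -np.arctan2(q1, p1)     # theta_r - theta_l: zeros t1
    summ = np.arctan2(q2, p2)      # theta_r + theta_l: zeros t2
    return (summ - diff) / 2, (summ + diff) / 2   # theta_l, theta_r

A = np.array([[1.0, 2.0], [3.0, 4.0]])
tl, tr = tpr_angles(A)
print(np.round(rot(tl).T @ A @ rot(tr), 6))
# diagonal matrix; |entries| = singular values of A (~5.46499, ~0.36597)
```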


7. Implementation of the improved array

The implemented systolic array computes both the singular values and the singular vectors. A high level language for hardware design, 'Handel-C' [9], has been used for the implementation. The design is completely parametric: the matrix size and the word length can be easily changed, and if the singular vectors are not wanted, or only the right or the left singular vectors are wanted, the design can be modified with minor effort.

7.1 Design environment

The development platform used is the Celoxica DK1, a Handel-C compiler-debugger, with its accompanying RC1000 board. The RC1000 board is a PCI card equipped with a Xilinx XCV2000E-6 Virtex-E FPGA chip. It has 8 MBytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks, all accessible by the FPGA and any device on the PCI bus. The RC1000 board is supported with a library that simplifies the FPGA configuration files and the data transfer between the host PC and the RC1000 board [9].

7.2 SVD chip

The design comprises a "Control Unit", an "input interface", an "output interface" and the "SVD array" (figure 7). The control unit generates commands to the processors and handles all the IO operations between the FPGA and the off-chip memory, and eventually the host PC. The data matrices are read/written from/to the off-chip memory element by element, which makes the IO requirement independent from the matrix size.


Figure 7. FPGA implementation of the SVD array (N=6)

Processors structure. The diagonalisation is performed by CORDIC modules working in the vectoring mode, and the matrix rotations are performed by CORDIC modules working in the rotation mode (details of CORDIC arithmetic can be found in [3],[4],[5]). Because of area limitations we did not take advantage of the full parallelism; the rotation operations are therefore performed using only one CORDIC rotator in each processor. In addition, the diagonal processors contain two CORDIC modules used for a parallel computation of the angles θl and θr using the TPR method (14). The parallel computation of the rotation angles improves the efficiency of the architecture, as the off-diagonal processors, which represent a large percentage of the area, stay idle during this step. In order to reduce the area, the CORDIC rotation sequences (d_i) corresponding to θl and θr are extracted inside the diagonal processors and sent serially to the off-diagonal processors. By doing this, all lookup tables are removed from the CORDIC rotator modules, as well as the registers and adders/subtractors used for angle accumulation. For each processor, the scale factor correction is performed by a single pipelined module. The processors' operation sequence is summarized in figure 8.
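This d_i forwarding scheme can be mimicked in software: the diagonal processor runs a vectoring-mode CORDIC and records the direction bits, and an off-diagonal rotator simply replays them, needing neither an arctangent table nor an angle register (a sketch with invented names):

```python
M = 24

def vectoring_directions(x, y):
    """Diagonal processor: compute the d_i sequence that zeros y."""
    ds = []
    for i in range(M):
        d = +1 if y < 0 else -1
        ds.append(d)
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
    return ds

def replay_rotation(x, y, ds):
    """Off-diagonal rotator: apply a received d_i sequence.
    No lookup table and no z accumulator are needed."""
    for i, d in enumerate(ds):
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
    return x, y

ds = vectoring_directions(3.0, 4.0)     # encodes -atan(4/3), one bit per step
print(replay_rotation(1.0, 0.0, ds))    # ~A_m*(cos a, sin a), a = -atan(4/3)
```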


Figure 6. IO phase (N=6)


The commands can be an instruction for an IO phase, the processing of a new iteration, or other commands that can be used in future modifications. The input and output interfaces are basically shift buffers used to transfer data and commands serially from the control unit to the systolic array, and to output data from the systolic array to the control unit. The data is shifted column by column from the input interface, through the array of processors, to the output interface (figure 6).


Figure 8. Processor operations


To implement the proposed algorithm, a Handel-C mechanism for inter-process communication called "channel" has been used (figure 9). When a process writes to a channel [9],[10], it waits until the destination process reads from it. Likewise, when a process reads from a channel, it waits until the source process writes to it. Using these channels, the implementation of the algorithm is very easy: each processor has a 2 x 2 array of channels from which it reads data written by other processors. For the singular vector sub-matrices, two additional 2 x 2 arrays can be used.
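The rendezvous behaviour of a Handel-C channel can be approximated in software; here is a minimal Python sketch for a single writer and a single reader (the class and its semantics are our emulation, not a Celoxica API):

```python
import threading, queue

class Channel:
    """Blocking rendezvous channel: write() returns only after the value
    has been taken by read(), mimicking Handel-C channel semantics.
    Assumes one writer and one reader."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._taken = threading.Event()

    def write(self, value):
        self._taken.clear()
        self._slot.put(value)
        self._taken.wait()            # block until the reader has the value

    def read(self):
        value = self._slot.get()      # block until a value is written
        self._taken.set()
        return value

ch = Channel()
threading.Thread(target=lambda: print("got", ch.read())).start()
ch.write([[1, 2], [3, 4]])            # a 2x2 sub-matrix, as in the array
```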

Figure 9. Channels communication

The computation time of one step is:

$$T = 23W + T_{sc} + T_i + 111 \qquad (15)$$

where W is the word length and T_i is the time for the matrix interchange between off-diagonal processors, equal to one if the matrix size is 4 and to two otherwise. T_sc is the latency of the scale factor correction module:

W (bits)        < 16    16-22    23-26    27
T_sc (cycles)     10       11       12    13

A diagonal processor stays idle for 8W + 40 clock cycles, whereas an off-diagonal processor stays idle for 3W + 3 clock cycles. If we define the efficiency of a processor as the fraction of a step during which it actually works, then for an off-diagonal processor:

$$\rho_{op} = \frac{T - (3W + 3)}{T} = \frac{20W + T_{sc} + T_i + 108}{23W + T_{sc} + T_i + 111} \qquad (16)$$

and similarly for a diagonal processor, $\rho_{dp} = (T - (8W + 40))/T$, where $\rho_{dp}$ and $\rho_{op}$ are respectively the efficiencies of the diagonal and off-diagonal processors. Figure 10 shows that the efficiency of the processors is almost constant, with a minimal value of 65.6 % for a diagonal processor and 88.5 % for an off-diagonal processor.

Figure 10. Efficiency of the processors

The average efficiency for the entire array is obtained by weighting ρ_dp and ρ_op by the processor areas (17), where A_dp is the area consumed by a diagonal processor and A_op is the area consumed by an off-diagonal processor. A_dp = 3·A_op, because the diagonal processors contain three CORDIC modules while the rest of the processors have just one.

Figure 11. Efficiency of the SVD array (W=24)

Figure 11 shows that the efficiency increases with the matrix size from 75 % towards its limit ρ_op. Note that if such a rigorous computation were applied to the original BLV array, its efficiency would be less than 33 %, as a processor may stay idle for some time during a working step.
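A quick evaluation of (15) and (16) (our script; T_i = 2 assumed for matrices larger than 4, T_sc taken from the legible table entries above):

```python
def tsc(W):
    # Scale factor correction latency (table in section 7.2)
    for hi, lat in ((15, 10), (22, 11), (26, 12), (27, 13)):
        if W <= hi:
            return lat
    raise ValueError("latency not tabulated")

def step_time(W, Ti=2):
    return 23 * W + tsc(W) + Ti + 111          # eq. (15)

for W in (16, 24):
    T = step_time(W)
    rho_op = (T - (3 * W + 3)) / T             # eq. (16)
    rho_dp = (T - (8 * W + 40)) / T            # analogous, 8W + 40 idle cycles
    print(W, T, round(rho_op, 3), round(rho_dp, 3))
# W=24: rho_op ~ 0.889 and rho_dp ~ 0.657, consistent with the minima above
```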

7.3 Implementation results

Tables 2 and 3 show the implementation results for an XCV2000E-6 FPGA target:

Table 2. Design statistics for a matrix size of 6 x 6

The maximum frequency fluctuates around 88 MHz, and the area varies linearly with the word length. This is due to the fact that multipliers (which have long delays and an O(W²) area complexity) have not been used (figure 12).

Figure 12. Area vs word length

Table 3. Influence of the matrix size

The area increases quickly with the matrix size (figure 13); that is because the array uses (N/2)² processors.

Figure 13. Area vs matrix size

8. Conclusion and Future work

This paper presented an improved SVD systolic array based on the Brent, Luk, Van Loan SVD array. The improved array is three times more efficient and faster than the BLV array. The adaptation of the systolic array for the computation of the singular vectors, and an efficient FPGA implementation using a high level language, have been shown. The obtained design is parametric and flexible, and can be adapted to any application. We did not find FPGA implementations of the SVD in the literature to compare with, so we believe this is the first one. Finally, we point out that the obtained results for the BLV SVD array benefit directly the symmetric eigenvalue decomposition systolic array, which is almost identical to the SVD array [1]. This will be investigated in future work.

9. References

[1] R.P. Brent and F.T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., vol. 6, no. 1, pp. 69-84, January 1985.
[2] R.P. Brent, F.T. Luk and C. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Comput. Syst., vol. 1, no. 3, pp. 242-270, 1985.
[3] B. Yang and J.F. Böhme, "Reducing the computations of the singular value decomposition array given by Brent and Luk", SIAM J. Matrix Anal. Appl., vol. 12, no. 4, pp. 713-725, October 1991.
[4] J.R. Cavallaro and F.T. Luk, "CORDIC arithmetic for an SVD processor", Journal of Parallel and Distributed Computing, vol. 5, pp. 271-290, 1988.
[5] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers", in Proc. ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, Monterey, California, United States, 1998, pp. 191-200.
[6] J. Volder, "The CORDIC computing technique", IRE Trans. Comput., Sept. 1959, pp. 330-334.
[7] J.S. Walther, "A unified algorithm for elementary functions", Proc. AFIPS Spring Joint Computer Conference, 1971, pp. 379-385.
[8] G.H. Golub and C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, London, 1996.
[9] www.Celoxica.com.
[10] Celoxica application note, "The technology behind DK1", AN 18 v1.0, 2001.
