
Accelerating batched 1D-FFT with a CUDA-capable computer
Calling CUDA library functions from a Java environment

R. de Beer and D. van Ormondt
Dept. of Applied Physics, Univ. of Technology Delft, NL
E-mail: r.debeer@tudelft.nl

F. Di Cesare and D. Graveron-Demilly
CREATIS-LRMN, Univ Lyon 1, CNRS UMR 5220, INSERM U630, INSA Lyon, FR
E-mail: Danielle.Graveron@univ-lyon1.fr

D.A. Karras
Chalkis Inst. of Technology, Dept. Automation, Hellas, GR
E-mail: dakarras@ieee.org

Z. Starcuk
Dept. of Magnetic Resonance and Bioinformatics, Inst. of Scientific Instruments of the ASCR, Brno, CR
E-mail: starcuk@ISIBrno.Cz

Index Terms—Batched 1D-FFT, CUDA-enabled GPU, CUFFT library, Java bindings, home-assembled PC, jMRUI software package, exhaustive search in MRS

I. INTRODUCTION

This work concerns the application of CUDA-based software (Compute Unified Device Architecture), developed by NVIDIA for programmable Graphics Processing Units (GPUs) [1]. CUDA code is written in 'C for CUDA', i.e. the standard C programming language with NVIDIA extensions. The advantage of using CUDA is that numerical computations, traditionally handled by Central Processing Units (CPUs), can be accelerated by CUDA-enabled GPU devices, particularly if the numerical problems at hand are suited to parallel computing.

A. Our goal

Our goal was to find out whether batched (multiple) one-dimensional Fast Fourier Transformation (1D-FFT), often encountered in various fields of signal processing, can be sped up significantly by exploiting the parallel-processing power of a low-cost, standard, CUDA-enabled graphics card in a home-assembled PC. Batched 1D-FFT is of particular interest to us for the following reasons:

• It is applied extensively in the Java Magnetic Resonance User Interface (jMRUI) software package of our 'FAST' European Union project [2].
• In our recent work on handling unknown Magnetic Resonance Spectroscopy (MRS) lineshapes, batched 1D-FFT plays an essential role when applying exhaustive search in a semi-parametric, two-criterion, NLLS fit of the MRS parameters [3].

Since the GUI of our jMRUI software package has been written in Java, we want to call the relevant CUDA library functions from a Java environment. To that end the Java bindings for CUDA approach of jcuda.org [4], based on the Java Native Interface (JNI), was employed. Ultimately, we want to embed the Java-bindings-based calls to CUDA into a plug-in for the jMRUI platform [5].

Figure 1. NVIDIA GeForce 9600 GT-based graphics card (CUDA-enabled) in a home-assembled desktop PC.

II. METHODS

The methods applied can be divided into hardware- and software-based. The hardware concerns the home-made assemblage of a low-cost CUDA-capable desktop PC; its essential parts are mentioned in II-A. The software part, described in II-B, concerns the installation of the CUDA software [1] and, in turn, of the Java bindings for CUDA software package [4].

A. Hardware

In order to get an impression of the cost and time involved in building a CUDA-capable computer, we have assembled a low-cost desktop PC.


The parts chosen reflect the PC state of the art of about 1½ years ago. We like to mention the following essential parts:

• An ASUS P5Q PRO Turbo motherboard with an Intel Core 2 Duo E8400 CPU and 4 GB RAM.
• An ASUS EN9600GT/DI/512MD3 graphics card with an NVIDIA GeForce 9600 GT chipset (CUDA-enabled GPU) and 512 MB DDR3 memory (see Figure 1).

B. Software

1) Operating system: The home-assembled desktop PC is equipped with the Ubuntu 9.10 (32-bit version) Linux operating system. It was installed from a running live CD, which gave the opportunity of first checking the hardware. The basic installation took about one hour (including a software update applied by the Update Manager).

2) Installing a certified Linux NVIDIA driver: It is important for the performance of a CUDA-capable Linux system to have a recent version of the Linux NVIDIA driver installed. We have chosen to work with the latest certified NVIDIA driver, which at the time of this study was version 190.42.

3) Installing the CUDA software: The CUDA Development Tools for Linux 32-bit operating systems that we have used had release number 2.3. They comprise two parts, called the CUDA Toolkit and the CUDA SDK respectively. The Toolkit contains the tools to build and compile CUDA applications; the SDK contains code samples. Their installer files (shell scripts) can be downloaded from the CUDA website [1].

The CUDA Development Tools should be installed by running their installer shell scripts. Before being able to compile and execute CUDA applications, one should take care of the required additions to the environment variables PATH and LD_LIBRARY_PATH.
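Assuming the default installation location used by the Getting Started manual [6] (/usr/local/cuda; the actual paths may differ per installation), these additions amount to lines of the following kind in one's shell startup file:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH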
4) Installing the Java bindings for CUDA software: In order to provide access to the CUDA software from a Java environment, we have downloaded and installed the Java bindings for CUDA software package (JCuda). Its download archive file contains all JAR files and library SOs required for 32-bit Linux [4].

An important aspect of installing the JCuda software is to provide the correct Java CLASSPATH to the JCuda JAR files. We found the easiest approach to be the -classpath option when calling the Java compiler (javac) and the Java application launcher (java).
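A typical invocation (with hypothetical archive names; the actual JAR names depend on the downloaded JCuda release) then looks like:

javac -classpath .:jcuda.jar:jcufft.jar JCufftSample.java
java -classpath .:jcuda.jar:jcufft.jar JCufftSample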
III. RESULTS

A. Assembling the CUDA-capable desktop PC

The cost of the CUDA-capable desktop PC (monitor excluded) amounted to about €600. Its total (summed-up) assemblage time was of the order of a few days.

B. Running the CUDA SDK test programs

In order to verify the CUDA installation, one should run the CUDA SDK test programs deviceQuery and bandwidthTest.

Figure 2. Valid result for the CUDA deviceQuery test program.

The result of the deviceQuery test on our system is shown in Figure 2. Comparison with the result in the CUDA Getting Started manual [6] shows that it is a valid result for the NVIDIA GeForce 9600 GT device. The output also shows the more limited capabilities of the GeForce 9600 GT with respect to more recent NVIDIA GeForce devices.

C. Performing CUDA-based batched 1D-FFT within a Java environment

In order to test the computation time of CUDA-based batched (multiple) 1D-FFT when called from a Java program, we took as a starting point the JCufftSample.java code from the jcuda.org website [4]. In that sample Java code a CUFFT-library [7] based complex-to-complex forward 1D-FFT is executed and compared with the result obtained from the Java JTransforms [8] approach.

Since in the original JCufftSample.java program the batch input parameter for the CUFFT 1D-FFT initialization (function cufftPlan1d) is set to 1, we could not test the opportunity of performing batched 1D-FFT in a parallel fashion [7]. We therefore have extended the code, particularly by introducing a series (batch) of 1D input signals for the 1D-FFT (for JTransforms as well as for CUFFT) and by introducing timing code. The essential (generalized) CUDA part of the extended source code can be seen in Figure 3. Note that we have included timing for cufftPlan1d as well as for the actual 1D-FFT execution (function cufftExecC2C).
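The JTransforms part of the extended code is only indicated in the Figure 3 listing below ('..for looping over JTransforms...'). As a minimal sketch, assuming the 2009-era JTransforms package name (edu.emory.mathcs.jtransforms.fft) and its in-place complexForward method, the CPU-based reference loop could look like:

import edu.emory.mathcs.jtransforms.fft.FloatFFT_1D;

public class JTransformsBatchSketch
{
    public static void main(String[] args)
    {
        int batch = 1024;
        int size = 1024;
        // interleaved complex input: data[i] = {re0, im0, re1, im1, ...}
        float data[][] = new float[batch][size*2];

        FloatFFT_1D fft = new FloatFFT_1D(size); // create once, reuse for every signal

        long time1 = System.nanoTime();
        for (int i = 0; i < batch; i++)
        {
            fft.complexForward(data[i]); // in-place forward 1D-FFT on the CPU
        }
        long time2 = System.nanoTime();

        System.out.println("batched JTransforms: " + (time2 - time1)/1.0e6 + " msec");
    }
}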
.................................
int batch = 1024;
int size = 1024;
long timing1 = 0;
long timing2 = 0;
float data[][];
data = new float[batch][size*2];
float tmp[];
tmp = new float[batch*size*2];
.................................
..for looping over JTransforms...
.................................
for (int i=0; i<batch; i++)
{
    for (int j=0; j<size*2; j++)
    {
        tmp[i*size*2+j] = data[i][j];
    }
}

cufftHandle plan = new cufftHandle();

long time1 = System.nanoTime();

JCufft.cufftPlan1d(plan, size,
    cufftType.CUFFT_C2C, batch);

long time2 = System.nanoTime();

JCufft.cufftExecC2C(plan, tmp, tmp,
    JCufft.CUFFT_FORWARD);

long time3 = System.nanoTime();

timing1 = time2 - time1;
timing2 = time3 - time2;

for (int i=0; i<batch; i++)
{
    for (int j=0; j<size*2; j++)
    {
        data[i][j] = tmp[i*size*2+j];
    }
}

JCufft.cufftDestroy(plan);
.................................

Figure 3. Snippet of extended code (generalized, essential CUDA part) of the JCufftSample.java sample program [4]. The grey colorboxes denote the actual JCuda calls to the CUFFT library. Note the batch parameter.

Table 1: Elapsed time (in msec) for batched JTransforms (not shown in the code of Figure 3) and batched JCufft.cufftExecC2C as a function of batch and size (both parameters were equalized). Also the JCufft 1D-FFT initialization time (JCufft.cufftPlan1d) is given. Finally, factor denotes the ratio of the batched JTransforms and batched JCufft.cufftExecC2C times.

batch = size              256   512   1024   2048
JTransforms                38   199    351    385
JCufft.cufftExecC2C         8     9     16     36
JCufft.cufftPlan1d         65    64     64     64
factor                      5    22     22     11

In Table 1 the benchmark results for batched 1D-FFT via JTransforms and via JCufft.cufftExecC2C are given as a function of the input parameters batch and size; in these tests both parameters were set equal to each other. Note that although the timing function System.nanoTime() in Figure 3 suggests a very high time resolution, the accuracy of the resulting timing values appeared to be much lower.
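Given this limited timing accuracy, a common refinement (not part of the paper's original listing) is to create the CUFFT plan once and average the execution time over several repeated calls. A minimal sketch, using only the JCufft calls already shown in Figure 3 (the jcuda.jcufft package location is assumed):

import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;

public class RepeatedExecTimingSketch
{
    public static void main(String[] args)
    {
        int batch = 1024;
        int size = 1024;
        int reps = 10; // number of repeated executions to average over
        float tmp[] = new float[batch*size*2]; // interleaved complex batch

        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, size, cufftType.CUFFT_C2C, batch); // initialize once

        long time1 = System.nanoTime();
        for (int r = 0; r < reps; r++)
        {
            JCufft.cufftExecC2C(plan, tmp, tmp, JCufft.CUFFT_FORWARD);
        }
        long time2 = System.nanoTime();

        System.out.println("mean execution time: " + (time2 - time1)/1.0e6/reps + " msec");
        JCufft.cufftDestroy(plan);
    }
}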

D. Performing batched 1D-FFT with 'C for CUDA' code

In order to get an impression of the overhead caused by calling CUDA functions from a Java environment (as was done for obtaining the results described in III-C), we repeated the CUDA part of the benchmark with code written directly in 'C for CUDA'.

.................................
int batch = 1024;
int size = 1024;

struct timeval tv_start, tv_stop;

cufftComplex *data_h, *data_d;

size_t memsize = batch*size*sizeof(cufftComplex);

cudaMallocHost((void**)&data_h, memsize);
cudaMalloc((void**)&data_d, memsize);

.....insert data into data_h.....

cufftHandle plan;

cudaMemcpy(data_d, data_h, memsize,
    cudaMemcpyHostToDevice);

cufftPlan1d(&plan, size, CUFFT_C2C, batch);

gettimeofday(&tv_start, NULL);

cufftExecC2C(plan, data_d, data_d,
    CUFFT_FORWARD);

gettimeofday(&tv_stop, NULL);

double timing_tv = (double)
    (tv_stop.tv_usec - tv_start.tv_usec)/1000.0;

cudaMemcpy(data_h, data_d, memsize,
    cudaMemcpyDeviceToHost);

.....extract data from data_h....

cufftDestroy(plan);
cudaFreeHost(data_h);
cudaFree(data_d);
.................................

Figure 4. Snippet of 'C for CUDA' code (generalized) for batched 1D-FFT. The grey colorboxes denote the actual CUDA calls. For the sake of simplicity only one timing is shown. Note again the batch parameter.

In Figure 4 the (generalized) code for the 'C for CUDA'-based benchmark is presented. Compared to Figure 3 there are more lines of code, since memory allocation/freeing (on host and CUDA device) as well as data transfer between host and device memory must now be stated explicitly.

Table 2: Elapsed time (in msec) for batched cufftExecC2C as a function of batch and size (again both parameters were equalized). Note the lack of the JCufft naming part, when compared to Table 1, indicating that now 'C for CUDA' code was used for the benchmark. Again also the 1D-FFT initialization time (cufftPlan1d) is given. Finally, the data-transfer times to and from the CUDA device (cudaMemcpy) are given.

batch = size               256   512   1024   2048
cufftExecC2C              0.08  0.08   0.13   0.15
cufftPlan1d               0.04  0.04   0.06   0.06
cudaMemcpyHostToDevice     0.1   0.4    1.7    6.2
cudaMemcpyDeviceToHost     0.2   0.7    3.2   14.0

Table 3: Data-transfer bandwidth (in GB/sec) for cudaMemcpy as a function of batch and size (both parameters were equalized).

batch = size               256   512   1024   2048
cudaMemcpyHostToDevice     4.2   4.6    4.7    5.0
cudaMemcpyDeviceToHost     2.4   2.8    2.5    2.2
In Table 3 the data-transfer bandwidth (the rate) for cudaMemcpy (to and from the CUDA device) is presented, as calculated from

bandwidth = (1000 × memsize) / (time × 1024³),   (1)

where bandwidth is in GB/sec, memsize is the number of bytes of the batched, complex-valued float data array and time is the elapsed time (timing) in msec. The bandwidth quantity often is an important factor for measuring performance in the field of GPU computing [9].
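As a worked example of Eq. (1), the sketch below reproduces the cudaMemcpyHostToDevice entry of Table 3 for batch = size = 1024 from the corresponding 1.7 msec timing in Table 2 (cufftComplex is 8 bytes, i.e. two floats); the result, about 4.6 GB/sec, is consistent with the tabulated 4.7 GB/sec given the rounding of the 1.7 msec entry:

public class BandwidthSketch
{
    // Eq. (1): bandwidth in GB/sec from memsize (bytes) and time (msec)
    static double bandwidthGBps(long memsizeBytes, double timeMsec)
    {
        return 1000.0*memsizeBytes/(timeMsec*1024.0*1024.0*1024.0);
    }

    public static void main(String[] args)
    {
        long memsize = 1024L*1024L*8L; // batch*size*sizeof(cufftComplex)
        System.out.println(bandwidthGBps(memsize, 1.7) + " GB/sec"); // ~4.6
    }
}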
E. User-guided exhaustive search in MRS parameter space

When performing in vivo quantitation of metabolites in MRS, one usually applies some form of nonlinear least-squares (NLLS) fitting of a physical model function to the MRS data. In order to handle unknown lineshapes, i.e. unknown decays in the measurement (time) domain, we recently [3] have introduced a second NLLS criterion, based on the general prior knowledge that the width of the lineshape is limited. In the results presented in this subsection, we have replaced this second NLLS optimization step by an exhaustive search of the relevant MRS parameters.

The method has as a starting point [3] that under certain conditions an in vivo MRS signal s(t) can be modelled by

s(t) = decay(t) ŝnodecay(t) + noise(t),   (2)

where decay(t) is the (usually unknown) decay function and ŝnodecay(t) the a priori known non-decaying version of the model function, obtained by summing over the contributions from the individual metabolites. In this context the ˆ on ŝnodecay(t) indicates a known mathematical function with unknown values of the parameters.

In the code listed below (see Figure 5) we have illustrated the example case of having three MRS parameters (amplitudes) a1, a2 and a3, representing the concentrations of three metabolites. Since in the exhaustive search only the ratio of the amplitudes can be determined, the third parameter a3 was kept fixed at 1.0 and the values of the others were varied. The starting ratio was assumed to be found by a separate round of the first NLLS criterion. After applying a simple estimator for decay(t) [3] (for all values of a1 and a2 in the exhaustive-search grid), an essential next step in the method is to perform a batched 1D-FFT of all n1max × n2max decay functions. Finally, the (a1,a2)-combination is to be determined at which the quantity |Re:FFT[decay(t)]|, summed over the region |ν| > νthreshold, is minimal.

.................................
int n1max = 90;
int n2max = 90;
int n3max = 1;

int batch = n1max*n2max*n3max;
int ndp = 1024;

int n1mul = n2max*n3max*ndp;
int n2mul = n3max*ndp;
int n3mul = ndp;

float summin = 1000000.0;

float a1start = 0.25;
float a2start = 0.50;
float a3start = 1.00;

float a1step = 0.0001;
float a2step = 0.0001;
float a3step = 0.0;

float sum;
float *a1, *a2, *a3, *modlshape;

cufftComplex decay, *data_h;

......other initializations......
.........and assignments.........

for (n1=0; n1<n1max; n1++)
{
  a1[n1] = a1start - (float) 0.5*n1max*a1step
      + (float) n1*a1step;
  for (n2=0; n2<n2max; n2++)
  {
    a2[n2] = a2start - (float) 0.5*n2max*a2step
        + (float) n2*a2step;
    for (n3=0; n3<n3max; n3++)
    {
      a3[n3] = a3start - (float) 0.5*n3max*a3step
          + (float) n3*a3step;
      for (n=0; n<ndp; n++)
      {
        ...estimate decay from MRS signal...

        index = n1*n1mul + n2*n2mul + n3*n3mul + n;
        data_h[index].x = decay.x;
        data_h[index].y = decay.y;
      }
    }
  }
}

....batched 1D-FFT via cufft.....

for (n1=0; n1<n1max; n1++)
for (n2=0; n2<n2max; n2++)
for (n3=0; n3<n3max; n3++)
{
  sum = 0.0;
  for (n=0; n<ndp; n++)
  {
    index = n1*n1mul + n2*n2mul + n3*n3mul + n;
    modlshape[n] = sqrtf(data_h[index].x
        *data_h[index].x);
    if ((n > (int)(0.2*(float)ndp)) &
        (n < (int)(0.8*(float)ndp)))
    {
      sum = sum + modlshape[n];
    }
  }
  if (sum < summin)
  {
    summin = sum;
    n1tune = n1;
    n2tune = n2;
    n3tune = n3;
    a1tune = a1[n1tune];
    a2tune = a2[n2tune];
    a3tune = a3[n3tune];
  }
}
.................................

Figure 5. Snippet of 'C for CUDA' code (generalized) for user-guided exhaustive search in MRS parameter space. The grey colorboxes denote CUFFT-based code (see Figure 4). For the sake of simplicity timings are not shown now.
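For completeness, the '....batched 1D-FFT via cufft.....' step of Figure 5 could equally be driven from Java through the JCufft calls of Figure 3. A minimal, hypothetical sketch (the jcuda.jcufft package location is assumed; the decay-filling and search loops are elided, and the interleaved float array replaces the cufftComplex array):

import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;

public class ExhaustiveSearchFftSketch
{
    public static void main(String[] args)
    {
        int n1max = 90, n2max = 90, n3max = 1;
        int ndp = 1024;
        int batch = n1max*n2max*n3max; // 8100 decay functions

        // interleaved complex data, filled per Figure 5's index scheme
        float data[] = new float[batch*ndp*2];
        // ...estimate and insert the decay functions into data...

        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, ndp, cufftType.CUFFT_C2C, batch);
        JCufft.cufftExecC2C(plan, data, data,
            JCufft.CUFFT_FORWARD); // all 8100 length-1024 1D-FFTs in one call
        JCufft.cufftDestroy(plan);
        // ...search data for the minimizing (a1,a2)-combination...
    }
}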

Table 4: Elapsed time (in msec) and data-transfer bandwidth (in GB/sec) for various parts of the exhaustive-search 'C for CUDA' code in Figure 5 (batch = 8100 and ndp = 1024).

                              time (msec)   bandwidth (GB/sec)
insert data into data_h           417
cudaMemcpyHostToDevice             12             5.0
cufftPlan1d                         0.06
cufftExecC2C                        0.18
cudaMemcpyDeviceToHost             18             3.4
extract data from data_h          249

Figure 6. The quantity |Re:FFT[decay(t)]| as a function of index (see text and Figure 5).
In Table 4 the benchmark results for various parts of the exhaustive-search 'C for CUDA' code in Figure 5 are shown. In this sample case we have worked with the simulated MRS signal (noiseless version) used in our lineshape study [3].

Figure 6 gives an impression of the minimization of |Re:FFT[decay(t)]| (summed over |ν| > νthreshold) when varying (a1,a2) (the third amplitude a3 was kept fixed at 1.0). We have found an amplitude ratio of 0.2498 : 0.4994 : 1.0, which agrees well with the true ratio of 0.25 : 0.5 : 1.0.

Finally we like to note that we have called this subsection 'User-guided exhaustive search' because, firstly, we have assumed a reasonably good starting ratio for the amplitudes to be available (obtained with a separate round of the first NLLS criterion) and, secondly, we have established a suitable grid for (a1,a2) in a user-interactive way.

IV. DISCUSSION

A. Concerning the Java-based benchmark

The numbers in Table 1 show that the JCufft initialization time is larger than the batched JCufft execution time itself. Moreover, the initialization time is independent of the batch and size parameters (for the given range). This suggests that acceleration of batched JCufft with respect to batched JTransforms, as indicated by factor in Table 1, is realized only if batched JCufft is performed several times using the same JCufft initialization.

Concerning the acceleration factor of 22 in Table 1 (for batch = 512, 1024), we like to mention that this result more or less agrees with results that can be deduced from the tables in [8] (for the case of single 1D-FFT). Moreover, in the latter benchmark it also is shown that JTransforms has execution times comparable to those of FFTW [10]. This is of interest to us since the FFTW method is applied in our jMRUI software package.

B. Concerning the 'C for CUDA'-based benchmark

When comparing Table 1 with Table 2, a striking difference in 1D-FFT initialization time (see the timings of cufftPlan1d) seems to be indicated. However, from inspection of the JCufft Java class (see source code [4]) it can be learned that its cufftPlan1d method performs dynamic library loading (library initialization) if it is the first time a JCufft method is being called. We have investigated this aspect further in a separate code, in which the JCufft.cufftPlan1d method was called twice. The timing found for the second call was negligibly small, in agreement with the 'C for CUDA' results for cufftPlan1d listed in Table 2.
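A minimal sketch of that separate test (again assuming the jcuda.jcufft package location, and using only JCufft calls already shown in Figure 3) could look as follows:

import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;

public class PlanTwiceSketch
{
    public static void main(String[] args)
    {
        int batch = 1024;
        int size = 1024;

        cufftHandle plan1 = new cufftHandle();
        cufftHandle plan2 = new cufftHandle();

        long time1 = System.nanoTime();
        JCufft.cufftPlan1d(plan1, size, cufftType.CUFFT_C2C, batch); // triggers dynamic library loading
        long time2 = System.nanoTime();
        JCufft.cufftPlan1d(plan2, size, cufftType.CUFFT_C2C, batch); // pure planning cost
        long time3 = System.nanoTime();

        System.out.println("first  call: " + (time2 - time1)/1.0e6 + " msec");
        System.out.println("second call: " + (time3 - time2)/1.0e6 + " msec");

        JCufft.cufftDestroy(plan1);
        JCufft.cufftDestroy(plan2);
    }
}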
Another point of interest in the Table 1 - Table 2 comparison is the difference in cufftExecC2C timings. Again the explanation can be found in the source code of the JCufft Java class. It appears that memory allocation/freeing and data transfer (between host and CUDA device), as mentioned in III-D, are included in the JCufft.cufftExecC2C method. This is clearly demonstrated by the fact that the summed timings for cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost (in Table 2) are of the same order of magnitude as the timings for JCufft.cufftExecC2C (in Table 1).

Finally we like to note that the data-transfer bandwidths presented in Table 3 are of the same order of magnitude as the 8 GB/sec bandwidth reported for the PCI Express 2.0 ×16 bus [9] (used in our desktop PC).
C. Concerning the exhaustive-search benchmark

From inspection of Table 4 it can be seen that inserting data into, or extracting it from, data_h (the array involved in the batched 1D-FFT) is by far the dominant timing factor. This clearly is caused by the two fourfold for-statement loopings involved (see Figure 5). It suggests that these code parts also must be carried out in the CUDA device.

V. CONCLUDING REMARKS

Summarizing, we like to make the following concluding remarks:

• We have assembled a low-cost CUDA-capable desktop PC, reflecting the PC state of the art of about 1½ years ago.
• Via the Ubuntu 9.10 Linux operating system we could enable CUDA by installing a recent Linux NVIDIA driver and the CUDA software (version 2.3).
• By applying the Java-bindings-based JCuda software package we could call CUFFT library functions from a Java environment.
• We could easily perform batched (multiple) 1D-FFT in a parallel fashion by exploiting the batch facility of CUFFT 1D-FFT for a CUDA-enabled GPU device. In this way we could avoid the for-statement looping needed for the (CPU-based) reference method.
• We could speed up the batched 1D-FFT execution time by about a factor of 20 by applying the GPU-based rather than the CPU-based approach.
• Easy comparison of Java-based and 'C for CUDA'-based benchmarking appeared to be hindered by the choices made for the JCuda implementation.
• The CUDA-based benchmark results reported in this work seemed to be limited by the data-transfer bandwidth of the computer's PCI Express 2.0 ×16 bus.
• If data-transfer speed indeed is the limiting factor, significant computational accelerations can only be achieved if major parts of the numerical calculations can be carried out in the CUDA GPUs.
• In the context of the latter, the enhanced double precision and amount of local memory of recent/future CUDA devices will become important.
• Using CUDA-based batched 1D-FFT, we could carry out a sample user-guided exhaustive search in MRS parameter space.

ACKNOWLEDGMENT

This work is supported by the Marie Curie Research Training Network 'FAST' (MRTN-CT-2006-035801, 2006-2009).

REFERENCES

[1] NVIDIA, "CUDA 2.3 downloads," http://developer.nvidia.com/object/cuda_2_3_downloads.html, 2009.
[2] D. Stefan, F. D. Cesare, A. Andrasescu, E. Popa, A. Lazariev, E. Vescovo, O. Strbak, S. Williams, Z. Starcuk, M. Cabanas, D. van Ormondt, and D. Graveron-Demilly, "Quantitation of magnetic resonance spectroscopy signals: the jMRUI software package," Meas. Sci. Technol., vol. 20, p. 104035 (9pp), 2009.
[3] E. Popa, E. Capobianco, R. de Beer, D. van Ormondt, and D. Graveron-Demilly, "In vivo quantitation of metabolites with an incomplete model function," Meas. Sci. Technol., vol. 20, p. 104032 (9pp), 2009.
[4] Jcuda.org, "Java bindings for CUDA," http://www.jcuda.org/, 2010.
[5] D. Stefan, A. Andrasecu, E. Popa, H. Rabeson, O. Strbak, Z. Starcuk, M. Cabanas, D. van Ormondt, and D. Graveron-Demilly, "jMRUI Version 4: A Plug-in Platform," in IEEE International Workshop on Imaging Systems and Techniques, IST 2008, Chania, Greece, 10-12 September 2008, pp. 346-348.
[6] NVIDIA, "CUDA Getting Started, version 2.3," http://developer.nvidia.com/object/cuda_2_3_downloads.html, 2009, manual.
[7] NVIDIA, "CUDA CUFFT Library, version 2.3," http://developer.nvidia.com/object/cuda_2_3_downloads.html, 2009, manual.
[8] P. Wendykier, "JTransforms," http://sites.google.com/site/piotrwendykier/software/jtransforms, 2009, an open source multithreaded FFT library written in pure Java.
[9] NVIDIA, "CUDA Best Practices Guide, version 2.3," http://developer.nvidia.com/object/cuda_2_3_downloads.html, 2009, manual.
[10] M. Frigo and S. G. Johnson, "FFTW," http://www.fftw.org/, 2010, a C subroutine library for computing the discrete Fourier transform.
