
An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library

Yasuhito Ogata 1,3, Toshio Endo 1,3, Naoya Maruyama 1,3, and Satoshi Matsuoka 1,2,3


1 Tokyo Institute of Technology
2 National Institute of Informatics
3 JST, CREST
{ogata, endo}@matsulab.is.titech.ac.jp
{naoya.maruyama, matsu}@is.titech.ac.jp

Abstract

General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC because of its high peak performance. However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications, its real performance is not necessarily higher than that of current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. This is because GPU performance can be severely limited by restrictions such as memory size and bandwidth and by programming through graphics-specific APIs. To overcome this problem, we propose a model-based, adaptive library for 2D FFT that automatically achieves optimal performance using the available heterogeneous CPU-GPU computing resources. To find optimal load distribution ratios between CPUs and GPUs, we construct a performance model that captures the respective contributions of the CPU and the GPU, and predicts the total execution time of a 2D FFT for arbitrary problem sizes and load distributions. The performance model divides the FFT computation into several small sub steps, and predicts the execution time of each step using profiling results. Preliminary evaluation with our prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement from using both CPUs and GPUs can be as high as 50% compared to using either a CPU core or a GPU.

1 Introduction

General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC. While the peak performance of a modern CPU is approximately 50 GFlops per die, the latest high-end GPUs boast over 500 GFlops of performance. Another advantage of using GPUs for HPC is their cost performance; thanks to commoditization they are less expensive than dedicated hardware accelerators for numerical computation such as GRAPE [9] or ClearSpeed [3], while still achieving equal or sometimes even better performance.

However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications ([6, 10] amongst many others), the real performance of GPUs is not necessarily higher than that of current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. The reasons for the lower real performance of GPUs are three-fold. First, raw GPU performance improvements can be sacrificed considerably by the overhead of memory transfers between CPUs and GPUs. Second, the size of GPU memory and its usage restrictions limit the maximum data size that can be computed with a single data transfer to a GPU, further magnifying the memory bandwidth limitation. Third, GPU peak performance is not always achievable with standard graphics APIs such as DirectX or OpenGL. Although recent GPGPU programming environments such as CUDA [11] can reduce this overhead by providing more direct APIs for general scientific computing, the programming model is not yet straightforward for a standard programmer. More importantly, manually conducting proper load balancing in a heterogeneous GPU-CPU environment for arbitrary data sizes remains a difficult challenge.

The objective of this research is to realize an adaptive library for 2D FFT that automatically achieves optimal performance using the available heterogeneous CPU-GPU computing resources. FFT is used in a multitude of applications such as image and signal processing and 3D protein analysis. An important research question is how to find the optimal load distribution among such heterogeneous resources, since the execution time on each resource must be balanced under different CPU-GPU configurations, problem sizes, etc.


Although there have been past attempts for heterogeneous clusters or grids with different CPU speeds [1], little is known about the combination of CPUs and GPUs.

This paper presents our 2D-FFT algorithm and its model-based optimal load distribution technique. The algorithm is based on the row-column method, and can exploit arbitrary combinations of heterogeneous CPUs and GPUs by distributing the 1D FFT computation among them. To find optimal load distribution ratios, we construct a performance model that captures the respective contributions of the CPU and the GPU, and predicts the total execution time of the 2D FFT of a given matrix. The performance model divides the FFT computation into several small sub steps, and predicts the execution time of each step using profiling results on a particular CPU-GPU combination.

To evaluate the performance of our 2D FFT and the effectiveness of performance modeling in finding optimal load distributions, we have implemented a prototype CPU-GPU heterogeneous FFT library using existing 1D FFT libraries. Preliminary evaluation with the prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement from using both CPUs and GPUs can be as high as 50% compared to using either CPUs or GPUs.

2 Proposed Model-Based 2D-FFT Library Using Both CPUs and GPUs

Our 2D-FFT library is based on the row-column method, which first executes column-order 1D FFTs for all the columns of the input 2D matrix, and then row-order 1D FFTs for all the rows. In each 1D FFT phase, the rows or columns are distributed among the available computing resources: in this work, CPUs and GPUs. Figure 1 illustrates an instance of such a distribution, where 65% of the computation is allocated to a GPU, 25% to a CPU, and 10% to another CPU. Determining the optimal distribution ratio is an important research challenge for effective use of such heterogeneous hardware combinations; we show that our model-based approach can significantly automate this process in Section 3. In this section, we describe the underlying FFT libraries that we use for the 1D FFTs and present the detailed algorithm of our 2D FFT.

Figure 1. Example distribution of FFT computation to one GPU and two CPUs.
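
To make the distribution concrete, the following small sketch (our illustration, not code from the library) computes the contiguous row ranges that one 1D-FFT phase would assign to a GPU and two CPU threads for the example ratios of Figure 1; the worker names are placeholders.

#include <stdio.h>

/* Split the n independent 1D FFTs of one phase among a GPU and two CPU
 * threads according to the distribution ratios of Figure 1 (65%/25%/10%).
 * Each worker receives a contiguous range of rows (or columns). */
int main(void)
{
    const int    n        = 8192;                   /* rows (or columns) in this phase */
    const double ratio[3] = { 0.65, 0.25, 0.10 };   /* GPU, CPU thread 0, CPU thread 1 */
    const char  *name[3]  = { "GPU", "CPU thread 0", "CPU thread 1" };

    int first = 0;
    for (int i = 0; i < 3; i++) {
        /* The last worker takes whatever is left so that the counts sum to n. */
        int count = (i == 2) ? n - first : (int)(ratio[i] * n);
        printf("%-12s rows [%5d, %5d)\n", name[i], first, first + count);
        first += count;
    }
    return 0;
}

In the library, each such range is handed to the corresponding 1D-FFT backend, and the column and row phases are separated by a synchronization point.
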

2.1 Underlying FFT libraries

As 1D FFT implementations, the current prototype uses FFTW [5] for FFTs on CPUs, and GPUFFTW [8] and CUFFT [11] on GPUs. FFTW is a high-performance FFT library for CPUs developed by Frigo et al. It automatically selects an optimal algorithm by measuring the performance of pre-stage executions. GPUFFTW is an FFT library for GPUs developed by Govindaraju et al. A notable feature is an optimization based on an automatically identified structure of the graphics memory. NVIDIA's CUFFT is a 2D FFT library for GPUs implemented in the CUDA programming environment [11]. Although CUDA is limited to NVIDIA's GeForce 8000 series and later, it allows programmers to write GPGPU programs in a C-like language without using graphics-specific APIs such as OpenGL. It also eliminates restrictions of traditional GPGPU environments such as read-only texture memory, thus achieving higher performance than traditional GPGPU environments.

Note that these libraries support multi-row 1D FFT with a single library call, which is an important property for good performance on GPUs, since each library call is accompanied by CPU-GPU data communication. While the design of our prototype allows the use of 1D-FFT implementations other than GPUFFTW or CUFFT, alternative libraries should support multi-row 1D FFT for better performance.
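
As an illustration of such a multi-row call, the sketch below uses FFTW's advanced interface, fftwf_plan_many_dft, to transform a block of consecutive rows of a row-major single-precision complex matrix with one plan and one execute call (compile with -lfftw3f). This is our own minimal example of the feature, not the prototype's actual code; CUFFT exposes the same capability through the batch argument of cufftPlan1d.

#include <fftw3.h>

/* Execute `rows` consecutive 1D complex FFTs of length n with a single
 * library call, operating on rows [first_row, first_row + rows) of a
 * row-major n x n matrix, in place. */
static void fft_rows_fftw(fftwf_complex *data, int n, int first_row, int rows)
{
    fftwf_complex *block = data + (size_t)first_row * n;
    int length[1] = { n };

    fftwf_plan plan = fftwf_plan_many_dft(
        1, length, rows,          /* rank-1 transforms, `rows` of them */
        block, NULL, 1, n,        /* input: unit stride, rows n apart  */
        block, NULL, 1, n,        /* in-place output, same layout      */
        FFTW_FORWARD, FFTW_ESTIMATE);

    fftwf_execute(plan);
    fftwf_destroy_plan(plan);
}

int main(void)
{
    const int n = 1024;
    fftwf_complex *m = fftwf_malloc(sizeof(fftwf_complex) * (size_t)n * n);
    for (size_t i = 0; i < (size_t)n * n; i++) { m[i][0] = 1.0f; m[i][1] = 0.0f; }

    fft_rows_fftw(m, n, 0, n / 2);   /* e.g., one worker's share of the rows */

    fftwf_free(m);
    return 0;
}
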

2.2 Detailed Algorithm

Figure 2 illustrates the execution flow of our library. It executes the column-order 1D FFTs first, and then the row-order 1D FFTs. Here, we describe three technical considerations that are specific to our 2D FFT algorithm.

2.2.1 Data Format Conversion

We have to convert the data format according to the underlying 1D-FFT libraries. Specifically, FFTW and CUFFT require the row-major format, while GPUFFTW requires the column-major format. Thus, when we use CUFFT, we transpose the whole matrix before the first phase, as illustrated in Figure 2 (a). On the other hand, when we use GPUFFTW, we first transpose only the region computed by the CPUs, as in Figure 2 (b), and transpose the GPU region only at the last step of the calculation. The current implementation performs all the transpose operations on CPUs; on dual-core machines, two threads are used for the transpose. Our next prototype will also have the option of transposing matrices on GPUs in order to reduce data communication between CPUs and GPUs.

Figure 2. Execution flows of the CUFFT-based and GPUFFTW-based libraries.
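
A minimal sketch of the CPU-side conversion is shown below; it is our illustration rather than the library's implementation, which is threaded and may additionally block for cache. The out-of-place transpose converts between row-major and column-major storage, and splitting the outer row range in two gives the two-thread transpose mentioned above.

#include <stdlib.h>

typedef struct { float re, im; } complexf;

/* Out-of-place transpose of an n x n complex matrix: dst(j,i) = src(i,j).
 * Only rows [row_begin, row_end) of the source are handled, so that two
 * CPU threads can each take one half of the range. */
static void transpose(complexf *dst, const complexf *src, size_t n,
                      size_t row_begin, size_t row_end)
{
    for (size_t i = row_begin; i < row_end; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

int main(void)
{
    size_t n = 1024;
    complexf *a = malloc(n * n * sizeof *a);
    complexf *b = malloc(n * n * sizeof *b);
    for (size_t i = 0; i < n * n; i++) { a[i].re = (float)i; a[i].im = 0.0f; }

    transpose(b, a, n, 0, n / 2);   /* thread 0's half ... */
    transpose(b, a, n, n / 2, n);   /* ... and thread 1's half */

    free(a); free(b);
    return 0;
}
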

2.2.2 Memory Size Limitation

The memory sizes of GPUs require special consideration, since typical GPUs have much less memory than the host: even high-end GPUs currently have less than a gigabyte of memory, while hosts typically have several gigabytes. Moreover, the size of data that can be placed on a GPU is even smaller when OpenGL is used, since read-only texture memory and a write-only frame buffer have to be allocated separately. Thus, we support larger computations by iterating GPU library calls so that each call runs successfully within the given memory size.
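
The following sketch (ours; error checking and CUFFT's own work-area memory are ignored) illustrates this iteration for a CUFFT-style batched 1D FFT: each pass transfers at most as many rows as fit into a given GPU memory budget, transforms them with one batched call, and copies the results back.

#include <cuda_runtime.h>
#include <cufft.h>

/* Run `rows` length-n row FFTs on the GPU under a memory budget:
 * at most `chunk` rows are resident on the GPU at any time. */
void gpu_fft_rows_chunked(cufftComplex *host, int n, int rows,
                          size_t gpu_bytes_budget)
{
    size_t row_bytes = (size_t)n * sizeof(cufftComplex);
    int chunk = (int)(gpu_bytes_budget / row_bytes);   /* rows per GPU pass */
    if (chunk < 1) chunk = 1;

    cufftComplex *dev;
    cudaMalloc((void **)&dev, (size_t)chunk * row_bytes);

    for (int first = 0; first < rows; first += chunk) {
        int count = (first + chunk <= rows) ? chunk : rows - first;

        cudaMemcpy(dev, host + (size_t)first * n, (size_t)count * row_bytes,
                   cudaMemcpyHostToDevice);

        cufftHandle plan;                        /* `count` transforms of length n */
        cufftPlan1d(&plan, n, CUFFT_C2C, count);
        cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
        cufftDestroy(plan);

        cudaMemcpy(host + (size_t)first * n, dev, (size_t)count * row_bytes,
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
}

In a real implementation the plan would be created once and reused for all equal-sized chunks; it is recreated inside the loop here only to keep the sketch short.
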
2.2.3 CPU Load Balancing

We allocate data of different sizes to each thread on the CPUs. The reason is the load imbalance that can arise in GPGPU environments. Our preliminary experiments using CUDA version 0.8 on a dual-core CPU indicated that the GPU control thread mostly occupies one of the two cores. To run programs efficiently even on such load-imbalanced CPU cores, we need to adjust the data size depending on the load of each core.

3 Model-Based Optimal Load Balancing

We find optimal data distribution ratios by deriving a performance model of our FFT library. The model gives a predicted execution time that is parameterized by the size of the input matrix, n×n, and the distribution ratio to the GPU, r. For a given 2D FFT of size n, we set r to the value for which the model predicts the shortest execution time. The rest of this section describes the model of the CUFFT version of our library; because the model for the GPUFFTW version is nearly identical to that for the CUFFT version, we do not discuss it here for the sake of brevity. In addition, the model below assumes that only one thread is used for CPU-side computation; we have observed that using a single thread or two does not change the best performance on the CPU-GPU configuration used in our experiments.

To derive the performance model, we divide the entire FFT computation into several sub steps, whose execution times are represented by the model parameters shown in Table 1. These parameters denote the primitive performance of the underlying FFT libraries and hardware. For example, we model the CPU-to-GPU data transfer time as Kc2g × r × n², because we estimate that it increases linearly with the data size. We determine the exact value of each model parameter through several preamble executions for a given hardware configuration.

Table 1. Model Parameters

  Param   Description
  Ktr     Matrix transposition on CPU
  KgMa    Memory allocation on GPU
  Kc2g    Data transmission from CPU to GPU
  Kg2c    Data transmission from GPU to CPU
  KgP     Pre-processing of 1D-FFT on GPU
  KgDP    Post-processing of 1D-FFT on GPU
  KcFFT   1D-FFT on CPU
  KgFFT   1D-FFT on GPU
  KgMF    Memory releasing on GPU
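
As an illustration of how such preamble runs could be turned into parameter values (our sketch, not the library's calibration code; the measured times are placeholders, and the base-2 logarithm is our assumption since the paper writes log n without a base), each parameter is simply a measured time divided by the modeled amount of work.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double m = 512.0;   /* profile-run size: an m x m complex matrix */

    /* Hypothetical wall-clock times from two preamble runs; in the real
     * library these would be measured around the corresponding steps. */
    const double t_cpu_fft = 0.020;   /* m row FFTs of length m on the CPU  */
    const double t_c2g     = 0.004;   /* transferring the matrix to the GPU */

    /* Each step is modeled as K times its work, so K = time / work:
     * CPU 1D-FFT work is m^2 * log m, CPU-to-GPU transfer work is m^2. */
    double KcFFT = t_cpu_fft / (m * m * log2(m));
    double Kc2g  = t_c2g     / (m * m);

    printf("KcFFT = %.3e\n", KcFFT);
    printf("Kc2g  = %.3e\n", Kc2g);
    return 0;
}
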
Figure 3 illustrates the execution steps of the CUFFT version of our library. We have identified nine steps that heavily affect total performance, such as the 1D FFT, transposition, and data transfer. When the matrix is larger than the available GPU memory, the library divides it into sub-matrices and iterates the steps for each sub-matrix.

Figure 3. Execution steps in our model for the CUFFT-based library.

Using the model parameters, we first define the performance models of row- and column-major 1D FFT for CPUs and GPUs. We define the CPU models for row- and column-major 1D FFT, Ccol and Crow, as follows:

Ccol = (1 − r)n² · 2Ktr · Mtr(1 − r) + (1 − r)n² · log n · KcFFT    (1)

Crow = (1 − r)n² · log n · KcFFT    (2)

Here, n is the length of each dimension of the input 2D matrix, and r is the distribution ratio to the GPU, where 0 ≤ r ≤ 1. Mtr is a function from r to real values that adjusts the matrix transpose time; it is described in more detail in Section 3.1. The first and second terms of Equation (1) are the predicted times of the matrix transpose and of the 1D FFT on a CPU, respectively. Note that the row-major 1D FFT does not require a matrix transpose, as illustrated in Figure 2; thus, unlike Equation (1), Equation (2) only includes the term for the 1D FFT.

The GPU models for row- and column-major 1D FFT are as follows:

Gcol = rn² · 2Ktr · Mtr(r) + rn² · log n · KgFFT + rn² · (KgMa + Kc2g + KgP + KgDP + Kg2c)    (3)

Grow = rn² · log n · KgFFT + rn² · (Kc2g + KgP + KgDP + Kg2c + KgMF)    (4)

The first and second terms of Equation (3) are the predicted times of the matrix transpose and of the 1D FFT on a GPU, respectively. Like Equation (2), Equation (4) does not have a term for the matrix transpose. The last terms of both equations account for the time needed to set up and finalize GPU processing.

Finally, we define the total performance model as the sum of the row and column execution times. We model the time of each of the row and column computations as the maximum of the CPU and GPU times, since the row-column method requires synchronization before each phase. Thus, we model the total time T as follows:

T = max(Gcol, Ccol) + max(Grow, Crow)    (5)

In these models, we assume that the matrix transpose is done using two threads on the CPUs, while the 1D FFT on the CPUs is done by only one thread. Supporting more flexible configurations remains future work.
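
Equations (1)-(5) translate directly into code. The sketch below (our illustration, not the paper's implementation) evaluates the predicted total time and scans r in 5% steps, as in the experiments, to pick the predicted-optimal GPU ratio. The function mtr corresponds to Mtr and is defined in the sketch following Equation (6) in Section 3.1; the base-2 logarithm is our assumption.

#include <math.h>

/* Primitive per-element costs obtained from preamble runs (Table 1). */
struct model_params {
    double Ktr, KgMa, Kc2g, Kg2c, KgP, KgDP, KcFFT, KgFFT, KgMF;
};

double mtr(double r);   /* M_tr, Equation (6); defined in Section 3.1 */

/* Equations (1)-(2): CPU time of the column and row phases. */
static double c_col(const struct model_params *p, double n, double r)
{
    double c = 1.0 - r;   /* fraction handled by the CPU */
    return c * n * n * 2.0 * p->Ktr * mtr(c)
         + c * n * n * log2(n) * p->KcFFT;
}

static double c_row(const struct model_params *p, double n, double r)
{
    return (1.0 - r) * n * n * log2(n) * p->KcFFT;
}

/* Equations (3)-(4): GPU time of the column and row phases. */
static double g_col(const struct model_params *p, double n, double r)
{
    return r * n * n * 2.0 * p->Ktr * mtr(r)
         + r * n * n * log2(n) * p->KgFFT
         + r * n * n * (p->KgMa + p->Kc2g + p->KgP + p->KgDP + p->Kg2c);
}

static double g_row(const struct model_params *p, double n, double r)
{
    return r * n * n * log2(n) * p->KgFFT
         + r * n * n * (p->Kc2g + p->KgP + p->KgDP + p->Kg2c + p->KgMF);
}

/* Equation (5): each phase is as slow as its slowest resource. */
static double total_time(const struct model_params *p, double n, double r)
{
    return fmax(g_col(p, n, r), c_col(p, n, r))
         + fmax(g_row(p, n, r), c_row(p, n, r));
}

/* Scan r = 0, 0.05, ..., 1.0 and return the ratio with the smallest
 * predicted total time. */
double best_gpu_ratio(const struct model_params *p, double n)
{
    double best_r = 0.0, best_t = total_time(p, n, 0.0);
    for (int step = 1; step <= 20; step++) {
        double r = 0.05 * step;
        double t = total_time(p, n, r);
        if (t < best_t) { best_t = t; best_r = r; }
    }
    return best_r;
}
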
3.1 Fitting Predicted Transpose Time

Although we expected the time to transpose a matrix to increase linearly with the size of the matrix, it turned out that a linear model does not fit the actual performance well when multiple threads are used. Figure 4 shows the difference between the real and predicted performance for an 8192×8192 matrix using two threads. The x-coordinate corresponds to the data distribution ratio between the threads, and the y-coordinate to the execution time of one of the two threads. The graph shows relative time, which we calculate by dividing the transpose time with two threads running by the time with only a single thread running. The "real" curve shows the actual results on the same experimental platform as in Section 4, which uses an Intel Core2 Duo 2.13GHz CPU. If using two threads for the transpose halved the execution time, the line would be one for all x points. In reality, however, the relative performance decreased by up to 50%; it particularly suffered when the data size was small, while approaching the single-thread performance as the data distribution ratio approached 100%.

Figure 4. Relative execution time to transpose an 8192×8192 matrix.


We speculate that the reason for this performance decrease is an adverse effect on the data cache from two threads running simultaneously, since the dual-core CPU used in this experiment shares the same L2 data cache between the two cores. Let r be the distribution ratio to a thread, in percent. The thread runs in parallel with the other thread while computing the first min(r, 100 − r) percent of the data, and runs alone for the remaining part, i.e., r − min(r, 100 − r) percent. Thus, when the ratio is smaller than 50%, it always runs together with the other thread; when it is greater than 50%, it runs in parallel while processing the first (100 − r)% of the data, but exclusively while processing the remaining (2r − 100)%. For example, when r is 60%, the thread runs in parallel with the other thread while processing 40% of the input matrix, and exclusively while processing the remaining 20%. Thus, the effect of two simultaneous threads should be most visible for r less than 50%, and should decrease as r grows from 50% to 100%. In fact, as shown in the figure, the relative execution time approaches the single-thread case as the distribution ratio increases from 50% to 100%, and it stays at nearly 1.5 times the single-thread time when the ratio is smaller than 50%, except for the smallest ratio case (r = 5%). We plan to conduct further analysis with detailed profiling of cache usage and other performance metrics.

To adjust our linear model to the real performance, we introduce a function of the data distribution ratio r, where 0 ≤ r ≤ 1. The function, Mtr, approximates the real execution time shown in Figure 4, and is defined as follows:

Mtr(r) = (1 − Ctr) · (r − 0.5) / 0.5 + Ctr   if r > 0.5
Mtr(r) = Ctr                                 if r ≤ 0.5        (6)

Here, we use Ctr to fit Mtr to the measured performance; in this particular case, we set Ctr to 1.5.
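
A direct transcription of Equation (6) with Ctr = 1.5 (our sketch; it supplies the mtr function assumed by the model code after Equation (5)) prints the correction factor across ratios, showing the behavior described above: a constant factor of 1.5 below 50% that falls off linearly to 1.0 at 100%.

#include <stdio.h>

/* M_tr, Equation (6): relative slowdown of the two-thread transpose as a
 * function of the fraction r of the matrix handled by one resource. */
double mtr(double r)
{
    const double Ctr = 1.5;   /* fitted constant, Section 3.1 */
    if (r > 0.5)
        return (1.0 - Ctr) * (r - 0.5) / 0.5 + Ctr;
    return Ctr;
}

int main(void)
{
    for (int i = 0; i <= 10; i++) {
        double r = i / 10.0;
        printf("r = %.1f  Mtr = %.2f\n", r, mtr(r));
    }
    return 0;
}
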
4 Experimental Evaluation

To evaluate the effectiveness of our proposal, we conduct two experimental studies: a comparison of the library with the CPU-only and GPU-only cases, and an evaluation of the accuracy of the performance modeling. For both studies, we use an x86 machine with an Intel Core 2 Duo E6400 (2.13GHz), 4GB of PC6400 memory, and a GeForce 8800GTX GPU. The GPU is equipped with 128 stream processors running at 1.35GHz and 768MB of video memory. We use Linux kernel 2.6.18 with NVIDIA display driver version 97.46 and CUDA version 0.8. The versions of CUFFT and GPUFFTW are 0.81 and 1.0, respectively. Each element of the 2D matrices used in the following experiments consists of two 32-bit floating-point numbers, representing the real and imaginary parts of a complex value.

4.1 Performance of the Prototype FFT Library

As a baseline performance study, we compare our library with existing 2D-FFT libraries: the original FFTW and CUFFT. Figure 5 shows the execution time of the 2D FFT while changing the length of each dimension from 256 to 8192. We use FFTW and CUFFT in our library as building blocks that compute the 1D FFTs. Thus, our current 2D-FFT implementation incurs the overhead of transposing matrices and transferring data between CPUs and GPUs. In particular, the CUFFT-based library is almost twice as slow as the original CUFFT when the data size is 4000. This is an artifact of generalization and by no means a shortcoming of our approach; for example, since our library is not bound by the memory-size restriction of current GPUs, it can compute much bigger matrices than the original CUFFT. We also see that our CUFFT-based implementation is faster than the GPUFFTW-based one by 32% when the data size is 8192. This result reflects the organization of the two libraries: CUFFT requires fewer data movements than GPUFFTW.

Figure 5. Performance comparison with existing 2D-FFT libraries.

Next, we evaluate the effect of varying distribution ratios on the total performance when using both the CPU and the GPU. Figure 6 shows the performance of both the CUFFT-based and GPUFFTW-based implementations, where we select the optimal problem distribution ratio (labeled "GPU Ratio: Optimal"), 0% to the GPU ("GPU Ratio: 0%"), and 100% to the GPU ("GPU Ratio: 100%"). We see that, compared to the GPU-only case, the optimally distributed versions of the CUFFT-based and GPUFFTW-based libraries achieve 19% and 55% improvements, respectively, and 33% and 2.4% improvements compared to the CPU-only case.

Figure 6. Performance comparison with varying problem sizes.

Figure 7 shows the execution time of our CUFFT-based library with varying distribution ratios, with the size of each dimension fixed at 8192. The x-axis represents the ratio of computation allocated to the GPU; 100% means that all the computation is done on the GPU, and 0% that no computation is done on the GPU. The two curves show the execution time when using either a single CPU thread or two. In both cases, allocating 70% of the computation to the GPU achieved the best performance, 1.2 times faster than the GPU-only case. Compared to the CPU-only case, the optimal distribution achieved 2.2 times and 1.5 times improvements over the single-thread and two-thread cases, respectively.

Figure 7. Performance of our library with varying distribution ratios.

As shown in the graph, the performance of the single-threaded and multi-threaded libraries was nearly the same when more than 50% of the computation was allocated to the GPU. Further analysis revealed that the overhead of the GPU control thread nearly occupied a single CPU core, so using two CPU threads did not improve the performance. This overhead of the GPU control thread is reportedly reduced by the latest CUDA version 1.0; an evaluation using the latest version remains future work.

Table 2. Learned model parameters using either 512-length or 8192-length profile runs.

  Param   512           8192
  Ktr     1.11 × 10^-8   1.47 × 10^-8
  KgMa    5.95 × 10^-10  6.93 × 10^-11
  Kc2g    6.16 × 10^-9   5.92 × 10^-9
  Kg2c    5.68 × 10^-9   5.09 × 10^-9
  KgP     2.73 × 10^-11  2.83 × 10^-10
  KgDP    7.28 × 10^-12  2.46 × 10^-9
  KcFFT   2.61 × 10^-9   3.21 × 10^-9
  KgFFT   1.65 × 10^-10  1.88 × 10^-10
  KgMF    3.01 × 10^-9   6.08 × 10^-10

4.2 Modeling Accuracy

To evaluate the accuracy of our performance modeling and its effectiveness in finding optimal distribution ratios, we compare the predicted and real execution times of the CUFFT-based library. Note that we only consider a model for the single-threaded case; as shown in Section 4.1, the optimal distribution ratio does not differ between using one thread or two.

First, we evaluate the modeling accuracy when predicting the performance of the same matrix size as the profile runs.

Figure 8 shows the real and predicted execution times of the 2D FFT, where the length of each dimension is 8192. Table 2 shows the learned values of the model parameters. We vary the distribution ratio from 0% to 100% at 5% intervals. We used two profile runs: one using the CPU and one using the GPU. As shown in the figure, we successfully identified the optimal distribution ratio, while the execution-time prediction for each distribution ratio had a small error: the maximum error was 2.9%, and the majority was under 5%.

Figure 8. Comparison of the real performance of 8192-length 2D FFT with the predicted performance using profile runs with the same data size.

Next, we evaluate the modeling accuracy when predicting the performance of a matrix size different from the profile runs. As profile runs, we use the performance profiles of 512-length 2D FFT using either the CPU or the GPU. Figure 9 compares the predicted and real performance of 1024-length 2D FFT. We see that our performance model successfully found the optimal distribution ratio; the average execution-time prediction error was approximately 4.5%, and the maximum was less than 10.6%.

Figure 9. Comparison of the real performance of 1024-length 2D FFT with the predicted performance using 512-length profile runs.

Figure 10 shows the predicted and real performance of 8192-length 2D FFT. Here, we see relatively large prediction errors in the execution time: our model predicted execution times that were shorter by approximately 20%.

Figure 10. Comparison of the real performance of 8192-length 2D FFT with the predicted performance using 512-length profile runs.

To understand the cause of the prediction error, we examined the breakdowns of both the real and predicted execution times of 8192-length 2D FFT. We used 512-length 2D FFT as the profile runs. Figure 11 compares the real and predicted cases for both the CPU and the GPU. Each colored rectangle corresponds to one of the FFT computation steps; the dotted red lines denote synchronization between the CPU and GPU threads. We see that the matrix transpose and the CPU 1D FFT caused the largest gaps between the real performance and our model.

Figure 11. Breakdown comparison of the real performance of 8192-length 2D FFT with the predicted performance using 512-length profile runs.

The prediction error in the transpose time suggests a fitting error in Mtr. We will derive a more accurate model for the transpose time in our future work.

We speculate on the causes of the prediction error in the CPU 1D FFT as follows. First, the difference in cache hit ratios between 512-length and 8192-length FFTs could affect the performance differently; our future work will analyze this effect with detailed performance profiling. Second, as depicted in Figure 11, the CPU 1D-FFT step overlapped with the transpose step executed by the GPU control thread. Thus, the two could interfere with each other, making the O(n² log n) 1D-FFT performance model inaccurate. Note, however, that the prediction of the optimal distribution ratio has only 0.8% error: the time with the true optimal ratio was 3.423 seconds, while the time with the predicted ratio was 3.394 seconds. Our modeling is thus still effective when finding optimal ratios for problems four times as large as the profile runs.

Overall, these results show that our prediction method can find optimal distribution ratios of large problems from small ones. Such modeling robustness is especially important for problems such as 2D FFT, since the computation has O(N² log N) time complexity, where N is the length of each dimension. For example, by taking two profiling runs of 512-length 2D FFT, which together took 0.032 second, we saved 0.79 second and 1.63 seconds compared to the GPU-only and CPU-only cases, respectively.

5 Related Work

We relate our approach to past attempts to execute FFT on GPUs, as well as to performance modeling of high-performance GPGPU. Note that little research has considered using both CPUs and GPUs simultaneously; as far as we know, there has been no study on performance modeling of FFT using both CPUs and GPUs.

Moreland and Angel presented a method to execute 2D FFT on an NVIDIA GeForce FX 5800 using graphics-specific APIs, including OpenGL and Cg [10]. Fialka and Čadík further improved FFT performance using an NVIDIA GeForce 6600GT with DirectX 9.0c and HLSL [4].

Underwood et al. proposed a technique to predict the execution time of FFT on FPGAs [13]. Similar to us, they construct a performance model of FFT by dividing the total execution into several sub steps, and derive the model parameters from profiling results. Once the model is generated, it predicts the total execution time for arbitrary data sizes and FPGA performance. The notable difference between their modeling and ours is that we consider using both CPUs and GPUs for further higher performance.

BrookGPU by Buck et al. is a programming environment for GPUs that facilitates GPU programming without specific knowledge of graphics processing [2]. BrookGPU creates a performance model for predicting the performance benefit of using GPUs over standard CPUs. Their model uses linear regression, where the dependent variable corresponds to execution time and the independent variables to the amounts of computation and data transfer. Unlike BrookGPU, our model is not restricted to a linear relationship, and is thus able to achieve more accurate prediction. Furthermore, the accuracy of their performance model has not been evaluated.

The GEMM library using both CPUs and GPUs by Ohshima et al. [12] achieved up to 40% performance improvements compared to using only CPUs or GPUs. Similar to us, they predict the optimal load distribution ratio between CPUs and GPUs from profiling results. However, unlike our model-based approach, their prediction is not applicable to problems of different data sizes. More importantly, our modeling considers not only the time spent in GPU processing, but also the time for data transfer between CPUs and GPUs.

Another closely related work is the GPU memory model proposed by Govindaraju et al. [7]. It creates a performance model to find optimal blocking sizes for scientific applications, and shows two- to five-fold speedups using the blocking size identified by the model.

Our contribution is orthogonal to theirs: combining both models could automate the identification of both the optimal blocking sizes and the load distribution among heterogeneous CPU-GPU configurations. Such an extension is a subject of our future work.

6 Conclusion

We have presented our 2D-FFT library for heterogeneous CPU-GPU configurations along with its performance model. To find optimal data distribution ratios among CPUs and GPUs, we construct a performance model of our library by dividing the 2D-FFT computation into small sub steps, and determine the performance of each step through preliminary profiling runs. Using the determined performance of each sub step, we derive a mathematical model that predicts the total execution time of 2D FFTs of arbitrary data sizes. For a particular 2D matrix, we determine the optimal load distribution ratio by finding the shortest predicted execution time. Our preliminary evaluation has shown that the model can predict the execution time of problems 16 times as large as the profile runs with less than 15% error, and that the predicted optimal load distribution ratios have less than 1% error. The overall performance improvements with the performance model ranged from 2.4% to 50% compared to the CPU-only and GPU-only configurations.

Our future work includes the following. First, for GPUs with CUDA, we plan to optimize the 2D-FFT algorithm using CUDA-specific features to exploit the potential performance of GPUs more efficiently. Specifically, this would enable transposing matrices on GPUs as well as on CPUs, thus reducing the data communication overhead substantially. In addition, the current released version (v1.0) of CUDA reportedly reduces the overhead of the GPU control thread compared to the version used in this work (v0.8); we expect that the performance of our CPU-GPU library would improve with the newer CUDA as well. Second, we will explore the performance effects of using more CPU cores. We expect that such configurations would magnify the advantage of our model-based approach. Finally, we will extend our library to exploit parallelism not only within a single machine but also across a cluster of machines with GPUs. An extended performance model will enable the use of such further heterogeneous computing resources without much human burden.

Acknowledgments

We wish to thank Naga Govindaraju and Akira Nukada for earlier ideas in this research. This work is supported in part by the Microsoft Technical Computing Initiative, and in part by the Japan Science and Technology Agency as a CREST research program entitled "Ultra-Low-Power HPC".

References

[1] F. Almeida, D. Gonzalez, and L. M. Moreno. The master-slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping. Elsevier Journal of Systems Architecture, 52:105-116, 2006.
[2] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Trans. Graph., 23(3):777-786, 2004.
[3] ClearSpeed Technology plc. ClearSpeed white paper: CSX processor architecture. http://www.clearspeed.com/.
[4] O. Fialka and M. Čadík. FFT and convolution performance in image filtering on GPU. In IV '06: Proceedings of the Conference on Information Visualization, pages 609-614, Washington, DC, USA, 2006. IEEE Computer Society.
[5] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, volume 93, pages 216-231, 2005.
[6] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 3, Seattle, WA, 2005.
[7] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06), page 6, Tampa, FL, 2006.
[8] N. K. Govindaraju and D. Manocha. GPUFFTW: High performance power-of-two FFT library using graphics processors. http://gamma.cs.unc.edu/GPUFFTW/.
[9] J. Makino, E. Kokubo, and T. Fukushige. Performance evaluation and tuning of GRAPE-6 - towards 40 "real" Tflops. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 2, Phoenix, AZ, 2003.
[10] K. Moreland and E. Angel. The FFT on a GPU. In SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003 Proceedings, pages 112-119, July 2003.
[11] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. http://developer.nvidia.com/object/cuda.html.
[12] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In VECPAR'06 - Seventh International Meeting on High Performance Computing for Computational Science, pages 305-318, 2006.
[13] K. D. Underwood, K. S. Hemmert, and C. Ulmer. Architectures and APIs: Assessing requirements for delivering FPGA performance to applications. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06), 2006.

Biographies

Yasuhito Ogata received his B.Sc. in computer science from Tokyo Institute of Technology in 2007, and is a graduate student in Mathematical and Computing Sciences, Tokyo Institute of Technology. His major
research interests include effective use of multi-core and
other parallel and distributed systems.
Toshio Endo is an associate professor at the Global COE
program, ”Computationism as a Foundation for the Sci-
ences”, at Tokyo Institute of Technology. He received his
Ph.D. from the University of Tokyo in 2001. His major re-
search interests include distributed computing and parallel
computing on heterogeneous architectures. He is a member
of IEEE-CS and ACM.
Naoya Maruyama is a Ph.D. candidate
at Tokyo Institute of Technology, expected to graduate in
March 2008. He received his Master’s degree at Tokyo In-
stitute of Technology in 2003. His research interests in-
clude cluster and Grid computing, statistical techniques for
system management, and program analysis. His recent re-
search focuses on fault detection and localization through
online system monitoring and modeling.
Satoshi Matsuoka received his Ph.D. from the Univer-
sity of Tokyo in 1993. He became a full Professor at
the Global Scientific Information and Computing Center
(GSIC) of Tokyo Institute of Technology (Tokyo Tech /
Titech) in April 2001, leading the Research Infrastructure
Division Solving Environment Group of the Titech cam-
pus. He has pioneered grid computing research in Japan since the
mid 90s along with his collaborators, and currently serves
as sub-leader of the Japanese National Research Grid Ini-
tiative (NAREGI) project, which aims to create middleware
for next-generation CyberScience Infrastructure. He was
also the technical leader in the construction of the TSUB-
AME supercomputer, which became the fastest supercom-
puter in Asia-Pacific in June 2006 at 85 Teraflops (peak,
now 111 Teraflops as of March 2009) and 38.18 Teraflops
(Linpack, 7th on the June 2006 list) and also serves as the
core grid resource in the Titech Campus Grid. He has been
(co-) program and general chairs of several international
conferences including ACM OOPSLA’2002, IEEE CCGrid
2003, HPCAsia 2004, Grid 2006, CCGrid 2006/2007/2008,
as well as countless program committee positions, in par-
ticular numerous ACM/IEEE Supercomputing Conference
(SC) technical papers committee duties including serving
as the network area chair for SC2004 and SC2008; he will be
the technical papers chair for SC2009. He served as a Steer-
ing Group member and an Area Director of the Global Grid
Forum during 1999-2005, and recently became the steer-
ing group member of the Supercomputing Conference. He
has won several awards including the Sakai award for re-
search excellence from the Information Processing Society
of Japan in 1999, and recently received the JSPS Prize from
the Japan Society for the Promotion of Science in 2006, presented
by His Royal Highness Prince Akishinomiya.

