Authorized licensed use limited to: Hewlett-Packard via the HP Labs Research Library. Downloaded on May 25, 2009 at 05:15 from IEEE Xplore. Restrictions apply.

Abstract

General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC because of its high peak performance. However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications, its real performance is not necessarily higher than that of current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. This is because GPU performance can be severely limited by such restrictions as memory size and bandwidth and programming using graphics-specific APIs. To overcome this problem, we propose a model-based, adaptive library for 2D FFT that automatically achieves optimal performance using available heterogeneous CPU-GPU computing resources. To find optimal load distribution ratios between CPUs and GPUs, we construct a performance model that captures the respective contributions of CPU vs. GPU, and predicts the total execution time of 2D FFT for arbitrary problem sizes and load distributions. The performance model divides the FFT computation into several small sub-steps, and predicts the execution time of each step using profiling results. Preliminary evaluation with our prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement using both CPUs and GPUs can be as high as 50% compared to using either a CPU core or a GPU.

1 Introduction

General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC. While the peak performance of a modern CPU is approximately 50 GFlops per die, the latest high-end GPUs boast over 500 GFlops of performance. Another advantage of using GPUs for HPC is their cost performance; thanks to commoditization, they are less expensive than dedicated hardware accelerators for numerical computation such as GRAPE [9] or ClearSpeed [3], while still achieving equal or sometimes even better performance.

However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications ([6, 10] amongst many others), the real performance of GPUs is not necessarily higher than that of current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. The reasons for the lower real performance of GPUs are three-fold. First, raw GPU performance improvements can be sacrificed considerably by the overhead of memory transfer between CPUs and GPUs. Second, the size of GPU memory and its usage restrictions limit the maximum data size that can be computed with a single data transfer to a GPU, thus further magnifying the memory bandwidth limitation. Third, GPU peak performance is not always achievable with standard graphics APIs such as DirectX or OpenGL. Although recent GPGPU programming environments such as CUDA [11] can reduce this overhead by providing more direct APIs for general scientific computing, the programming model is not yet straightforward for a standard programmer. More importantly, manually conducting proper load balancing in a heterogeneous GPU-CPU environment for arbitrary data sizes remains a difficult challenge.

The objective of this research is to realize an adaptive library for 2D FFT that automatically achieves optimal performance using available heterogeneous CPU-GPU computing resources. FFT is used in multitudes of applications such as image and signal processing and 3D protein analysis. An important research question thereof is how to find the optimal load distribution among such heterogeneous resources, since the execution time on each resource must be balanced under different CPU-GPU configurations, problem sizes, etc. Although there have been past attempts for heterogeneous clusters or grids with different CPU speeds [1], little is known for the combination of CPUs and GPUs.
This paper presents our 2D-FFT algorithm and its model-based optimal load distribution technique. The algorithm is based on the row-column method, and can exploit arbitrary combinations of heterogeneous CPUs and GPUs by distributing the 1D FFT computations among them. To find optimal load distribution ratios, we construct a performance model that captures the respective contributions of CPU vs. GPU, and predicts the total execution time of the 2D FFT of a given matrix. The performance model divides the FFT computation into several small sub-steps, and predicts the execution time of each step using profiling results on a particular CPU-GPU combination.

To evaluate the performance of our 2D FFT and the effectiveness of performance modeling in finding the optimal load distribution, we have implemented a prototype CPU-GPU heterogeneous FFT library using existing 1D FFT libraries. Preliminary evaluation with the prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement using both CPUs and GPUs can be as high as 50% compared to using either CPUs or GPUs alone.

2 Proposed Model-Based 2D-FFT Library Using Both CPUs and GPUs

Our 2D-FFT library is based on the row-column method, which first executes column-order 1D FFT for all the columns of the input 2D matrix, and then row-order 1D FFT for all the rows. In each 1D FFT, the rows and columns are distributed among the available computing resources: in this work, CPUs and GPUs. Figure 1 illustrates an instance of such a distribution, where 65% of the computation is allocated to a GPU, 25% to a CPU, and 10% to another CPU. Determining the optimal distribution ratio is an important research challenge for effective use of such heterogeneous hardware combinations; we show that our model-based approach can significantly automate this process in Section 3. In this section, we describe the underlying FFT libraries that we use for 1D FFT and present the detailed algorithm of our 2D FFT.

Figure 1. Example distribution of FFT computation to one GPU and two CPUs

2.1 Underlying 1D-FFT Libraries

FFTW [5] is an FFT library for CPUs developed by Frigo et al. It automatically selects an optimal algorithm by measuring the performance of pre-stage executions. GPUFFTW [8] is an FFT library for GPUs developed by Govindaraju et al. A notable feature is an optimization based on the automatically-identified structure of graphics memory. NVIDIA's CUFFT is a 2D FFT library for GPUs implemented in the CUDA programming environment [11]. Although CUDA is limited to NVIDIA's GeForce 8000 series and later, it allows programmers to write GPGPU programs in a C-like language without using graphics-specific APIs such as OpenGL. It also eliminates restrictions of traditional GPGPU environments, such as read-only texture memory, thus achieving higher performance than those environments.

Note that these libraries support multi-row 1D FFT with a single library call, which is an important property for better performance on GPUs, since each library call is accompanied by CPU-GPU data communication. While the design of our prototype allows the use of 1D-FFT implementations other than GPUFFTW and CUFFT, alternative libraries should likewise support multi-row 1D FFT for better performance.

2.2 Detailed Algorithm

Figure 2 illustrates the execution flow of our library. It executes column-order 1D-FFT first, and then row-order 1D-FFT. Here, we describe three technical considerations that are specific to our 2D FFT algorithm.
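To make the row-column method concrete, here is a small, pure-Python sketch. This is not the paper's implementation (which uses FFTW, GPUFFTW, or CUFFT as building blocks): the radix-2 `fft1d` stands in for those libraries, and the ratio-based partitions, mirroring the 65%/25%/10% split of Figure 1, only illustrate the bookkeeping; everything here actually runs sequentially on the CPU.

```python
import cmath

def fft1d(x):
    """Radix-2 Cooley-Tukey 1D FFT (power-of-two lengths only)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft1d(x[0::2]), fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def fft2d_row_column(m, ratios=(0.65, 0.25, 0.10)):
    """2D FFT of an n x n matrix via the row-column method.
    Each pass partitions its 1D FFTs by `ratios`, mimicking the
    distribution of Figure 1; all partitions run sequentially here."""
    n = len(m)
    cuts = [0]
    for r in ratios[:-1]:
        cuts.append(cuts[-1] + int(round(r * n)))
    cuts.append(n)  # the last partition absorbs rounding error
    # Pass 1: column-order 1D FFT for every column, partition by partition.
    cols = [[m[i][j] for i in range(n)] for j in range(n)]
    for lo, hi in zip(cuts, cuts[1:]):
        for j in range(lo, hi):
            cols[j] = fft1d(cols[j])
    mid = [[cols[j][i] for j in range(n)] for i in range(n)]
    # Pass 2: row-order 1D FFT for every row.
    out = [row[:] for row in mid]
    for lo, hi in zip(cuts, cuts[1:]):
        for i in range(lo, hi):
            out[i] = fft1d(out[i])
    return out
```

The result matches a direct two-dimensional DFT, which is the property the library relies on when it splits the column and row passes across devices.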
Table 1. Model Parameters

Param   Description
Ktr     Matrix transposition on CPU
KgMa    Memory allocation on GPU
Kc2g    Data transmission from CPU to GPU
Kg2c    Data transmission from GPU to CPU
KgP     Pre-processing of 1D-FFT on GPU
KgDP    Post-processing of 1D-FFT on GPU
KcFFT   1D-FFT on CPU
KgFFT   1D-FFT on GPU
KgMF    Memory releasing on GPU
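Section 3's actual model equations are not reproduced in this excerpt, so the following is only an illustrative sketch of how per-step constants like those in Table 1 could combine into a total-time prediction and an optimal-ratio search. The K values, the per-element work terms (FFT steps scaled by log2 n), and the two-pass structure are all assumptions of this sketch, not the paper's model.

```python
import math

# Hypothetical per-unit costs in seconds; illustrative only, NOT the
# learned values of Table 2.
K = {
    'tr':   2e-9,    # matrix transposition on CPU, per element
    'gMa':  1e-10,   # GPU memory allocation, per element
    'c2g':  6e-9,    # CPU-to-GPU transfer, per element
    'g2c':  6e-9,    # GPU-to-CPU transfer, per element
    'gFFT': 2e-10,   # 1D FFT on GPU, per element and per log2(n)
    'cFFT': 3e-9,    # 1D FFT on CPU, per element and per log2(n)
    'gMF':  1e-10,   # GPU memory release, per element
}

def predict_pass_time(n, r, k):
    """One 1D-FFT pass over an n x n matrix: a fraction r of the rows
    goes to the GPU and 1 - r to the CPU; the devices run concurrently,
    so the pass takes as long as the slower of the two."""
    g_elems = r * n * n
    c_elems = (1.0 - r) * n * n
    t_gpu = (k['gMa'] + k['c2g'] + k['g2c'] + k['gMF']) * g_elems \
            + k['gFFT'] * g_elems * math.log2(n)
    t_cpu = k['cFFT'] * c_elems * math.log2(n)
    return max(t_gpu, t_cpu)

def predict_total_time(n, r, k):
    """Two passes (column-order, then row-order), each preceded by a
    CPU-side transposition."""
    return 2.0 * (k['tr'] * n * n + predict_pass_time(n, r, k))

def optimal_ratio(n, k, steps=1000):
    """Pick the distribution ratio with the shortest predicted time."""
    return min((i / steps for i in range(steps + 1)),
               key=lambda r: predict_total_time(n, r, k))
```

For these made-up constants the search settles at a GPU ratio of 0.725; the point is the shape of the search (predict, then take the argmin over candidate ratios), not the numbers.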
Figure 4. Relative execution time to transpose an 8192² matrix.
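The piecewise adjustment Mtr that Section 3 fits to this transpose measurement (Equation (6), with Ctr = 1.5 as stated in the text) translates directly into code:

```python
def m_tr(r, c_tr=1.5):
    """Relative transpose time as a function of the per-thread
    distribution ratio r (0 <= r <= 1), per Equation (6); c_tr is the
    constant fitted to the measurements in Figure 4 (1.5 here)."""
    if r > 0.5:
        return (1.0 - c_tr) * (r - 0.5) / 0.5 + c_tr
    return c_tr
```

Mtr stays at Ctr for r ≤ 0.5 and falls linearly to 1.0 at r = 1, matching the approach to the single-thread case described in the text.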
We speculate that the reason for this performance decrease is an adverse effect on the data cache of two threads running simultaneously, since the dual-core CPU used in this experiment shares the same L2 data cache between the two cores. Let r be the distribution ratio to a thread, in percent. The thread runs in parallel with the other thread while computing the first min(r, 100 − r) percent of the data, and runs alone for the remaining part, i.e., r − min(r, 100 − r) percent. Thus, when the ratio is smaller than 50%, it always runs with the other thread; when it is greater than 50%, it runs in parallel while processing the first (100 − r)% of the data, but exclusively while processing the remaining (2r − 100)%. For example, when r is 60%, the thread runs in parallel with the other thread while processing 40% of an input matrix, and exclusively while processing the remaining 20%. Thus, the effect of two simultaneous threads should be most visible for r less than 50%, and should decrease as r increases from 50% to 100%. In fact, as shown in the figure, the relative execution time approaches the single-thread case as the distribution ratio increases from 50% to 100%, and it increased steadily to nearly 1.5 times when the ratio was smaller than 50%, except for the smallest ratio case (r = 5%). We plan to conduct further analysis with detailed profiling of cache usage and other performance metrics.

To adjust our linear models to the real performance, we introduce a function of the data distribution ratio r, where 0 ≤ r ≤ 1. The function, Mtr, approximates the real execution time shown in Figure 4, and is defined as follows:

    Mtr(r) = (1 − Ctr) · (r − 0.5) / 0.5 + Ctr    if r > 0.5
    Mtr(r) = Ctr                                  if r ≤ 0.5        (6)

Here, we use Ctr to fit Mtr to the measured performance; in this particular case, we set Ctr to 1.5.

4 Experimental Evaluation

To evaluate the effectiveness of our proposal, we conduct two experimental studies: a comparison of the library with the CPU-only and GPU-only cases, and an evaluation of the accuracy of the performance modeling. For both studies, we use an x86 machine with an Intel Core 2 Duo E6400 (2.13GHz), 4GB of PC6400 memory, and a GeForce 8800GTX GPU. The GPU is equipped with 128 stream processors running at 1.35GHz and 768MB of video memory. We use Linux kernel 2.6.18 with NVIDIA display driver version 97.46 and CUDA version 0.8. The versions of CUFFT and GPUFFTW are 0.81 and 1.0, respectively. Each element of the 2D matrices used in the following experiments consists of two 32-bit floating-point numbers, representing the real and imaginary parts of a complex value.

4.1 Performance of the Prototype FFT Library

As a baseline performance study, we compare our library with existing 2D-FFT libraries: the original FFTW and CUFFT. Figure 5 shows the execution time of 2D FFT while changing the length of each dimension from 256 to 8192. We use FFTW and CUFFT in our library as the building blocks that compute 1D FFT. Thus, our current 2D-FFT implementation incurs the overhead of transposing matrices and of transferring data between CPUs and GPUs. In particular, the CUFFT-based library is almost twice as slow as the original CUFFT when the data size is 4096. This is an artifact of generalization and by no means a shortcoming of our approach; for example, since we do not inherit the memory-size restriction of current GPUs, our library can compute much bigger matrices than the original CUFFT. We also see that our CUFFT-based implementation is faster than the GPUFFTW-based one by 32% when the data size is 8192. This result reflects the organization of the two libraries, i.e., CUFFT requires less data movement than GPUFFTW.

Figure 5. Performance comparison with existing 2D-FFT libraries

Next, we evaluate the effect of varying distribution ratios on the total performance using both the CPU and GPU. Figure 6 shows the performance of both the CUFFT-based and GPUFFTW-based implementations, where we select the optimal problem distribution ratio (labeled "GPU Ratio: Optimal"), 0% to the GPU ("GPU Ratio: 0%"), and 100% to the GPU ("GPU Ratio: 100%"). We see that, compared to the GPU-only case, the optimally-distributed versions of CUFFT and GPUFFTW achieved 19% and 55% improvements, respectively, and 33% and 2.4% improvements compared to the CPU-only case.

Figure 6. Performance comparison with varying problem sizes

Figure 7 shows the execution time of our CUFFT-based library with varying distribution ratios, while the size of each dimension is fixed at 8192. The X-axis represents the ratio of computation allocated to the GPU; 100% means that all the computation is done on the GPU, and 0% that none of it is. The two graphs show the execution time when using either a single CPU thread or two. In both cases, allocating 70% of the computation to the GPU achieved the best performance: 1.2 times faster than the GPU-only case. Compared to the CPU-only case, the optimal distribution achieved 2.2-times and 1.5-times improvements over the single-thread and two-thread cases, respectively.

Figure 7. Performance of our library with varying distribution ratios.
As shown in the graph, the performance of the single-threaded and multi-threaded libraries was nearly the same when more than 50% of the computation was allocated to the GPU. Further analysis revealed that the overhead of the GPU control thread nearly occupied a single CPU core; thus, using two CPU threads did not improve the performance. This overhead of the GPU control thread is reportedly reduced by the latest CUDA version 1.0; evaluation using the latest version remains future work.

Table 2. Learned model parameters using either 512-length or 8192-length profile runs.

Param    512            8192
Ktr      1.11 × 10−8    1.47 × 10−8
KgMa     5.95 × 10−10   6.93 × 10−11
Kc2g     6.16 × 10−9    5.92 × 10−9
Kg2c     5.68 × 10−9    5.09 × 10−9
KgP      2.73 × 10−11   2.83 × 10−10
KgDP     7.28 × 10−12   2.46 × 10−9
KcFFT    2.61 × 10−9    3.21 × 10−9
KgFFT    1.65 × 10−10   1.88 × 10−10
KgMF     3.01 × 10−9    6.08 × 10−10

4.2 Modeling Accuracy

To evaluate the accuracy of our performance modeling and its effectiveness in finding optimal distribution ratios, we compare the predicted and real execution times of the CUFFT-based library. Note that we only consider a model for the single-threaded case; as shown in Section 4.1, the optimal distribution ratio does not differ between using one thread or two.

First, we evaluate the modeling accuracy when predicting the performance of the same matrix size as the profile runs.
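The accuracy metrics used in this subsection (per-ratio execution-time prediction error, and the gap between the predicted and measured optimal ratios) can be computed with a helper of roughly this shape; the function and its dictionary inputs are hypothetical, not the paper's evaluation code:

```python
def prediction_errors(predicted, measured):
    """predicted/measured map distribution ratio -> execution time (s).
    Returns the per-ratio percentage errors and the absolute difference
    between the predicted-optimal and measured-optimal ratios."""
    errors = {r: abs(predicted[r] - measured[r]) / measured[r] * 100.0
              for r in predicted}
    r_predicted = min(predicted, key=predicted.get)
    r_measured = min(measured, key=measured.get)
    return errors, abs(r_predicted - r_measured)
```

Feeding it a sweep of predicted times and the corresponding measurements yields the maximum and average errors, and a ratio gap of zero when the model picks the same optimum as the measurements.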
Figure 8. Comparison of the real performance of 8192-length 2D FFT with the predicted performance using profile runs with the same data size.

We see that our performance model successfully identified the optimal distribution ratio, while the execution-time prediction for each distribution ratio had a small error: the maximum error was 2.9%, and the majority was under 5%.

Next, we evaluate the modeling accuracy when predicting the performance of a matrix size different from the profile runs. As profile runs, we use the performance profiles of 512-length 2D FFT using either the CPU or the GPU. Figure 9 compares the predicted and real performance of 1024-length 2D FFT. We see that our performance model successfully found the optimal distribution ratio; the average execution-time prediction error was approximately 4.5%, and the maximum was less than 10.6%.

Figure 9. Comparison of the real performance of 1024-length 2D FFT with the predicted performance using 512-length profile runs.

Figure 10 shows the predicted and real performance of 8192-length 2D FFT. Here, we see relatively large prediction errors.
In total, we saved 0.79 seconds and 1.63 seconds compared to the GPU-only and CPU-only cases, respectively.
5 Related Work

We relate our approach to past attempts to execute FFT on GPUs, as well as to performance modeling of high-performance GPGPU. Note that little research has considered using both CPUs and GPUs simultaneously; as far as we know, there has been no study on performance modeling of FFT using both CPUs and GPUs.

Moreland and Angel presented their method to execute 2D FFT on an NVIDIA GeForce FX 5800 using graphics-specific APIs, including OpenGL and Cg [10]. Fialka and Čadík further improved FFT performance by using an NVIDIA GeForce 6600GT with DirectX 9.0c and HLSL [4].
… the optimal blocking size identified by the model. Our contribution is orthogonal to theirs: combining both models could automate identifying both the optimal blocking size and the optimal load distribution among heterogeneous CPU-GPU configurations. Such an extension is a subject of our future work.

6 Conclusion

We have presented our 2D-FFT library for heterogeneous CPU-GPU configurations, along with its performance model. To find optimal data distribution ratios among CPUs and GPUs, we construct a performance model of our library by dividing the 2D-FFT computation into small sub execution steps, and determine the performance of each step by preliminary profiling runs. Using the determined performance of each sub-step, we derive a mathematical model that predicts the total execution time of 2D FFT for arbitrary data sizes. For a particular 2D matrix, we determine the optimal load distribution ratio by finding the shortest predicted execution time. Our preliminary evaluation has shown that the model can predict the execution time of problems 16 times as large as the profile runs with less than 15% error, and that the predicted optimal load distribution ratios have less than 1% error. The overall performance improvements with the performance model ranged from 2.4% to 50% compared to the CPU-only and GPU-only configurations.

Our future work includes the following. First, for GPUs with CUDA, we plan to optimize the 2D-FFT algorithm by using CUDA-specific features for more efficient use of the potential performance of GPUs. Specifically, CUDA would enable transposing matrices on GPUs as well as on CPUs, thus reducing the data communication overhead substantially. In addition, the current released version (v1.0) of CUDA reportedly reduces the overhead of the GPU control thread compared to the version we have used in this work (v0.8). We expect that the performance of our CPU-GPU library would improve with the newer CUDA as well. Second, we will explore the performance effects of using more CPU cores. We expect that such configurations would magnify the advantage of our model-based approach. Finally, we will extend our library to exploit parallelism not only within a single machine but also across a cluster of machines with GPUs. Our extended performance model will enable the use of such further heterogeneous computing resources without much human burden.

References

[1] F. Almeida, D. Gonzalez, and L. M. Moreno. The master-slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping. Journal of Systems Architecture, 52:105-116, 2006.
[2] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777-786, 2004.
[3] ClearSpeed Technology plc. ClearSpeed white paper: CSX processor architecture. http://www.clearspeed.com/.
[4] O. Fialka and M. Čadík. FFT and convolution performance in image filtering on GPU. In IV '06: Proceedings of the Conference on Information Visualization, pages 609-614, Washington, DC, USA, 2006. IEEE Computer Society.
[5] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, volume 93, pages 216-231, 2005.
[6] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 3, Seattle, WA, 2005.
[7] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), page 6, Tampa, FL, 2006.
[8] N. K. Govindaraju and D. Manocha. GPUFFTW: High performance power-of-two FFT library using graphics processors. http://gamma.cs.unc.edu/GPUFFTW/.
[9] J. Makino, E. Kokubo, and T. Fukushige. Performance evaluation and tuning of GRAPE-6: Towards 40 "real" Tflops. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 2, Phoenix, AZ, 2003.
[10] K. Moreland and E. Angel. The FFT on a GPU. In SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003 Proceedings, pages 112-119, July 2003.
[11] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. http://developer.nvidia.com/object/cuda.html.
[12] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In VECPAR '06: Seventh International Meeting on High Performance Computing for Computational Science, pages 305-318, 2006.
[13] K. D. Underwood, K. S. Hemmert, and C. Ulmer. Architectures and APIs: Assessing requirements for delivering FPGA performance to applications. In Proceedings of SC '06, page 49, 2006.
Acknowledgments

We wish to thank Naga Govindaraju and Akira Nukada for earlier ideas of this research. This work is supported in part by the Microsoft Technical Computing Initiative, and in part by the Japan Science and Technology Agency as a CREST research program entitled "Ultra-Low-Power HPC".

Biographies

Yasuhiko Ogata received his B.Sc. in computer science from Tokyo Institute of Technology in 2007, and is a graduate student in Mathematical and Computing Sciences, Tokyo Institute of Technology. His major research interests include effective use of multi-core and other parallel and distributed systems.
Toshio Endo is an associate professor at the Global COE
program, ”Computationism as a Foundation for the Sci-
ences”, at Tokyo Institute of Technology. He received his
Ph.D. from the University of Tokyo in 2001. His major re-
search interests include distributed computing and parallel
computing on heterogeneous architectures. He is a member
of IEEE-CS and ACM.
Naoya Maruyama is a Ph.D. candidate
at Tokyo Institute of Technology, expected to graduate in
March 2008. He received his Master’s degree at Tokyo In-
stitute of Technology in 2003. His research interests in-
clude cluster and Grid computing, statistical techniques for
system management, and program analysis. His recent re-
search focuses on fault detection and localization through
online system monitoring and modeling.
Satoshi Matsuoka received his Ph. D. from the Univer-
sity of Tokyo in 1993. He became a full Professor at
the Global Scientific Information and Computing Center
(GSIC) of Tokyo Institute of Technology (Tokyo Tech /
Titech) in April 2001, leading the Research Infrastructure
Division Solving Environment Group of the Titech cam-
pus. He has pioneered grid computing research in Japan since the
mid 90s along with his collaborators, and currently serves
as sub-leader of the Japanese National Research Grid Ini-
tiative (NAREGI) project, which aims to create middleware
for next-generation CyberScience Infrastructure. He was
also the technical leader in the construction of the TSUB-
AME supercomputer, which became the fastest supercomputer in the Asia-Pacific in June 2006 at 85 Teraflops (peak,
now 111 Teraflops as of March 2009) and 38.18 Teraflops
(Linpack, 7th on the June 2006 list) and also serves as the
core grid resource in the Titech Campus Grid. He has been
(co-) program and general chairs of several international
conferences including ACM OOPSLA’2002, IEEE CCGrid
2003, HPCAsia 2004, Grid 2006, CCGrid 2006/2007/2008,
as well as countless program committee positions, in par-
ticular numerous ACM/IEEE Supercomputing Conference
(SC) technical papers committee duties including serving
as the network area chair for SC2004, SC2008, and will be
the technical papers chair for SC2009. He served as a Steer-
ing Group member and an Area Director of the Global Grid
Forum during 1999-2005, and recently became the steer-
ing group member of the Supercomputing Conference. He
has won several awards including the Sakai award for re-
search excellence from the Information Processing Society
of Japan in 1999, and recently received the JSPS Prize from
the Japan Society for the Promotion of Science in 2006 from His Royal Highness Prince Akishinomiya.