
426 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 7, NO. 4, DECEMBER 2003

FAIR: A Hardware Architecture for Real-Time 3-D Image Registration
Carlos R. Castro-Pareja, Member, IEEE, Jogikal M. Jagadeesh, Member, IEEE, and Raj Shekhar, Member, IEEE

Abstract: Mutual information-based image registration, shown to be effective in registering a range of medical images, is a computationally expensive process, with a typical execution time on the order of minutes on a modern single-processor computer. Accelerated execution of this process promises to enhance efficiency and therefore promote routine use of image registration clinically. This paper presents details of a hardware architecture for real-time three-dimensional (3-D) image registration. Real-time performance can be achieved by setting up a network of processing units, each with three independent memory buses: one each for the two image memories and one for the mutual histogram memory. Memory access parallelization and pipelining, by design, allow each processing unit to be 25 times faster than a processor with the same bus speed, when calculating mutual information using partial volume interpolation. Our architecture provides superior per-processor performance at a lower cost compared to a parallel supercomputer.

Index Terms: Biomedical image processing, digital systems, image registration, pipeline processing.

Manuscript received June 2, 2003; revised September 6, 2003 and September 15, 2003. This work was supported by the Department of Defense Research Grant DAMD17-99-1-9034.
C. R. Castro-Pareja is with the Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210 USA, and also with the Department of Biomedical Engineering, Lerner Research Institute, The Cleveland Clinic Foundation, Cleveland, OH 44195 USA.
J. M. Jagadeesh is with the Department of Electrical Engineering and the College of Pharmacy, The Ohio State University, Columbus, OH 43210 USA.
R. Shekhar is with the Department of Biomedical Engineering, Lerner Research Institute, The Cleveland Clinic Foundation, Cleveland, OH 44195 USA (e-mail: shekhar@bme.ri.ccf.org).
Digital Object Identifier 10.1109/TITB.2003.821370
1089-7771/03$17.00 © 2003 IEEE

I. INTRODUCTION

MEDICAL image registration is the process of aligning two or more images that represent the same anatomy at different times, from different viewing angles or using different sensors. These images can be of either the same subject or different subjects. Image registration in medical imaging is used to merge or compare images obtained from a variety of modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), single photon emission computed tomography (SPECT), and ultrasound. Common medical applications of image registration are multimodality fusion of anatomical (CT or MRI) and functional (PET or SPECT) images for accurate localization of active tumors, as well as delineation of their shape and size [1]–[3], registration of serial images for monitoring the progression or regression of a disease [4], and postoperative follow-up [5], and brain atlas registration, in which a brain image of a given patient is morphed into a predefined template to identify and label specific regions of the brain [6], [7]. In general, the registration of single-modality images allows monitoring changes over time, whereas the registration of multimodality images combines the complementary structural and functional information about a certain organ.

There are many approaches to three-dimensional (3-D) image registration [8]. To justify hardware implementation, it is imperative that the selected registration algorithm be as general and as flexible as possible. Moreover, this algorithm should be accurate and robust and must not require manual interaction. Voxel similarity-based algorithms fulfill the above criteria better than feature-based approaches [9]. In this paper, we present the hardware implementation of an algorithm that uses the mutual information measure of voxel similarity. Mutual information-based image registration can be fully automatic and applicable to single- or multimodality images of most organs and supports both rigid and nonrigid transformation modes. Most importantly, it is one of the most reliable, robust and promising methods currently available [9]–[12].

Mutual information-based image registration relies on the maximization of the mutual information between two images. Mutual information is a function of two 3-D images and a transformation between them. The 4 × 4 transformation matrix contains information about rotation, translation, scaling, and shear, in the most general case. Mutual information-based registration uses an optimization algorithm that searches for the transformation matrix that orients the two images such that the mutual information between them is maximized. Powell’s method, the downhill Simplex method and simulated annealing are the optimization algorithms commonly employed for the task [13].

Mutual information-based 3-D image registration is an automatic but computationally intensive task, whose typical execution time is on the order of minutes on most modern desktop computers [14], [15]. The total execution time can easily exceed an hour when registering 3-D cardiac image sequences (10–30 images per sequence), an emerging image registration application [16]. Previous attempts to accelerate image registration by using parallel supercomputers [14], [17], [18] achieved significant speed increases, but with a speedup-per-processor ratio smaller than one. Due to communication delays, this ratio tends to decrease as the number of processors increases. Rohlfing et al. [14] report speedup-per-processor ratios between 1.00 and 0.32 for single- and 64-processor configurations, respectively. While such research is very valuable in understanding parallelism, general purpose supercomputers are expensive and usually have limited availability for applications in clinical environments.
CASTRO-PAREJA et al.: FAIR: ARCHITECTURE FOR REAL-TIME 3-D IMAGE REGISTRATION 427

Our research focused on accelerating the calculation of mutual information by analyzing the main speed bottlenecks and overcoming them by developing an optimized hardware architecture for efficient calculation of the mutual information, with the goal of achieving registration times on the order of a second, using fewer processors than necessary when using microprocessor-based computers. Mutual information calculation is a memory-intensive task that does not fully benefit from cache-based memory architectures present in most modern computers. Our novel Fast Automatic Image Registration (FAIR) architecture for hardware-accelerated automatic image registration, presented in this article, enables 3-D image registration at speeds of at least one order of magnitude above the fastest CPU-only software implementation with a higher speedup-per-processor ratio than with standard parallel computers. The improved speedup-per-processor ratio results from a custom pipeline, which reduces the serial component of the algorithm by employing parallel memory access techniques previously used in volume rendering hardware to increase the voxel access rate [19]. As with parallel computers, distributed processing can be used to further enhance the processing speed. Having a higher speedup-per-processor ratio allows us to achieve real-time registration with fewer processing units. Our solution results in a physically smaller, more economical and highly scalable system, which, we believe, will promote a wider use of medical image registration and extend its use into new areas such as four-dimensional (4-D) (3-D space + time) cardiac image registration [20] and image-guided surgery. In the latter application, intraoperative images, such as those obtained from a real-time 3-D ultrasound or MRI scanner, can be used to warp (update) the high-resolution preoperative MRI or CT images [21]–[25].

II. ALGORITHM

A. Registration by Maximization of Mutual Information

Image registration by maximization of mutual information was introduced by Wells et al. [11]. The method attempts to find the transformation $T$ that best aligns a reference image $RI$, with coordinates $x$, $y$, and $z$, and a floating image $FI$.

$$\hat{T} = \arg\max_{T} \, MI\big(RI(\vec{v}),\, FI(T^{-1}\vec{v})\big), \qquad \vec{v} = (x, y, z) \tag{2.1}$$

Mutual information is calculated from individual and joint entropies using the following equation

$$MI(RI, FI) = H(RI) + H(FI) - H(RI, FI) \tag{2.2}$$

The individual entropies $H(RI)$ and $H(FI)$ and the joint entropy $H(RI, FI)$ are computed as follows:

$$H(RI) = -\sum_{a} p_{RI}(a) \log p_{RI}(a) \tag{2.3}$$

$$H(FI) = -\sum_{b} p_{FI}(b) \log p_{FI}(b) \tag{2.4}$$

$$H(RI, FI) = -\sum_{a,b} p_{RI,FI}(a, b) \log p_{RI,FI}(a, b) \tag{2.5}$$

The joint voxel intensity probability $p_{RI,FI}(a, b)$, i.e., the probability that a voxel in the reference image has an intensity $a$ while the corresponding voxel in the floating image has an intensity $b$, can be obtained from the joint or mutual histogram of the two images. The mutual histogram represents the joint intensity probability distribution. In the process of mutual information-based registration, the dispersion of values within the mutual histogram is minimized, which in turn minimizes the joint entropy and maximizes the mutual information.

Calculation of the mutual information can be divided into two steps. The first step is to compute the mutual histogram. In the second step, both individual and joint entropies are calculated from the mutual histogram data. These entropies are then used to obtain the mutual information as per (2.2). Using the mutual histogram to compute the individual entropies also ensures that only those voxels inside the volume of overlap figure in the mutual information calculation.

Since the transformed location of a voxel of the floating image may not coincide with the location of a voxel in the reference image, interpolation is needed. Interpolation also helps in obtaining subvoxel accuracy. Typical interpolation algorithms include nearest neighbor, trilinear interpolation and partial volume interpolation. Partial volume interpolation, as suggested by Maes et al. [12], is used to map the voxels in the reference image to their corresponding locations in the floating image. Nearest neighbor interpolation was not considered because it does not provide subvoxel accuracy. Both trilinear interpolation and partial volume interpolation provide subvoxel accuracy. However, Maes et al. [12] showed that trilinear interpolation typically introduces new intensity levels as a result of interpolation, causing unpredictable variations in mutual histogram values as the transformation matrix changes. On the other hand, partial volume interpolation accumulates the eight interpolation weights directly into the mutual histogram instead of calculating a resulting intensity level and incrementing that intensity level’s mutual histogram count by one, as in trilinear interpolation. Therefore, the main advantage of partial volume over trilinear interpolation is that it produces a mutual histogram whose values change smoothly with small changes in the transformation, thus resulting in a smoother mutual information surface. Fig. 1 shows a comparison between calculating the mutual histogram using trilinear interpolation and partial volume interpolation. Capek et al. [10] showed the difference in mutual information surface smoothness when using different interpolation schemes for mutual information or generalized mutual information (a slight variant of mutual information) calculation. The authors concluded that mutual information, computed according to Maes, provides the smoothest mutual information surface among statistical voxel similarity measures.

Fig. 1. Mutual histogram calculation using (a) trilinear interpolation and (b) partial volume interpolation.

B. Transformation Calculation

The transformation that maps the floating image to the reference image is defined by a 4 × 4 transformation matrix $T$. The transformation matrix contains information about the rotation, translation, scaling and shear parameters that are inherent to the transformation being performed. The scaling parameters also incorporate voxel scaling necessary to compare images with different voxel sizes. Voxel scaling factors are constant for rigid registration and as such are excluded from optimization. Since the inverse transformation $T^{-1}$ is needed to generate and visualize the transformed image, the algorithm tries to estimate it instead of the direct transformation. The location $\vec{v}_{FI}$ of a given reference image voxel $\vec{v}_{RI}$ in the floating image is given by

$$\vec{v}_{FI} = T^{-1}\,\vec{v}_{RI} \tag{2.6}$$

Since the elements of $T^{-1}$ have both integer and fractional components, the transformed location of a reference image voxel also has an integer component $\vec{v}_{i} = (x_i, y_i, z_i)$ and a fractional component $(x_f, y_f, z_f)$. The integer component is used to obtain the base memory address of the corresponding 2 × 2 × 2 neighborhood in the floating image.

According to the partial volume interpolation algorithm, the interpolation weights are calculated from the fractional components and are accumulated into the mutual histogram at the coordinates set by their corresponding reference image and floating image values. Equations (2.7)–(2.14) show the partial volume interpolation calculation process, where $MH$ is the mutual histogram and $\hat{\imath}$, $\hat{\jmath}$ and $\hat{k}$ are unit vectors in the $x$, $y$, and $z$ directions, respectively.

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i})\big) \mathrel{+}= (1 - x_f)(1 - y_f)(1 - z_f) \tag{2.7}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\imath})\big) \mathrel{+}= x_f (1 - y_f)(1 - z_f) \tag{2.8}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\jmath})\big) \mathrel{+}= (1 - x_f)\, y_f (1 - z_f) \tag{2.9}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{k})\big) \mathrel{+}= (1 - x_f)(1 - y_f)\, z_f \tag{2.10}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\imath} + \hat{\jmath})\big) \mathrel{+}= x_f\, y_f (1 - z_f) \tag{2.11}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\imath} + \hat{k})\big) \mathrel{+}= x_f (1 - y_f)\, z_f \tag{2.12}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\jmath} + \hat{k})\big) \mathrel{+}= (1 - x_f)\, y_f\, z_f \tag{2.13}$$

$$MH\big(RI(\vec{v}_{RI}),\, FI(\vec{v}_{i} + \hat{\imath} + \hat{\jmath} + \hat{k})\big) \mathrel{+}= x_f\, y_f\, z_f \tag{2.14}$$

The process of accumulating the interpolation weights for a voxel in the floating image into the mutual histogram can be divided into three steps: (a) calculation of $\vec{v}_{FI}$ as per (2.6), (b) calculation of interpolation weights [right-hand sides (RHS) of (2.7)–(2.14)], and (c) accumulation of interpolation weights into the mutual histogram as per (2.7)–(2.14).

C. Complexity Analysis

Mutual information-based registration usually requires hundreds of iterations (mutual information evaluations), depending on the optimization algorithm used to maximize the mutual information function, the image complexity, and the degree of initial misalignment. If real-time performance (on the order of a second) is required, the mutual information calculation time should be on the order of tens of milliseconds.

Constructing the mutual histogram, the first step in mutual information calculation, involves performing partial volume interpolation $n$ times, where $n$ is less than or equal to the number of voxels in the reference image. The number of operations in the second step, the calculation of mutual information as per (2.2), is a function of the size of the mutual histogram. Since the number of bins in the mutual histogram is usually an order of magnitude smaller than $n$, it is the calculation of the mutual histogram that dominates the mutual information calculation time.
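As an illustration of steps (a) to (c) of Section II-B, the following is a minimal software sketch of mutual histogram accumulation with partial volume interpolation. It is not the FAIR hardware implementation: the function name `pv_mutual_histogram`, the nested-list image layout indexed `[z][y][x]`, and the `transform` callable are assumptions made for this example, and reference voxels whose 2 × 2 × 2 neighborhood falls outside the floating image are simply skipped.

```python
import math

def pv_mutual_histogram(ref, flo, transform, bins=256):
    """Accumulate a mutual histogram with partial volume interpolation.

    ref, flo: 3-D images as nested lists indexed [z][y][x] holding integer
    intensities in [0, bins). transform: hypothetical callable mapping a
    reference voxel (x, y, z) to real-valued floating image coordinates.
    """
    hist = [[0.0] * bins for _ in range(bins)]
    nz, ny, nx = len(flo), len(flo[0]), len(flo[0][0])
    for z, plane in enumerate(ref):
        for y, row in enumerate(plane):
            for x, a in enumerate(row):
                # Step (a): location of the reference voxel in the floating image.
                fx, fy, fz = transform(x, y, z)
                x0, y0, z0 = math.floor(fx), math.floor(fy), math.floor(fz)
                # Skip voxels whose 2x2x2 neighborhood leaves the floating image.
                if not (0 <= x0 < nx - 1 and 0 <= y0 < ny - 1 and 0 <= z0 < nz - 1):
                    continue
                xf, yf, zf = fx - x0, fy - y0, fz - z0
                for dz in (0, 1):
                    for dy in (0, 1):
                        for dx in (0, 1):
                            # Step (b): interpolation weight of this neighbor.
                            w = ((xf if dx else 1.0 - xf)
                                 * (yf if dy else 1.0 - yf)
                                 * (zf if dz else 1.0 - zf))
                            # Step (c): accumulate at (reference, floating) intensities.
                            b = flo[z0 + dz][y0 + dy][x0 + dx]
                            hist[a][b] += w
    return hist
```

Each accepted reference voxel adds eight weights that sum to one, so the histogram total equals the number of voxels inside the volume of overlap. The inner loop also makes the memory traffic discussed in the complexity analysis visible: one reference read, eight floating image reads, and eight read-modify-write updates of the histogram per voxel.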
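The second step of the calculation, obtaining the mutual information from the mutual histogram as per (2.2), can be sketched in a few lines. The function name `mutual_information` and the list-of-lists histogram layout are assumptions of this example; the natural logarithm is used, so the result is in nats.

```python
import math

def mutual_information(hist):
    """Compute MI = H(RI) + H(FI) - H(RI, FI) from a mutual histogram.

    hist: 2-D list of nonnegative counts; rows index reference image
    intensities and columns index floating image intensities.
    """
    total = sum(sum(row) for row in hist)
    if total <= 0:
        raise ValueError("empty mutual histogram")
    # Marginal sums give the individual intensity distributions.
    row_sums = [sum(row) for row in hist]
    col_sums = [sum(col) for col in zip(*hist)]

    def entropy(counts):
        # Entropy in nats; empty bins contribute nothing (0 * log 0 := 0).
        return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

    h_ref = entropy(row_sums)
    h_flo = entropy(col_sums)
    h_joint = entropy(c for row in hist for c in row)
    return h_ref + h_flo - h_joint
```

The cost of this step grows with the number of histogram bins rather than with the number of voxels, which is why it accounts for only a small share of the total calculation time.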

Table I shows the fractions of the total processing time spent on mutual histogram calculation and the remaining computations for different image sizes. Mutual histogram calculation consumes approximately 99% or more of the total computation time for most practical medical images, and its computational share varies only slightly depending on the specific algorithm implementation and the processor, memory architecture, compiler and operating system used to run the algorithm.

TABLE I
MUTUAL HISTOGRAM CALCULATION COMPONENT PER IMAGE SIZE

Since the majority of the registration execution time is spent on calculating the mutual histogram, accelerating mutual histogram calculation has been the focus of our work. Our analysis shows that the partial volume interpolation is the primary performance bottleneck in mutual histogram calculation. At current microprocessor speeds, the time of mutual histogram calculation for 3-D images is dictated almost exclusively by the memory access time. From (2.7)–(2.14), 25 memory accesses are needed to perform partial volume interpolation per voxel of the reference image: 1 to access the reference image voxel, 8 to access the 8-voxel neighborhood in the floating image and 16 accesses to the mutual histogram memory (8 reads and 8 writes). Accesses to the reference image are sequential and benefit from standard caching techniques. The mutual histogram memory has a small size (256 × 256 or 64 K values) and thus accesses to it have high locality of reference. However, the floating image is accessed in a direction across the image that depends on the transformation being applied. Unless there is no rotation component, this direction is not parallel to the direction in which voxels are stored, hence accesses have poor locality and do not benefit from memory-burst accesses or memory-caching schemes. Accesses to the floating image therefore depend almost exclusively on the memory bus speed. Since memory access time does not evolve according to Moore’s law, mutual information-based registration times are not expected to be significantly reduced in the near future by enhanced single-processor computer architectures.

D. Impact of Accelerating Mutual Histogram Calculation

The speedup in registration achieved by accelerating the mutual histogram calculation depends on the share of the overall registration execution spent on calculating the mutual histogram. Equation (2.15) shows Amdahl’s law [26], which gives the resulting overall speedup for a process when a part of it is accelerated. The serial part $F_{s}$ corresponds to the proportion of the overall process execution time that is not being accelerated, while the parallel part $F_{p} = 1 - F_{s}$ corresponds to the proportion that is being accelerated by a factor $S_{p}$.

$$\mathrm{Speedup} = \frac{1}{F_{s} + F_{p} / S_{p}} \tag{2.15}$$

In our case, the serial part corresponds to the time spent on the optimization algorithm and on the accumulation of logarithms performed to obtain the mutual information value from the mutual histogram (2.2)–(2.5), while the parallel part corresponds to the mutual histogram calculation. To determine the effective registration speedup resulting from accelerating mutual histogram calculation, we ran several experiments in which we obtained the total registration time and the time spent on mutual histogram calculations. Mutual histogram calculation time depends on the image size and is shown as a fraction of the total calculation time in Table I. Since mutual histogram calculation time depends on the number of voxels in the image, the maximum speedup predicted by Amdahl’s law also depends on the number of voxels. Typical 3-D medical images are approximately to voxels large. For this range, the expected maximum registration speedup is approximately between 90 and 3000, when the execution time of the parallel part becomes negligible compared with the execution time of the serial part. The minimum calculation time achieved by this speedup is constant for a given dataset, and it is equivalent to the time the computer spends on calculating mutual information from the mutual histogram (accumulation of logarithms) for all iterations leading up to the optimal transformation and executing the optimization algorithm. The majority of this time is spent on computing logarithms, which on average takes 10–30 ms per iteration on most modern computers. In comparison, the time spent on the optimization algorithm is negligible (less than 0.1 ms). Considering that a complete registration usually requires hundreds of iterations, the acceleration of mutual histogram calculation can reduce registration time to no more than a few seconds.

III. MUTUAL HISTOGRAM CALCULATION: SIMILARITIES WITH VOLUME RENDERING

Mutual histogram calculation has many similarities with the ray casting algorithm used for volume rendering. In both cases, a 3-D image is traversed by casting rays through the 3-D dataset and performing interpolation to obtain equally spaced samples along the ray. In the case of mutual histogram calculation, a second volume is traversed too. The reference volume is traversed by casting rays parallel to the $x$ axis, coinciding with the data rows. These same rays are cast through the floating image; however, they may start and end either inside or outside the volume of the floating image, depending on the characteristics of the transformation being applied (rotation, translation, scaling, shearing). Fig. 2 shows an example of a set of rays being cast through the reference image and their possible corresponding accesses in the floating image. Both volume rendering and mutual histogram calculation try to condense 3-D information into a 2-D matrix: the display buffer in the case of volume rendering, and the mutual histogram matrix in the case of mutual histogram calculation. A difference is that volume rendering employs trilinear interpolation, which provides acceptable results for volume visualization, while mutual histogram calculation, as presented here, makes use of partial volume interpolation. This difference changes the focus of optimization on how volume memory is accessed. In volume rendering, the major bottleneck in the interpolation pipeline is volume memory access, which is why it is highly desirable to obtain near static RAM (SRAM) performance for the volume memory. This is solved by using custom memory-addressing and memory-caching techniques [19], [27]–[29] that permit parallel access of a full 2 × 2 × 2 voxel neighborhood (useful for interpolation) and/or allow the parallel access of whole rays along the volume (for compositing). In the case of mutual histogram calculation, a similar bottleneck exists when accessing the floating image. Because the reference image is accessed sequentially, its access can be performed efficiently using memory-burst transfers. On the other hand, the number of accesses to the mutual histogram RAM required for each interpolated voxel is 16 times higher because of the use of partial volume interpolation. Therefore, the performance bottlenecks in mutual histogram calculation are in accessing the floating image and mutual histogram memories.

Fig. 2. Ray casting through (a) the reference image and (b) through the floating image.

IV. FAIR ARCHITECTURE

The FAIR architecture is designed to accelerate mutual histogram calculation by accelerating partial volume interpolation. As described in Sections II-B and II-C, partial volume interpolation is performed by accessing one reference image voxel at a time, calculating the coordinates of the corresponding floating image voxel neighborhood, accessing the floating image voxel locations of the eight elements of the neighborhood, calculating the corresponding interpolation weights and finally updating the mutual histogram. These tasks need to be repeated for each voxel in the reference image. In a standard (single processor) software implementation of the algorithm, these tasks must be executed before processing the next voxel, which means that the total mutual histogram calculation time equals the product of single voxel interpolation time and the number of voxels in the reference image. The FAIR architecture optimizes partial volume interpolation by means of pipelining, parallel memory access, and distributed processing.

A. Pipeline

The first level of algorithm parallelization comes from pipelined execution. The FAIR architecture uses the 3-stage pipeline shown in Fig. 3. This arrangement takes advantage of the independence between the three partial volume interpolation tasks listed in Section II-B. With pipelined execution, the total time of mutual histogram calculation becomes equal to the average pipeline stage calculation time times the number of voxels in the reference image.

The first stage accesses the reference image memory sequentially and calculates the coordinates of the corresponding floating image voxels. The integer part of the coordinates is passed to the floating image RAM controller as the floating image address, while the fractional part is passed to the interpolator. The second stage calculates the interpolation weights and accesses the floating image. The third stage accumulates the interpolation weights into the mutual histogram at the positions given by the reference image value and the corresponding eight floating image voxel values.

B. Parallel Memory Access Scheme

Each pipeline stage must have its own memory bus for parallel execution. The first and second stage memory buses are used for the reference image and floating image memory accesses, respectively, while the third stage memory bus is used for mutual histogram memory accesses. The three memory buses differ in size, speed and access requirements. The reference image memory is accessed sequentially, while the floating image and mutual histogram memories are accessed randomly (i.e., nonsequentially). The reference image and floating image memories have a size of 16 M × 9 each and the mutual histogram memory has a size of 64 K × 32, corresponding to a mutual histogram of size 256 × 256. Since a pipeline’s stage latency equals the latency of the slowest stage, it is important to minimize the memory access time of those stages that require more than one access per voxel.

The mutual histogram memory has the most stringent access speed requirements since it needs to be accessed 16 times per interpolation. Because the mutual histogram memory is small, it can be implemented using high-speed SRAM, which is at least an order of magnitude faster than the dynamic RAMs used for the two images.

Because the 3-D images are large, the use of high-speed SRAM is currently not cost effective for implementing image memories. Between the two images, the reference image has more relaxed requirements, since it is accessed sequentially (in an x-y-z order) to perform interpolation. This kind of access benefits from burst accesses and memory caching techniques, making the use of a single Synchronous Dynamic RAM (SDRAM) bus for image storage a viable option. On the other hand, the floating image must be accessed randomly and has to provide eight voxel values per reference image voxel, which calls for a different way to store the floating image data.

To overcome the floating image memory access bottleneck, the FAIR architecture employs memory-addressing techniques similar to the ones used in volume rendering hardware to speed up interpolation. Doggett and Meißner [19] presented a parallel memory addressing scheme called Cubic Addressing. The main advantage of Cubic Addressing is that it enables parallel access to a 2 × 2 × 2 voxel neighborhood, thus greatly reducing the number of sequential memory accesses otherwise needed for interpolation. It uses eight parallel memory modules and a custom address decoder to calculate the voxel addresses.
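To see why eight memory modules are enough for conflict-free access to any 2 × 2 × 2 neighborhood, consider the following small sketch. It assumes that the module index is formed from the parity bits of the three voxel coordinates; this particular mapping and the names `module_index` and `neighborhood_modules` are illustrative assumptions of ours, not details taken from [19] or from the FAIR address decoder.

```python
from itertools import product

def module_index(x, y, z):
    """Memory module (0-7) holding voxel (x, y, z), from coordinate parities.

    Assumed mapping for illustration: one parity bit per axis.
    """
    return (x & 1) | ((y & 1) << 1) | ((z & 1) << 2)

def neighborhood_modules(x0, y0, z0):
    """Modules touched by the 2x2x2 neighborhood whose lowest corner is (x0, y0, z0)."""
    return sorted(module_index(x0 + dx, y0 + dy, z0 + dz)
                  for dx, dy, dz in product((0, 1), repeat=3))
```

Under this assumption, every neighborhood touches each of the eight modules exactly once regardless of where its lowest corner lies, so all eight voxel values can be fetched in a single parallel access.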

Fig. 3. A FAIR processing unit.

floating image memory. This scheme takes advantage of the fast


burst-mode access of SDRAMs to reduce the number of nec-
essary parallel memory modules. Size-2 burst accesses take the
same amount of time as single accesses on SDRAMs. By storing
neighboring voxels sequentially along the direction, a size-2
burst access to the four memory modules, starting on the neigh-
borhood voxels with the lowest coordinate, will retrieve infor-
mation about two adjacent 2 2 - -plane neighborhoods. A
caching scheme here would not enhance performance because
the SDRAM random access time is comparable to the time re-
quired to perform the 16 mutual histogram SRAM accesses.

C. Distributed Processing

Mutual histogram calculation lends itself well to paral-


lelization by dividing up the reference image into a number of
nonoverlapping subvolumes and distributing them among an
equal number of processing units, each with its own mutual
histogram calculation pipeline [14]. Each processing unit has
the necessary RAM to store its part of the reference image, the
full floating image and the mutual histogram. A full copy of
the floating image is needed at each processing unit because
the portion of the floating image that matches the reference
Fig. 4. Memory module assignments of voxels in Cubic Addressing. image subvolume depends on the transformation and can be
(a) Original scheme described in [19]. (b) Burst-access cubic addressing located anywhere inside the floating image (Fig. 5). Each
implemented in the FAIR architecture.
processing unit will calculate the partial mutual histogram
corresponding to its subvolume. When all processing units are
This memory arrangement is shown in Fig. 4(a). A similar ad- finished computing, the full mutual histogram can be obtained
dressing scheme that employs only four parallel memory mod- by adding the partial mutual histograms. As shown in Fig. 3,
ules [Fig. 4(b)] is used in the FAIR architecture to access the each processing unit has an input port to add the partial mutual
432 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 7, NO. 4, DECEMBER 2003

TABLE II
COMPARISON OF MUTUAL INFORMATION CALCULATION TIME IN MS

FPGAs was the number of I/O pins. Internal resource utiliza-


tion was about 60%. The image RAMs were implemented
using PC100 SDRAMs, and the mutual histogram RAM was
Fig. 5. (a) Division of the reference image into four subvolumes to be implemented using high-speed SRAMs. The board operated at
assigned to four individual FAIR processing units for distributed computing.
(b) Possible gray level-coded floating image regions corresponding to reference
a clock rate of 80 MHz. It was connected to a host computer
image subvolumes. using a PCI-7200 digital I/O board by ADLINK Technology
(Taipei, Taiwan). In Tables II and III, the actual timings are
histogram of the previous unit and an output port to transmit compared with analogous software implementation timings on
the result to the next unit or the host computer. The resulting a 1-GHz Pentium III computer with a 133 MHz memory bus. In
mutual histogram calculation time can be obtained as our implementation, the ratio was equal to 2.67. The
$$T = \frac{T_v N}{P} + T_t M + T_l \qquad (4.1)$$

where $T_v$ is the calculation time per voxel (equal to the pipeline stage latency); $N$, the number of voxels in the reference image; $M$, the size of the mutual histogram matrix (counted in bins); $T_t$, the time required to transmit one mutual histogram value between two processing units or between a processing unit and the host computer; and $P$, the number of processing units. The first term in the formula is the time it takes for each unit to calculate its share of the mutual histogram. The second term corresponds to the time spent on transmitting the mutual histogram to the host computer. The third term, the latency $T_l$, is negligible and accounts for the time required to fill the mutual histogram calculation pipelines and the partial mutual histograms addition pipeline. Given that interpolation of a voxel requires 25 memory accesses, the upper bound for the mutual histogram calculation speedup factor of an implementation with $P$ processing units with respect to a computer (ignoring $T_l$) is as follows:

$$S_{max} = \frac{25 N / f_{RAM}}{N / (P f_{SDRAM}) + T_t M} \qquad (4.2)$$

where $f_{RAM}$ is the computer’s RAM bus clock frequency and $f_{SDRAM}$ is the processing unit’s SDRAM clock frequency. This upper bound is the worst-case scenario and assumes that there are no cache hits when calculating the complete mutual histogram on the computer. Furthermore, the maximum theoretical speedup-per-processor achievable using this architecture is 25, when $T_t = 0$ and $f_{SDRAM} = f_{RAM}$.

V. IMPLEMENTATION AND RESULTS

A. Time

As a proof of concept, the FAIR architecture was implemented in an external prototype board using Altera ACEX 1K100 FPGAs. We implemented two processing units per board, each using two FPGAs. The limiting resource on the

upper limit of the speedup presented in Table III was calculated using (4.2). The size of the mutual histogram was $256 \times 256$. The software calculation time depends on the number of cache misses, which in turn depends on the direction in which the floating image is accessed. To reflect the different possible cases, the average software calculation times were obtained by performing a series of mutual information calculations with transformations covering the whole range of image rotation values in 5-degree increments. Our system achieved a significant speedup-per-processor ratio in all cases. Using a faster memory bus (i.e., increasing $f_{SDRAM}$) and a faster I/O bus to transmit the mutual histogram to the host computer (i.e., reducing $T_t$) would increase the speed further.

B. Accuracy

Mutual histogram calculation involves transforming the coordinates of the voxels of the reference image to obtain their corresponding locations in the floating image, and performing partial volume interpolation to calculate the weights to be added to the mutual histogram. Typical mutual histogram sizes are between $32 \times 32$ and $256 \times 256$. Studholme [30] showed that varying the mutual histogram size in this range does not affect the outcome of registration significantly.

The largest possible mutual histogram entry is equal to the size of the reference image, a situation that arises when each image is uniform (has a single intensity). Because 3-D images used in medical applications commonly have on the order of $2^{24}$ voxels ($256 \times 256 \times 256$ sized image), the smallest word length for the mutual histogram RAM should be 24 bits, for positive integer mutual histogram values. It is also necessary to have support for fractional values inside the mutual histogram, since partial volume interpolation provides a set of eight fractional values that are accumulated into the mutual histogram. Any numerical representation used to calculate the mutual histogram should therefore have more than 24 bits in the mantissa. This requirement rules out the use of single-precision floating-point numbers to accumulate and store the mutual histogram, since their mantissa is only 23 bits long. Better alternatives for mutual histogram accumulation and storage are double-precision floating-
TABLE III
MUTUAL INFORMATION SPEEDUP RATES
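As a rough numerical companion to the speedup figures of Table III, the following Python sketch evaluates a timing model of the kind described around (4.1) and (4.2). Its exact form is an assumption, not a transcription of the paper's formulas: hardware time is taken as a per-unit compute share plus a histogram transmission term plus a latency term, one voxel is assumed to complete per SDRAM clock, and the software worst case is taken as 25 uncached memory accesses per voxel. All function and parameter names are hypothetical, as is the transmission time used in the example.

```python
ACCESSES_PER_VOXEL = 25  # memory accesses per interpolated voxel, from the text


def hardware_time(n_voxels, n_bins, n_units, f_sdram_hz,
                  t_transmit_s=0.0, t_latency_s=0.0):
    """Assumed hardware model: compute share + histogram transfer + latency."""
    t_voxel_s = 1.0 / f_sdram_hz              # assumption: one voxel per clock
    compute = t_voxel_s * n_voxels / n_units  # each unit handles N / P voxels
    transmit = t_transmit_s * n_bins          # one transfer per histogram bin
    return compute + transmit + t_latency_s


def software_time_worst_case(n_voxels, f_ram_hz):
    """Worst case on a computer: every memory access misses the cache."""
    return ACCESSES_PER_VOXEL * n_voxels / f_ram_hz


def speedup_upper_bound(n_voxels, n_bins, n_units, f_sdram_hz, f_ram_hz,
                        t_transmit_s=0.0):
    """Ratio of worst-case software time to modeled hardware time."""
    hw = hardware_time(n_voxels, n_bins, n_units, f_sdram_hz, t_transmit_s)
    return software_time_worst_case(n_voxels, f_ram_hz) / hw


if __name__ == "__main__":
    n_voxels = 256 ** 3   # reference image size
    n_bins = 256 * 256    # mutual histogram size
    # Clock rates quoted in the paper; the 1 microsecond transfer time
    # per histogram value is a made-up illustration.
    print(speedup_upper_bound(n_voxels, n_bins, 1, 80e6, 133e6))
    print(speedup_upper_bound(n_voxels, n_bins, 1, 80e6, 133e6,
                              t_transmit_s=1e-6))
```

Under these assumptions the bound per unit reduces to 25 times the ratio of the two clock frequencies when the transmission time is zero, and any nonzero transmission time lowers it, which matches the qualitative statements in the text.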

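The word-length argument of Section V-B can be checked with a few lines of Python. The sketch below is only an illustration under stated assumptions: it emulates single precision by round-tripping through `struct`, and it models the 32-bit fixed-point format (24 integer bits, 8 fractional bits) with plain Python integers. The helper names are hypothetical and do not come from the FAIR implementation.

```python
import struct

FRACTION_BITS = 8            # fractional bits of the fixed-point word
SCALE = 1 << FRACTION_BITS   # 256 fixed-point steps per unit weight
WORD_BITS = 32               # total word length of one histogram entry


def as_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def float32_accumulate(total, weight):
    """Add a weight to a histogram entry stored in single precision."""
    return as_float32(as_float32(total) + weight)


def fixed_accumulate(raw_total, weight):
    """Add a weight in [0, 1] to an entry stored as 24.8 fixed point."""
    raw = raw_total + int(round(weight * SCALE))
    if raw >= 1 << WORD_BITS:
        raise OverflowError("histogram entry exceeds the 32-bit word")
    return raw


def fixed_to_float(raw_total):
    """Convert a 24.8 fixed-point entry back to a real value."""
    return raw_total / SCALE


if __name__ == "__main__":
    almost_full = 256 ** 3 - 1   # entry just below 2**24 whole counts
    weight = 0.25                # one fractional interpolation weight

    # Single precision: the fractional weight is rounded away.
    print(float32_accumulate(float(almost_full), weight))

    # 24.8 fixed point: the same weight is kept exactly.
    print(fixed_to_float(fixed_accumulate(almost_full * SCALE, weight)))
```

Near $2^{24}$ the spacing between adjacent single-precision values is already 1, so fractional weights vanish, whereas the fixed-point word still resolves steps of 1/256.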
point numbers and fixed-precision numbers with a word length greater than 24 bits. Double-precision floating-point numbers have a dynamic range that is far in excess of what is necessary to perform mutual histogram calculation since all values are aligned (making the exponent bits unnecessary). Furthermore, implementing floating-point arithmetic requires considerably more resources than fixed-point, so we decided to use a fixed-point representation to accumulate and store the mutual histogram.

In our prototype system, partial volume interpolation was implemented using a 32-bit, fixed-point approach. The system used 8 bits for the fractional part, resulting in an accuracy of 1/256th of a voxel dimension, and 24 bits for the integer part. To validate our approach, the fixed-point implementation was compared with a C++ implementation using double-precision floating-point accuracy. The rounding effect from the use of fixed-point arithmetic produced an offset error that resulted in the mutual information surface being elevated with respect to its analytical version, and a reduction in the dynamic range of the mutual information values across the mutual information surface by a small factor (less than 5% for 32-bit fixed-point), equivalent to a global linear scaling. These errors changed neither the overall shape of the mutual information surface nor the location of the maximum. The accuracy of registration was therefore not affected. Experiments using both single-modality (MRI against MRI and CT against CT) and multimodality (CT against MRI, PET against MRI, and PET against CT) data sets yielded practically the same results (with errors on the order of 1/100th of a voxel dimension) using hardware-accelerated and software-based registration approaches.

VI. CONCLUSION

Image registration through maximization of mutual information is computationally intensive, demanding execution times on the order of minutes on modern desktop computers. Because this algorithm is memory access limited, the continuing rise in microprocessor speed leads to only a moderate increase in the algorithm’s speed. To overcome this fundamental computing limitation, we developed a new hardware architecture, called FAIR, for distributed, real-time 3-D image registration.

The FAIR architecture derives its speed from 1) a custom interpolation pipeline with independent memory buses and 2) distributed processing. A practical implementation with standard PC100 SDRAMs, operating at 80 MHz, provided an 8.3-fold speed increase compared to a 1-GHz Pentium III computer with PC133 SDRAMs running at 133 MHz. This speedup, even when using just one processing unit, was significant. For additional speed, the modularity of the architecture can be exploited to efficiently implement arrays of processing units using VLSI or FPGAs to perform distributed image registration. The FAIR architecture allows reaching the maximum possible speed predicted by Amdahl’s law using far fewer processing units than standard multiprocessor computers. Distributed processing alone can be implemented using multiprocessor computers or computer clusters, but the benefits of using parallel memory access in the pipeline are not gained, thus yielding significantly lower speedup-per-processor ratios, as reported earlier [14]–[18]. Custom processing units are faster, more compact, more power conserving, and significantly less expensive than the nodes of a parallel supercomputer, resulting in a smaller and more economical system suitable for clinical use. The cost of our prototype board, housing two units (processors), was approximately two thousand dollars. A comparable speedup can be obtained using a 16-processor parallel computer, but at a cost of tens to hundreds of thousands of dollars. Real-time 3-D image registration made possible by the FAIR architecture may lead to wider adoption of this useful technology in the diagnosis and treatment of human diseases.

REFERENCES

[1] S. M. Larson, C. R. Divgi, and A. M. Scott, “Overview of clinical radioimmunodetection of human tumors,” Cancer, vol. 73, no. 3, Feb. 1994.
[2] J. G. Rosenman, E. P. Miller, G. Tracton, and T. J. Cullip, “Image registration: an essential part of radiation therapy treatment planning,” Int. J. Radiation Oncol., Biol., Phys., vol. 40, no. 1, pp. 197–205, Jan. 1998.
[3] T. Nishioka et al., “Image fusion between 18FDG-PET and MRI/CT for radiotherapy planning of oropharyngeal and nasopharyngeal carcinomas,” Int. J. Radiation Oncol., Biol., Phys., vol. 53, no. 4, pp. 1051–1057, July 2002.
[4] S. De Santi et al., “Hippocampal formation glucose metabolism and volume losses in MCI and AD,” Neurobiol. Aging, vol. 22, no. 4, pp. 529–539, July 2001.
[5] C. R. G. Guttmann et al., “Quantitative follow-up of patients with multiple sclerosis using MRI: reproducibility,” J. Magn. Reson. Imag., vol. 9, no. 4, pp. 509–518, Apr. 1999.
[6] B. M. Dawant, S. L. Hartmann, and S. Gadamsetty, “Brain atlas deformation in the presence of large space-occupying tumors,” in Proc. Medical Image Computing and Computer-Assisted Intervention, MICCAI’99, Lecture Notes in Computer Science, vol. 1679, 1999, pp. 589–596.
[7] J. Mazziotta et al., “A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM),” Philosophical Trans. Royal Soc. London Series B, Biolog. Sci., pp. 1293–1322, Aug. 2001.
[8] J. B. Maintz and M. Viergever, “A survey of medical image registration,” Med. Image Anal., vol. 2, no. 1, pp. 1–36, 1998.
[9] M. Holden et al., “Voxel similarity measures for 3-D serial MR brain image registration,” IEEE Trans. Med. Imag., vol. 19, pp. 94–102, Feb. 2000.
[10] M. Capek et al., “Robust and fast medical registration of 3D-multi-modality data sets,” in Proc. Medicon 2001, IX Mediterranean Conf. Medical and Biological Engineering and Computing, 2001, pp. 515–518.
[11] W. M. Wells, P. Viola, H. Atsumi, S. Nakajima, and R. Kikinis, “Multi-modal volume registration by maximization of mutual information,” Med. Image Anal., vol. 1, pp. 35–51, 1996.
[12] F. Maes et al., “Multimodality image registration by maximization of mutual information,” IEEE Trans. Med. Imag., vol. 16, pp. 187–198, Apr. 1997.
[13] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever, “Mutual-information-based registration of medical images: a survey,” IEEE Trans. Med. Imag., vol. 22, pp. 986–1004, Aug. 2003.
[14] T. Rohlfing and C. R. Maurer, “Non-rigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees,” IEEE Trans. Inform. Technol. Biomed., vol. 7, pp. 16–25, Mar. 2003.
[15] T. Netsch, P. Rösch, A. van Muiswinkel, and J. Weese, “Toward real-time multi-modality 3-D medical image registration,” in Proc. Eighth IEEE Int. Conf. Computer Vision, ICCV 2001, vol. 1, 2001, pp. 718–725.
[16] R. Shekhar, V. Zagrodsky, M. Garcia, and J. D. Thomas, “3D Stress echocardiography: a novel application based on registration of real-time 3D ultrasound images,” in Proc. Computer Assisted Radiology and Surgery (CARS), 2002, pp. 873–878.
[17] S. K. Warfield, F. Jolesz, and R. Kikinis, “A high performance approach to the registration of medical imaging data,” Parallel Computing, vol. 24, no. 9–10, pp. 1345–1368, 1998.
[18] G. E. Christensen, M. I. Miller, M. W. Vannier, and U. Grenander, “Individualizing neuroanatomical atlases using a massively parallel computer,” IEEE Comput., vol. 29, pp. 32–38, Jan. 1996.
[19] M. Doggett and M. Meißner, “A memory addressing and access design for real time volume rendering,” in Proc. 1999 IEEE Int. Symp. Circuits Syst., ISCAS ’99, vol. 4, 1999, pp. 344–347.
[20] R. Shekhar and V. Zagrodsky, “Mutual information-based rigid and nonrigid registration of ultrasound volumes,” IEEE Trans. Med. Imag., vol. 21, pp. 9–22, 2002.
[21] I. Kaplan et al., “Real time MRI-ultrasound image guided stereotactic prostate biopsy,” Magnetic Resonance Imaging, vol. 20, pp. 295–299, 2002.
[22] B. C. Porter et al., “Three-dimensional registration and fusion of ultrasound and MRI using major vessels as fiducial markers,” IEEE Trans. Med. Imag., vol. 20, pp. 354–359, Apr. 2001.
[23] B. Brendel, S. Winter, A. Rick, M. Stockheim, and H. Ermert, “Registration of 3D CT and ultrasound datasets of the spine using bone structures,” Computer Aided Surg., vol. 7, no. 3, pp. 146–155, 2002.
[24] S. K. Warfield et al., “Real-time registration of volumetric brain MRI by biomechanical simulation of deformation during image guided neurosurgery,” Computing and Visualization in Science, vol. 5, no. 1, pp. 3–11, July 2002.
[25] A. Roche, X. Pennec, G. Malandain, and N. Ayache, “Rigid registration of 3-D ultrasound with MR images: a new approach combining intensity and gradient information,” IEEE Trans. Med. Imag., vol. 20, pp. 1038–1049, Oct. 2001.
[26] J. L. Hennessy, D. A. Patterson, and D. Goldberg, Computer Architecture: A Quantitative Approach, 2nd ed. San Francisco, CA: Morgan Kaufmann, 1996.
[27] M. De Boer, A. Gröpl, J. Hesser, and R. Männer, “Latency- and hazard-free volume memory architecture for direct volume rendering,” in Proc. Eleventh Eurographics Workshop on Graphics Hardware, Aug. 1996, pp. 109–119.
[28] G. Knittel, “Verve: voxel engine for real-time visualization and examination,” Computer Graphics Forum, vol. 12, no. 3, pp. 37–48, Mar. 1993.
[29] H. Pfister and A. Kaufman, “Cube-4 - a scalable architecture for real-time volume rendering,” in Proc. Symp. Volume Visualization, Oct. 1996, pp. 47–54.
[30] A. Studholme, “Measures of 3D medical image alignment,” Ph.D. dissertation, Univ. London, London, U.K., 1997.

Carlos R. Castro-Pareja (M’00) received the B.Sc. degree in electrical engineering from the Pontificia Universidad Catolica del Peru, Lima, Peru, in 1999, and the M.Sc. degree in electrical engineering from The Ohio State University, Columbus, in 2001.
He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, The Ohio State University. His research interests include development and hardware acceleration of image processing algorithms.

Jogikal M. Jagadeesh (M’74) received the B.S. degree from University College, Bangalore, India, the M.S. degree from the Indian Institute of Science, Bangalore, and the Ph.D. degree from The Ohio State University, Columbus, in 1962, 1964, and 1974, respectively.
He is currently an Associate Professor with the Department of Electrical Engineering and the College of Pharmacy, The Ohio State University. His recent research interests include signal processing and pattern analysis of bioelectric potentials, simulation and modeling of biological systems, and medical imaging, particularly high-field magnetic resonance imaging.
Dr. Jagadeesh has served on the Editorial Board of IEEE Computer.

Raj Shekhar (M’94) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 1989, the M.S. degree in bioengineering from the Arizona State University, Tempe, in 1991, and the Ph.D. degree in biomedical engineering from The Ohio State University, Columbus, in 1997.
He worked as a Senior Research Engineer for Picker International (now Philips Medical Systems) for two years before joining the Department of Biomedical Engineering, Cleveland Clinic Foundation, Cleveland, OH, in December 1998. His research interests include medical imaging, image processing, and computer graphics, and, currently, he leads research on real-time 3-D ultrasound and multimodality imaging.
