
Real-Time Image Processing on a Custom Computing Platform

Peter M. Athanas
A. Lynn Abbott
Virginia Polytechnic Institute and State University

Several aspects of image processing make it computationally challenging. A single image represents a data set of considerable size, typically 256K picture elements, or pixels, for a black-and-white image. Many tasks require that several operations be performed on each pixel in the image. Furthermore, when real-time operations are needed, they must be performed at live video rates, typically 30 images per second. To keep up with these capacious data rates and demanding computations in real time, the processing engine must provide specialized data paths, application-specific operators, creative data management, and careful sequencing and pipelining.
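As a rough illustration of the data rates described above, the following sketch works through the arithmetic for a 512 x 512 monochrome image at 30 images per second. The 8-bit pixel depth is an assumption taken from the typical frame format mentioned elsewhere in the article; the variable names are purely illustrative.

```python
# Back-of-the-envelope data rates for real-time monochrome video.
ROWS, COLS = 512, 512         # 512 x 512 = 262,144 pixels ("256K")
FRAMES_PER_SECOND = 30        # live video rate quoted in the text
BITS_PER_PIXEL = 8            # assumption: 8-bit monochrome pixels

pixels_per_frame = ROWS * COLS
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND
bits_per_second = pixels_per_second * BITS_PER_PIXEL
ns_per_pixel = 1e9 / pixels_per_second   # average time budget per pixel

print(pixels_per_frame)        # 262144
print(pixels_per_second)       # 7864320
print(bits_per_second)         # 62914560
print(round(ns_per_pixel, 1))  # 127.2
```

Under these assumptions, a processor that performs several operations per pixel has on average about 127 ns to complete all of them for each pixel, which gives a feel for why specialized data paths and pipelining are attractive.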
Hardware designers typically must perform extensive behavioral test-
ing of a new concept before proceeding with an implementation. Due to
the enormous processing time required to simulate a complex image-
processing system, executing a VHDL model with a representative data
set even on a fast workstation is not practical. Days, or even weeks, are
commonly needed to simulate the processing of a single full-sized image.
And since some applications process sequences of images, designers may
need several hundred image simulations to adequately analyze only a few
seconds of data. Because of this, they are often forced into a trade-off
between how much testing can be afforded versus an acceptable risk in
allowing a silicon iteration.
We discuss an alternative, automated approach: transforming the struc-
tural representation (or transforming a behavioral model) into a real-time
implementation. With our system, a designer can proceed from a behav-
ioral description of the image-processing task to a functioning prototype
that can perform the task at full speed (rapid prototyping). Reconfigura-
tion from one image-processing task to another does not require physical
changes but is accomplished by downloading a hardware personalization
database to a novel computing platform. Reconfiguration takes just sec-
onds. A designer with this capability has

• a means for evaluating the performance of an experimental algorithm/architecture, and
• a working component that can be used in the development and testing of a much larger system.

We chose an experimental custom computing platform called Splash-2 to investigate this approach to prototyping real-time image-processing designs.¹ Custom computing platforms are emerging as a class of computers that can provide near application-specific computational performance. Designers can also configure them for a variety of tasks. Such platforms let designers customize specific operations for function and size, and data paths for individual applications.

Computer 0018-9162/95/$4.00 © 1995 IEEE


February 1995

We developed a real-time image-processing system called VTSplash, based on the Splash-2 general-purpose platform. Splash-2 is an attached processor featuring programmable processing elements (PEs) and communication paths. The Splash-2 system uses arrays of RAM-based field-programmable gate arrays (FPGAs), crossbar networks, and distributed memory to accomplish the needed flexibility and performance. Even though Splash-2 was not designed specifically for image processing, its architectural properties are suited for the computation and data-transfer rates characteristic of this class of problems. The price/performance ratio of this system also makes it competitive with conventional real-time image-processing systems.

In this article, we explore the utility of custom computing machinery for accelerating the development, testing, and prototyping of a diverse set of image-processing applications. We first summarize architectural aspects of high-speed image processing. We next provide a synopsis of pertinent architectural features of the Splash-2 processor and describe its development environment. We then describe several image-processing tasks implemented on Splash-2 and conclude with a discussion of task performance.

ARCHITECTURAL ASPECTS OF IMAGE PROCESSING
Conventional general-purpose machines cannot manage the distinctive I/O requirements of most image-processing tasks; neither do they take advantage of the opportunities for parallel computation present in many vision-related applications. Parallel processing systems such as mesh computers or pipelined processors have been successfully applied to some image-processing tasks. Mesh architectures often provide very large speedup after an image is loaded, but overall performance often suffers severely from I/O limitations. Pipelined machines can accept image data in real time from a camera or other source, but historically they have proven difficult to reconfigure for various processing tasks. (The sidebar “Architectural considerations for image processing” further discusses the unique requirements of image-processing architectures.)

ARCHITECTURAL CONSIDERATIONS FOR IMAGE PROCESSING
The goal of many image-processing tasks is to transform an input image into a new, enhanced version of the original. In some cases, each output pixel (or picture element) can be computed as a function of a small neighborhood of adjacent pixels from the input image. Figure A shows an example 3 x 3 neighborhood. Each output pixel depends on a different neighborhood in the original image. Conceptually, therefore, the output image is produced by sliding a 3 x 3 window over the input image, with an output pixel resulting for each new location of the window.

Figure A. ... image array. Each cell represents one pixel, which is commonly 8 bits for a monochrome image. The shaded area indicates a 3 x 3 neighborhood centered about pixel (3,4).

The choice of neighborhood operation determines the appearance of the output image. A weighted sum of neighborhood pixels, for example, could result in smoothed (low-pass filtered) or edge-enhanced (high-pass filtered) output images. Median filtering generates the median value of each neighborhood. Many other filter types are possible.¹⁻³

Although the nine pixels of a 3 x 3 neighborhood are spatially localized in the physical image, this is not true in the signals produced by most video sources. For example, a typical video camera produces pixels in raster order, which means that the pixel values are generated serially beginning with row 0, followed by row 1, and so on. Figure B illustrates this process.

Figure B. Example image in raster order. Pixels are produced serially in row-major order. Highlighted pixels represent a single 3 x 3 image neighborhood.

For processing purposes, the straightforward approach is to store the entire input image into local memory and access pixels as needed to produce the output image. However, this approach results in a latency of at least an entire image frame time before the processor can begin to generate output pixels. This latency can be reduced to less than the time of n rows (for an n x n neighborhood) in an architecture carefully designed to interleave memory reads and writes, effectively utilizing memory as a delay line. We used Splash-2 to implement both of these processing methods.

References
1. R.M. Haralick and L.G. Shapiro, Computer and Robot Vision, Vol. 1, Addison-Wesley, Reading, Mass., 1992.
2. B. Jähne, Digital Image Processing, Springer-Verlag, New York, 1991.
3. A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd ed., Academic Press, New York, 1982.

Image data are typically produced and conveyed in raster order, that is, pixels are presented serially, left-to-right for each image row, beginning with the top row. If a typical image frame is 512 rows x 512 columns of 8-bit pixels, the total data in a single frame is 262,144 pixels, or 2
megabits. This not only presents a computational challenge but also poses storage problems. Some image-related tasks, such as compression, tracking, and motion compensation, need information distributed in a single image (spatially distributed data) and also depend on data present in previous frames (temporally distributed data). Because of this, the processor must store and retrieve multiple frames quickly.

For simplicity, we assume that data represent monochrome light intensity values. Of course, pixel data are not restricted to represent only quantized brightness information. The applications discussed here are also relevant to other types of image data, such as

• color images;
• range images, for which each pixel represents a distance value;
• X-ray, ultrasound, or electron-microscope images, where each pixel depends on object density or other physical phenomena; and
• CT (computer tomography) images, for which each 2D (two-dimensional) image represents a reconstructed “slice” of density information within a 3D array.

CUSTOM COMPUTING HARDWARE
Here we review the properties of custom computing machines, using the Splash-2 platform as an example. We then show how the custom computing machine can be used as a component in a real-time processing system.

The Splash-2 platform
Splash-2 is a second-generation processor designed by the Supercomputing Research Center in Bowie, Maryland. It achieves high computational performance by executing an application in hardware customized to the needs of individual applications. A Splash-2 system consists of one to 15 Splash-2 array boards, an interface board, and a Sun SparcStation-2 host. Each array board contains 16 PEs, denoted as X1 through X16, arranged linearly and fully connected through a 16 x 16 crossbar switch. A seventeenth control element, X0, regulates the crossbar network. Figure 1a is a system block diagram.

Figure 1. The Splash-2 architecture: (a) Each board contains 16 PEs (and one control processor) interconnected by neighbors and through a crossbar switch. Up to 15 processor array boards can be connected to an interface board; (b) Each PE consists of an FPGA and an SRAM.

Each PE within the Splash-2 array board (identified as X1 through X16 in Figure 1a and expanded in Figure 1b) consists of one FPGA and one fast static memory. The Xilinx XC4010 FPGA used in each PE consists of a 2D array of configurable logic blocks that can be connected internally with reconfigurable interconnection resources. Both the logic blocks and the interconnection resources are programmable through the host computer. Computational operations are implemented as logic circuits constructed within the FPGAs by partitioning the operation into individual blocks and then interconnecting them as required with the programmable switches. A fast 256K x 16 static RAM (SRAM) is attached to each FPGA, which allows one read or write access per clock cycle. Each PE has three 36-bit bidirectional data paths: one each to the left and right neighboring PEs and one to the crossbar switch. In addition, a 16-bit path exists between the FPGA and its SRAM. Several 1-bit signals support broadcasts, handshakes, and other special functions.

The input data stream to the Splash-2 processor array is provided by the interface board with a 36-bit SIMD bus to the X0 of each processor array board (and to the X1 of the first processor array board). The output data stream can be linked to multiple array boards by extending this stream from the X16 of one board to the X1 of the next. The output stream produced from the last array board is returned to the interface board. The control paths between the Sun SparcStation-2 host and the application program running on Splash-2 consist of a set of handshake registers (two on each Splash-2 array board), a global AND/OR mechanism, a broadcast signal, direct access to on-board memory, and an interrupt mechanism.

The crossbar network contains 16 36-bit bidirectional ports for augmenting interprocessor communications. Splash’s crossbar switches can be used for both static and dynamic architecture adjustments. Static adjustments establish the data paths for fixed systolic-like tasks, while dynamic adjustments accommodate more complex data-movement paradigms. The control element X0 selects the interconnection structure used in any given clock cycle. The crossbar allows point-to-point, multicast, and broadcast communication between all PEs on each processing board and can readily change the topology to a mesh, linear array, hypercube, or other custom configuration. (Arnold, Buell, and Davis¹ provide more information on Splash-2.)
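To relate the per-PE memory described above to the 512 x 512, 8-bit frame format used throughout the article, the arithmetic can be sketched as follows. This is only a worked check of the quoted figures; the variable names are illustrative, and packing two pixels into each 16-bit word is an assumption about how an application might use the SRAM, not a documented Splash-2 convention.

```python
# Storage arithmetic for one Splash-2 processing element (PE).
SRAM_WORDS = 256 * 1024       # "256K" words in each PE's static RAM
SRAM_WORD_BITS = 16           # 16-bit words, matching the FPGA-SRAM path
FRAME_ROWS, FRAME_COLS = 512, 512
BITS_PER_PIXEL = 8            # monochrome pixels
PES_PER_BOARD = 16            # X1 through X16 (control element X0 excluded)

sram_bits = SRAM_WORDS * SRAM_WORD_BITS
sram_bytes = sram_bits // 8
frame_bits = FRAME_ROWS * FRAME_COLS * BITS_PER_PIXEL

frames_per_pe = sram_bits // frame_bits
pixels_per_word = SRAM_WORD_BITS // BITS_PER_PIXEL   # assumed packing
frames_per_board = frames_per_pe * PES_PER_BOARD

print(sram_bits)         # 4194304
print(sram_bytes)        # 524288
print(frames_per_pe)     # 2
print(pixels_per_word)   # 2
print(frames_per_board)  # 32
```

Each PE's SRAM therefore holds 0.5 megabyte, or two full frames, and the 16 PEs of one board together hold 32 frames, which is consistent with the 32-frame capacity the article quotes for a single processing board.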

The image-processing platform
Figure 2 shows the VTSplash laboratory system we developed. A video camera or a VCR creates a standard RS-170 video stream. The signal produced from the camera is digitized with a custom-built frame-grabber card. This board not only captures images but also performs any needed sequencing or simple pixel operations before the data are presented to Splash-2. The frame-grabber card was built with a parallel interface that can be connected directly to the input data stream of the Splash-2 processor.

Figure 2. Components of the VTSplash laboratory system.

The VTSplash laboratory system uses two processor array boards. The output data produced by Splash-2 (which can be a real-time video data stream, image overlay data, or some other form of information) is first presented to another custom board for converting the data to an appropriate format (if necessary). Once formatted, the data are then presented to a commercial image-acquisition/display card, which presents the images to a color video monitor. The SparcStation host configures Splash-2 arrays and sends runtime commands intermixed with the video stream (if needed). The laboratory system can be quickly reconfigured from one task to another in just a few seconds by downloading the hardware personalization database.

As mentioned earlier, Splash-2 is a general-purpose machine not specifically designed for image processing. Nonetheless, it is a suitable testbed for implementing a wide range of computer vision tasks, including those that require temporal processing. One Splash-2 processing board contains slightly more than 69 megabits of memory, enough for 32 frames of image data. (This number is based on 17 256K x 16 SRAM devices plus 12,800 bits of storage (maximum) in each of the 17 Xilinx 4010 chips.¹) Not all this storage may be conveniently available to applications.

APPLICATION DEVELOPMENT ON SPLASH-2
While the programming environment for Splash-2 is one of the most advanced and automated in its class, numerous difficulties must be addressed before this type of machine can become accepted into mainstream computing. Here we provide a brief summary of the rapid prototype-development process from the formulation of task behavior to the generation of a physical database ready for execution. We then assess some challenges that need to be addressed.

Basic design flow
Figure 3 illustrates the basic design flow for developing a typical Splash-2 application. (This simplified figure does not depict all possible iteration paths in the design process.) The first step in the process is the definition of the problem. As in all hardware and software system design, a sound problem definition will facilitate the design process.

Figure 3. The application design process.

Step two is the behavioral modeling of the problem. Typically, a VTSplash programmer models the problem by using the C programming language or a behavioral VHDL model. Not only is the model constructed to comply with the problem definition, but sample images are run through the model (when possible) for later comparison with the results of the synthesized implementation.

The next step, which is often difficult, consists of manually partitioning the model into a form suitable for final implementation on Splash-2. The model is first mapped onto processor boards and then partitioned more finely into individual PEs. The three main factors that drive a partition are time, area, and communication complexity.

The time and area factors are familiar problems discussed in the high-level synthesis and silicon compiler literature.² Time refers to how much computation is desired per clock cycle. Area refers to how much of the reconfigurable resources should be allocated to a given computation, to the total available reconfigurable resources within each processor board, and to each of the 17 PEs on each board. Even though Splash-2 contains ample hardware support to aid signal propagation between PEs, not all communications are equal in cost and in bandwidth (communication complexity). Splash-2 imposes limitations on available communication resources. Some of these are

• a total of 108 signals split equally between the left neighbor, right neighbor, and crossbar network;
• a 16-bit data path between a PE and its 0.5-megabyte RAM;
• several 1-bit signals for global communications and broadcasts; and
• a 36-bit data path between processing boards, along with several 1-bit global signals.

(These numbers are simplified somewhat for the sake of

this discussion. The actual communication structure is more intricate. Refer to Arnold, Buell, and Davis¹ for more detail.) Although these numbers may appear to be quite generous, the limits of these data paths will eventually disgruntle some designers. Not all applications easily map to these limitations, and tough design trade-offs must be considered. As it stands, few quantitative up-front measures are available to gauge partitioned alternatives. A designer must often wait until after the synthesis step before knowing whether a given problem partition is feasible.

Detailed structural design
After the problem is partitioned, the designer produces and verifies a detailed structural design. Many alternatives are available to designers for converting the structural representation into a hardware configuration database, including FPGA design tools like XBLOX.³ However, the best-supported design environment for Splash-2 contains the Synopsys VHDL simulation and synthesis tools. The simulations for many of the image-processing tasks we discuss consumed several days of CPU time per run on a SparcStation-10, in many cases for just a small fraction of an image. Because of this, only so much simulation can occur within a reasonable amount of time. Therefore, the stimulation input for a simulation run must be considered judiciously.

Debugging
Because simulations of the applications under development are based on VHDL models created prior to placement and routing of the FPGAs, they are barren of signal propagation annotation. Actual propagation delays in the Xilinx FPGAs are sensitive to the outcome of the placement and routing process, and these delays can have a disturbing effect on application behavior. To counter these problems, and to cope with the limited functional coverage that can be achieved with the simulation tools, a powerful debugging tool is available in the Splash-2 environment. The T2 interactive debugger¹ provides the power of conventional high-level language debuggers by allowing such features as monitoring internal state variables and tracing. Debugging a hardware/software codesign adds conceptual difficulties not found in traditional debugging environments. After the image operations are performing satisfactorily, they must be integrated within the body of an application. A rich C library to facilitate communication between host programs and attached processors is accessible within the Splash-2 environment.

Reducing development time
Although Splash-2 represents the state-of-the-art in custom computing processors (both in hardware capabilities and software support), it requires a substantial time investment to develop an application. To make this class of machinery more widely accepted and cost-effective, methods must be developed to reduce application-development time. Several promising endeavors focus on this issue. Their main emphasis is depicted by the portions of the gray shaded region previously shown in Figure 3.

IMAGE-PROCESSING TASKS
Image operations have been classified into five generic classes. An operation in the combination class takes two images and produces a new image of the same type. This is accomplished by combining each pair of elements from the input images into a new element. The transformation class accepts an image from a given class and produces a new image in the same class. The measurement class reduces an image of a given type into a scalar or vector. The conversion class refers to those operations that take an image of a given type and convert it into a new class. (The generation class, which produces a new image from scratch, concerns image synthesis rather than image processing and so is not considered here.)

Table 1. A representative list of image-processing categories and example tasks.

Class | Example image task | Description
Transformation | Convolution | Linear filtering operation
Transformation | Median filtering | Nonlinear filter used to eliminate "salt and pepper" noise
Transformation | Morphological filtering | Nonlinear operations that alter region shapes in an image. Gray-scale erosion and dilation operations implemented.
Combination | Laplacian pyramid generation | Produces an image hierarchy of decreasing image size and spatial resolution. The image for each pyramid level is formed by taking the difference of two blurred versions of the original image.
Measurement | Histogram generation | Statistical operation for computing intensity distribution of pixels in an image
Conversion | Fast Fourier transform | Converts an image from the spatial domain to the frequency domain
Conversion | Hough transform | A voting scheme that detects the presence of lines (or parametric curves) from a set of points in an image
Conversion | Region detection and labeling | Finds connected regions in an image and assigns a unique label to each

To evaluate the agility of the VTSplash system, we modeled examples from each of the first four categories (with

[...]ing techniques are very common in image processing. The most common methods process small neighborhoods in an input image to generate new pixels in an output image. Neighborhood-based filtering is characterized by the repeated application of identical operations and often serves as a preprocessing step followed by higher-level image analysis.

Figure 4. Example filtering operations: (a) original image; (b) smoothed image created by applying a low-pass filter to the original image; and (c) edge image created by applying a simple high-pass filter. All images are processed as 512 x 512 pixels in size. The output images were obtained by using 8 x 8 templates on VTSplash.

Neighborhood operations typically use a 2D template, usually rectangular, which is applied at every pixel in the input image. (The template is often called an operator or filter.) In the linear case, the hardware applies a template by centering it at a given pixel of the input image, multiplying each template pixel by the associated underlying image pixel, and summing the resulting products. The sum becomes the pixel value (for this template position) in the output image. Each new template position generates a single new output pixel value. (Special rules may be needed for pixels near the image borders.) Algebraically, this is represented as

$$I_{out}(r, c) = \sum_{i} \sum_{j} h(i, j) \, I_{in}(r + i, c + j)$$

where $I_{in}$ is the input image, $I_{out}$ is the output image, $h$ is the filter, and $r$ and $c$ refer to the row and column location in the images. The summation is typically performed over a small window, often 3 x 3 or 7 x 7, as determined by $h$. Figure 4 shows examples.

Template operations can also be nonlinear. For example, designers can implement a median filter by using a template. For every position of the template, the hardware system chooses the median value from the image pixels covered by the template and uses it as the new pixel value for the output image. In this case, the template simply serves as a window and has no cell values. Another form of nonlinear image processing is based on mathematical morphology. This algebra uses multiplication, addition (subtraction), and maximum (minimum) operations to produce output pixels. The filtering operations, known as erosion and dilation, can be used to perform such tasks as low- or high-pass filtering and feature detection. This approach provides less blurring than linear filtering.

Image-combination operations
After an image has been appropriately low-pass filtered, it can be subsampled without fear of violating the Nyquist criterion. If an image is recursively filtered and subsampled, the resulting set of images can be considered a single unit called a Gaussian pyramid. An image-processing system can use this data structure to reduce computational requirements by employing the lower-resolution portion of the pyramid to guide processing at higher-resolution levels. For some tasks (such as surveillance and road following), this approach can greatly reduce the overall amount of processing. (Burt and Adelson¹⁰ provide a popular technique for generating these pyramids.)

In addition to low-pass pyramids, a system can generate band-pass (or Laplacian) pyramids, in which each level of the pyramid contains information from a single frequency band. VTSplash can process either type of pyramid by dynamically reconfiguring data paths through the crossbar. Gaussian and Laplacian pyramids are produced at 30 per second and 15 per second, respectively.

Measurement computations
Unlike the other processing classes, measurement operations typically do not produce a new output image. Instead, the goal is to extract descriptive statistics of the input image. For example, the mean and standard deviation of pixel values in the image are often of interest. These and similar statistics can be computed by using simple multiply-accumulate processing, where one such operation is required for each input pixel.

Real-time histogram generation, another useful operation, often constitutes an initial step for other applications, such as region detection and region labeling. In generating a histogram, the processor must maintain and update a 1D array that records the number of occurrences of particular pixel values. Histograms are often further analyzed and used to adjust parameters for image enhancement.

Image-conversion operations
The 2D discrete Fourier transform (DFT) is a useful operation in signal-processing applications but is often avoided because of its large computational requirements. Although linear, it differs from the neighborhood operations described above, since every transformed output pixel depends on every pixel of the input image. The problem can be simplified somewhat, since the 2D Fourier transform can be decomposed into multiple 1D fast Fourier transforms (FFTs). For example, a 512 x 512 DFT can be implemented as 512 1D FFT computations (one for each row) followed by 512 additional 1D FFTs (one per column). We have implemented this on VTSplash using floating-point arithmetic.

The Hough transform, another 2D transformation, can be used after an edge-detection task to determine if a set of points lie on a straight line.¹¹ (The generalized Hough transform can search for other shapes in an image.) Each boundary point in the original image specifies a curve in the transform space. The coordinates of high-intensity points in the transform domain correspond to the position and orientation of best-fit lines in the original image.

Figure 5. Examples of the communications structure and partitioning for four examples that use only one Splash-2 processor array: (a) median filter, (b) region detection and labeling, (c) FFT (forward transform), and (d) Hough transform. Solid squares at PE sites denote unused PEs.

Figure 6. Approximate performance of image-processing tasks. (Horizontal axis: millions of operations per second, 0 to 700. Tasks shown: histogram, region labeling, median filter, Hough transform, pyramid generation, morphological operations, 8 x 8 convolution, and 2D FFT (floating point).)

Region detection and labeling is a common operation for automated visual-inspection tasks. The purpose is to identify connected regions in an image and assign a unique label to each region. This is not trivial when objects in the image are nonconvex (such as gears, blood cells, or alphanumeric characters). When receiving image data in raster-scan order, designers may not know if two or more regions, disjoint at the top of the image, connect later at the bottom of the image. Typical solutions to region labeling often require a second pass over the image to merge connected regions. Once this is done, unique labels can be assigned to each region.
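The two-pass idea described above can be sketched in software as follows. This is a minimal illustration of a classical two-pass, 4-connected labeling scheme with a union-find equivalence table, written for clarity; it is not the VTSplash hardware design, and the function name `label_regions` and the binary list-of-rows image format are assumptions made for this example.

```python
def label_regions(image):
    """Two-pass, 4-connected labeling of a binary image (rows of 0/1)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k]: union-find parent of provisional label k

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Pass 1: raster scan; assign provisional labels, note equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(k for k in (up, left) if k)
                if up and left:
                    union(up, left)             # two regions just met

    # Pass 2: merge equivalent labels into compact final labels.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels, len(final)


# A U-shaped (nonconvex) object: two arms that only join in the last row.
u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
u_labels, u_count = label_regions(u_shape)
print(u_count)   # 1
print(u_labels)  # [[1, 0, 1], [1, 0, 1], [1, 1, 1]]
```

In the U-shaped example, the two arms receive different provisional labels during the first raster pass, and only the second pass can merge them once the connecting bottom row has been seen. This is exactly the situation the text describes for nonconvex objects.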

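The neighborhood operations discussed in this section (linear templates and the 3 x 3 median filter) can be modeled in software on a raster-order pixel stream. The sketch below buffers only two rows plus three pixels, loosely imitating memory used as a delay line, so the first output appears before three full rows have arrived. It is a behavioral illustration only: the names `stream_3x3`, `median_filter`, and `linear_filter` are hypothetical, border pixels are simply skipped, and none of this reflects the actual VTSplash circuits.

```python
from collections import deque
from statistics import median


def stream_3x3(pixels, width):
    """Yield (row, col, window) for each interior pixel of a raster stream.

    `pixels` is an iterable of values in raster order; `window` lists the
    nine neighborhood values, top row first. Only 2 * width + 3 pixels are
    held at any time.
    """
    buf = deque(maxlen=2 * width + 3)
    for index, value in enumerate(pixels):
        buf.append(value)
        r, c = divmod(index, width)
        if r >= 2 and c >= 2:  # a complete 3 x 3 window is buffered
            window = [buf[k * width + j] for k in range(3) for j in range(3)]
            yield r - 1, c - 1, window  # window is centered one row/col back


def median_filter(pixels, width):
    """3 x 3 median of each interior pixel, keyed by (row, col)."""
    return {(r, c): median(w) for r, c, w in stream_3x3(pixels, width)}


def linear_filter(pixels, width, h):
    """Weighted sum with template h (nine weights, top row first)."""
    return {(r, c): sum(a * b for a, b in zip(h, w))
            for r, c, w in stream_3x3(pixels, width)}


# 4 x 4 test image with one bright "salt" pixel at (1, 1).
image = [[10, 10, 10, 10],
         [10, 255, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
stream = [p for row in image for p in row]

med = median_filter(stream, width=4)
box = linear_filter(stream, width=4, h=[1] * 9)
print(med[(1, 1)])  # 10
print(box[(1, 1)])  # 335
```

The median output removes the isolated bright pixel entirely, whereas the unnormalized box sum spreads it over every window that covers it. The same window generator serves both filters, which mirrors the observation that neighborhood operations differ only in what is computed from the nine pixels.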
PERFORMANCE RESULTS
Computational properties, communications architectures, and required resources vary significantly from one application to the next. All the examples described here operate at the pixel clock rate of 10 MHz with 512 x 512 images. Many of the applications presented here have been implemented using a pipeline architecture. The pipeline accepts digitized image data in raster order, often directly from a camera, and produces output data at the same rate, with some latency. Many of these applications can be chained together to form higher level image-processing functions.

Prototype architectures
Figure 5 shows simplified block diagrams that illustrate the partitioning and communication architecture for representative tasks. For example, Figure 5a shows the architecture for the 3 x 3 median filter. This pipelined implementation produces output pixels at the same rate that input pixels are received, with a latency of less than the time required to receive three image rows. This requires the simultaneous access of nine neighboring pixels (produced by the gray shaded blocks labeled Row stack), which are presented to a parallel sorter in the gray shaded block labeled Parallel sort. The median value of the sorted list is then presented to the Format output block, which assembles the data for subsequent display on the monitor.

Performance evaluation
Conventional performance-benchmarking techniques are at best awkward when applied to custom computing machinery. Figure 6 graphically illustrates the computational performance of each of these tasks executing on the VTSplash platform. In this figure, the application name appears vertically to the left of the graph. The performance bar associated with each task consists of two or three components. The first component (arithmetic/logical) is an appraisal of the number of general-purpose operations performed on average per second. (These operations are likely to be found in the repertory of common RISC processors, such as Multiply, Xor, or Compare.) This number, when divided by the pixel clock frequency of 10 MHz, indicates [...] logic function units (word parallel) active in each task. The second component of the performance bar estimates the number of storage references (memory accesses) performed by the task per second. The third component represents the number of floating-point operations per second. All tasks, except for the 2D FFT application, use fixed-point operations. The pixel calculations for the 2D FFT task use custom-designed floating-point arithmetic. When combined, these three components provide a basis for quantifying the computational load of each task, as well as a rough estimate of the number of operations performed each second.

With VTSplash, the operating speed for an application is under the designer’s control and depends upon critical path delays in the implementation. The Splash-2 processor features a programmable system clock that can be varied under software control from zero to 40 MHz. We developed the tasks to satisfy the minimum criterion of operating at the pixel data rate of 10 MHz. The designs were verified at this rate only, although some of these tasks may operate well beyond this clock frequency.

Processing rates
In addition to quantifying the number of operations per second, it is useful to consider how fast computations are performed relative to the 30-Hz frame rate of the input image. Some tasks (histogramming, median filtering, region labeling, Gaussian pyramid generation, and gray-scale morphological operations) are completed during one frame time. Others (8 x 8 convolution and Laplacian pyramid generation) require two image frames. The floating-point FFT implementation can completely process two 512 x 512 images per second. The time necessary to complete the Hough transform depends on the complexity of the image; the implementation shown in Figure 5d distributes equal portions of an input image to separate PEs that process in parallel.

Comparisons
Another method of benchmarking is to compare VTSplash operation with that of contemporary machines. We chose a general-purpose workstation (the Sun
the average number of the easilydiscerniblearithmetic and SparcStation- 10). VTSplash applications run between 10

to 100 times faster than the same application written in C and executed on
the SparcStation. Numerous commercial machines have been designed
specifically for image processing. The Datacube MaxVideo 200 [12] consists
of several functional units carefully tuned to perform common
image-processing tasks. For applications that are suited for its specific
architecture, the MaxVideo 200 outperforms the VTSplash system. For
example, the MaxVideo 200 can perform 8 x 8 convolution four times faster
than our current VTSplash design. The motivation of the custom computing
approach, therefore, is not to provide the fastest possible performance for
a given task. As illustrated by VTSplash, the strength of this approach is
a system that can be rapidly reconfigured to provide high performance for a
wide range of tasks. The performance of application-specific systems
diminishes quickly for tasks not directly supported in hardware.

RECONFIGURABLE COMPUTING PLATFORMS, such as Splash-2, can readily adapt to
meet the communication and computational requirements of a wide variety of
applications. By adding I/O hardware, we have demonstrated that
general-purpose custom computing machines are well suited for many
meaningful image-processing tasks. Such platforms are excellent testbeds
for prototyping high-performance algorithms. The custom computing platform
can also serve as

a medium for hardware/software codesign, and
a VHDL accelerator.

Our work on VTSplash continues in the area of high-level-language
compilation for custom computing machines. We are investigating
architectural enhancements for broadening the scope of tasks suitable for
these machines and for streamlining automated partitioning and scheduling.
Application development continues for image processing as well as for other
problem domains including communications.

Acknowledgments
We are indebted to the graduate students who contributed to the VTSplash
program. These include application developers Luna Chen, Robert Elliott,
James Peterson, Ramana Rachakonda, Nabeel Shirazi, Adit Tarmaster, and Al
Walters, along with the VTSplash hardware/software team of Brad Fross and
Jeff Nevits. In addition, we appreciate the support and guidance of Jeffrey
Arnold and Duncan Buell from the Supercomputing Research Center. This work
was supported by a grant from the Institute for Defense Analyses.

References
1. J.M. Arnold, D.A. Buell, and E.G. Davis, “Splash 2,” in Proc. Fourth
Ann. ACM Symp. Parallel Algorithms and Architectures, ACM, New York, 1992,
pp. 316-322.
2. D. Gajski, Silicon Compilation, Addison-Wesley, Reading, Mass., 1988.
3. The Programmable Gate Array Data Book, Xilinx Inc., San Jose, Calif.,
1994.
4. M. Gokhale and R. Minnich, “FPGA Computing in a Data-Parallel C,” in
Proc. IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los
Alamitos, Calif., Order No. 93TH0535-5, 1993, pp. 94-101.
5. P. Athanas and H. Silverman, “Processor Reconfiguration through
Instruction-Set Metamorphosis: Architecture and Compiler,” Computer,
Vol. 26, No. 3, Mar. 1993, pp. 11-18.
6. L. Agarwal and M. Wazlowski, “An Asynchronous Approach to Efficient
Execution of Programs on Adaptive Architectures Utilizing FPGAs,” Proc.
IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los Alamitos,
Calif., Order No. 5490-02U, 1994, pp. 111-119.
7. R.C. Vogt, Automatic Morphological Set Recognition Algorithms,
Springer-Verlag, New York, 1989.
8. B. Jähne, Digital Image Processing, Springer-Verlag, New York, 1991.
9. A.L. Abbott, R.M. Haralick, and X. Zhuang, “Pipeline Architectures for
Morphologic Image Analysis,” Machine Vision and Applications, Vol. 1,
No. 1, 1988, pp. 23-40.
10. P.J. Burt and E.H. Adelson, “The Laplacian Pyramid as a Compact Image
Code,” IEEE Trans. Comm., Vol. COM-31, No. 4, Apr. 1983, pp. 532-540.
11. L. Abbott et al., “Finding Lines and Building Pyramids with Splash-2,”
Proc. IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los
Alamitos, Calif., Order No. 5490-02U, 1994, pp. 155-163.
12. The MaxVideo 200 Reference Manual, Datacube Inc., Danvers, Mass., 1994.

Peter M. Athanas is an assistant professor in the Bradley Department of
Electrical Engineering at Virginia Polytechnic Institute and State
University in Blacksburg, Virginia. He was also a senior design engineer in
the Advanced Technologies Group for United Technologies Hamilton Standard
in Windsor Locks, Connecticut. His research interests include custom
computing machinery, logic synthesis, VLSI technology, and parallel
processing. Athanas received a BS degree in electrical engineering from the
University of Toledo, an MS degree in electrical engineering from
Rensselaer Polytechnic Institute, and an ScM degree in applied mathematics
and a PhD in electrical engineering from Brown University. He is a member
of the IEEE Computer Society.

A. Lynn Abbott is an assistant professor in the Bradley Department of
Electrical Engineering at Virginia Polytechnic Institute and State
University. From 1980 to 1985, he was a member of the technical staff at
AT&T Bell Laboratories in Holmdel, NJ. His research interests include
computer vision, high-performance architectures for image processing, and
artificial intelligence. Abbott received a BS from Rutgers University in
1980, an MS from Stanford University in 1981, and a PhD from the University
of Illinois at Urbana-Champaign in 1990 (all in electrical engineering). He
is a member of the IEEE Computer Society.

Readers can contact the authors at the Bradley Department of Electrical
Engineering, Virginia Polytechnic Institute and State University,
Blacksburg, VA 24061-0111.
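
The figures quoted under Performance evaluation and Processing rates can be
checked with a little arithmetic. The sketch below is not from the article:
the function names are hypothetical, the 390 million operations per second
used in the example is an invented load rather than a value read from
Figure 6, and video blanking intervals are ignored. It assumes one pixel is
consumed per clock cycle, as the pipeline description suggests.

```python
# Figures taken from the article text.
PIXEL_CLOCK_HZ = 10_000_000  # pixel clock: 10 MHz
IMAGE_WIDTH = 512            # pixels per row
IMAGE_HEIGHT = 512           # rows per image
FRAME_RATE_HZ = 30           # live video frame rate


def frame_stream_time(width=IMAGE_WIDTH, height=IMAGE_HEIGHT,
                      clock_hz=PIXEL_CLOCK_HZ):
    """Seconds to stream one image at one pixel per clock (blanking ignored)."""
    return width * height / clock_hz


def ops_per_pixel(ops_per_second, clock_hz=PIXEL_CLOCK_HZ):
    """Average operations per pixel: operations per second / pixel clock."""
    return ops_per_second / clock_hz


def frame_times(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Express a duration as a multiple of the video frame period."""
    return seconds * frame_rate_hz


if __name__ == "__main__":
    t = frame_stream_time()
    print(f"stream time per image: {t * 1e3:.1f} ms")             # 26.2 ms
    print(f"frame period:          {1e3 / FRAME_RATE_HZ:.1f} ms")  # 33.3 ms
    # Hypothetical load, NOT a value read from Figure 6.
    print(f"ops per pixel:         {ops_per_pixel(390e6):.0f}")
    # Two images per second (the FFT rate quoted in the text) is 0.5 s each.
    print(f"FFT time per image:    {frame_times(0.5):.0f} frame times")
```

Under these assumptions a 512 x 512 image streams through a 10-MHz pipeline
in about 26.2 ms, which fits inside the 33.3-ms frame period. This is
consistent with the single-pass tasks completing in one frame time, while
the FFT rate of two images per second corresponds to 15 frame times per
image.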

Computer
