Processing on a
Custom Computing
Peter M. Athanas
A. Lynn Abbott
Virginia Polytechnic Institute and State University
Several aspects of image processing make it computationally challenging. A single image represents a data set of considerable size, typically 256K picture elements, or pixels, for a black-and-white image. Many tasks require that several operations be performed on each pixel in the image. Furthermore, when real-time operations are needed, they must be performed at live video rates, typically 30 images per second. To keep up with these capacious data rates and demanding computations in real time, the processing engine must provide specialized data paths, application-specific operators, creative data management, and careful sequencing and pipelining.
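To put these figures in perspective, the arithmetic can be sketched as follows. The image size and frame rate come from the text; the 8-bit pixel depth and the ten operations per pixel are assumptions chosen only for illustration.

```python
# Back-of-the-envelope data rates for real-time image processing.
# From the text: 256K pixels per image, 30 images per second.
# Assumed for illustration: 8 bits per pixel, 10 operations per pixel.

PIXELS_PER_IMAGE = 256 * 1024   # "256K picture elements", e.g. 512 x 512
FRAMES_PER_SECOND = 30          # live video rate quoted in the text
BITS_PER_PIXEL = 8              # assumption: 8-bit gray-scale pixels
OPS_PER_PIXEL = 10              # assumption: a modest per-pixel workload


def pixel_rate(pixels=PIXELS_PER_IMAGE, fps=FRAMES_PER_SECOND):
    """Pixels that must be consumed every second."""
    return pixels * fps


def bit_rate(bits_per_pixel=BITS_PER_PIXEL):
    """Raw input bandwidth in bits per second."""
    return pixel_rate() * bits_per_pixel


def op_rate(ops_per_pixel=OPS_PER_PIXEL):
    """Operations per second needed to keep up with the video stream."""
    return pixel_rate() * ops_per_pixel


def ns_per_pixel():
    """Average time budget per pixel, in nanoseconds."""
    return 1e9 / pixel_rate()


if __name__ == "__main__":
    print(f"pixel rate : {pixel_rate():,} pixels/s")
    print(f"bit rate   : {bit_rate():,} bits/s")
    print(f"op rate    : {op_rate():,} ops/s")
    print(f"time budget: {ns_per_pixel():.1f} ns per pixel")
```

Under these assumptions the engine must accept roughly 7.9 million pixels per second, which leaves about 127 ns per pixel on average.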
Hardware designers typically must perform extensive behavioral test-
ing of a new concept before proceeding with an implementation. Due to
the enormous processing time required to simulate a complex image-
processing system, executing a VHDL model with a representative data
set even on a fast workstation is not practical. Days, or even weeks, are
commonly needed to simulate the processing of a single full-sized image.
And since some applications process sequences of images, designers may
need several hundred image simulations to adequately analyze only a few
seconds of data. Because of this, they are often forced into a trade-off
between how much testing can be afforded versus an acceptable risk in
allowing a silicon iteration.
We discuss an alternative, automated approach: transforming the struc-
tural representation (or transforming a behavioral model) into a real-time
implementation. With our system, a designer can proceed from a behav-
ioral description of the image-processing task to a functioning prototype
that can perform the task at full speed (rapid prototyping). Reconfigura-
tion from one image-processing task to another does not require physical
changes but is accomplished by downloading a hardware personalization
database to a novel computing platform. Reconfiguration takes just sec-
onds. A designer with this capability has
We chose an experimental custom computing platform called Splash-2 to investigate this approach to prototyping real-time image-processing designs.¹ Custom computing platforms are emerging as a class of computers that can provide near application-specific computational performance. Designers can also configure them for a variety of tasks. Such platforms let designers customize specific operations for function and size, and data paths for individual applications.
machinery for accelerating the development, testing, and prototyping of a diverse set of image-processing applications.
February 1995
megabits. This not only presents a computational challenge but also poses storage problems. Some image-related tasks, such as compression, tracking, and motion compensation, need information distributed in a single image (spatially distributed data) and also depend on data present in previous frames (temporally distributed data). Because of this, the processor must store and retrieve multiple frames quickly.
For simplicity, we assume that data represent monochrome light intensity values. Of course, pixel data are not restricted to represent only quantized brightness information. The applications discussed here are also relevant to other types of image data, such as
• color images;
• range images, for which each pixel represents a distance value;
• X-ray, ultrasound, or electron-microscope images, where each pixel depends on object density or other physical phenomena; and
• CT (computer tomography) images, for which each 2D (two-dimensional) image represents a reconstructed “slice” of density information within a 3D array.
CUSTOM COMPUTING HARDWARE
Here we review the properties of custom computing machines, using the Splash-2 platform as an example. We then show how the custom computing machine can be used as a component in a real-time processing system.
Each PE within the Splash-2 array board (identified as X1 through X16 in Figure 1a and expanded in Figure 1b) consists of one FPGA and one fast static memory. The Xilinx XC4010 FPGA used in each PE consists of a 2D array of configurable logic blocks that can be connected internally with reconfigurable interconnection resources. Both the logic blocks and the interconnection resources are programmable through the host computer. Computational operations are implemented as logic circuits constructed within the FPGAs by partitioning the operation into individual blocks and then interconnecting them as required with the programmable switches. A fast 256K x 16 static RAM (SRAM) is attached to each FPGA, which allows one read or write access per clock cycle. Each PE has three 36-bit bidirectional data paths: one each to the left and right neighboring PEs and one to the crossbar switch. In addition, a 16-bit path exists between the FPGA and its SRAM. Several 1-bit signals support broadcasts, handshakes, and other special functions.
Figure 1. The Splash-2 architecture: (a) Each board contains 16 PEs (and one control processor) interconnected by neighbors and through a crossbar switch. Up to 15 processor array boards can be connected to an interface board; (b) Each PE consists of an FPGA and an SRAM.
The input data stream to the Splash-2 processor array is provided by the interface board with a 36-bit SIMD bus to the X0 of each processor array board (and to the X1 of the first processor array board). The output data stream can be linked to multiple array boards by extending this stream from the X16 of one board to the X1 of the next. The output stream produced from the last array board is returned to the interface board. The control paths between the Sun SparcStation-2 host and the application program running on Splash-2 consist of a set of handshake registers (two on each Splash-2 array board), a global AND/OR mechanism, a broadcast signal, direct access to on-board memory, and an interrupt mechanism.
The crossbar network contains 16 36-bit bidirectional ports for augmenting interprocessor communications. Splash's crossbar switches can be used for both static and dynamic architecture adjustments. Static adjustments establish the data paths for fixed systolic-like tasks, while dynamic adjustments accommodate more complex data-movement paradigms. The control element X0 selects the interconnection structure used in any given clock cycle. The crossbar allows point-to-point, multicast, and broadcast communication between all PEs on each processing board and can readily change the topology to a mesh, linear array, hypercube, or other custom configuration. (Arnold, Buell, and Davis¹ provide more information on Splash-2.)
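As an illustration of the topologies mentioned above, the following sketch builds neighbor maps for 16 processing elements arranged as a linear array, a 4 x 4 mesh, and a four-dimensional hypercube. It is not Splash-2 software: the PE numbering (0 to 15) and all function names are assumptions made for this example, and it models only which PEs would be logically adjacent, not how the crossbar is programmed.

```python
# Illustrative neighbor maps for three 16-node topologies.
# Hypothetical helper names; PEs are numbered 0..15 for convenience.

NUM_PES = 16


def linear_array(n=NUM_PES):
    """Each PE is adjacent to its left and right neighbor (no wraparound)."""
    return {p: [q for q in (p - 1, p + 1) if 0 <= q < n] for p in range(n)}


def mesh(rows=4, cols=4):
    """Rectangular mesh: neighbors to the north, south, west, and east."""
    links = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = []
            if r > 0:
                nbrs.append((r - 1) * cols + c)
            if r < rows - 1:
                nbrs.append((r + 1) * cols + c)
            if c > 0:
                nbrs.append(r * cols + c - 1)
            if c < cols - 1:
                nbrs.append(r * cols + c + 1)
            links[r * cols + c] = nbrs
    return links


def hypercube(dim=4):
    """2**dim PEs; two PEs are adjacent when their addresses differ in one bit."""
    return {p: [p ^ (1 << b) for b in range(dim)] for p in range(1 << dim)}


def link_count(topology):
    """Number of undirected links in a neighbor map."""
    return sum(len(nbrs) for nbrs in topology.values()) // 2


if __name__ == "__main__":
    for name, topo in (("linear array", linear_array()),
                       ("4 x 4 mesh", mesh()),
                       ("hypercube", hypercube())):
        print(f"{name:12s}: {link_count(topo)} links")
```

The link counts (15, 24, and 32) give a rough sense of how much more connectivity the mesh and hypercube demand than the fixed left/right neighbor paths alone provide.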
Computer
The image-processing platform
Figure 2 shows the VTSplash laboratory system we developed. A video camera or a VCR creates a standard RS-170 video stream. The signal produced from the camera is digitized with a custom-built frame-grabber card. This board not only captures images but also performs any needed sequencing or simple pixel operations before the data are presented to Splash-2. The frame-grabber card was built with a parallel interface that can be connected directly to the input data stream of the Splash-2 processor.
Figure 2. Components of the VTSplash laboratory system.
The VTSplash laboratory system uses two processor array boards. The output data produced by Splash-2, which can be a real-time video data stream, image overlay data, or some other form of information, is first presented to another custom board for converting the data to an appropriate format (if necessary). Once formatted, the data are then presented to a commercial image-acquisition/display card, which presents the images to a color video monitor. The SparcStation host configures Splash-2 arrays and sends runtime commands intermixed with the video stream (if needed). The laboratory system can be quickly reconfigured from one task to another in just a few seconds by downloading the hardware personalization database.
As mentioned earlier, Splash-2 is a general-purpose machine not specifically designed for image processing. Nonetheless, it is a suitable testbed for implementing a wide range of computer vision tasks, including those that require temporal processing. One Splash-2 processing board contains slightly more than 69 megabits of memory, enough for 32 frames of image data. (This number is based on 17 256K x 16 SRAM devices plus 12,800 bits of storage (maximum) in each of the 17 Xilinx 4010 chips.¹) Not all this storage may be conveniently available to applications.
APPLICATION DEVELOPMENT ON SPLASH-2
While the programming environment for Splash-2 is one of the most advanced and automated in its class, numerous difficulties must be addressed before this type of machine can become accepted into mainstream computing. Here we provide a brief summary of the rapid prototype-development process from the formulation of task behavior to the generation of a physical database ready for execution. We then assess some challenges that need to be addressed.
Basic design flow
Figure 3 illustrates the basic design flow for developing a typical Splash-2 application. (This simplified figure does not depict all possible iteration paths in the design process.) The first step in the process is the definition of the problem. As in all hardware and software system design, a sound problem definition will facilitate the design process.
Figure 3. The application design process. [Flowchart; legible stages: Problem definition, Design (VHDL), Simulation.]
Step two is the behavioral modeling of the problem. Typically, a VTSplash programmer models the problem by using the C programming language or a behavioral VHDL model. Not only is the model constructed to comply with the problem definition, but sample images are run through the model (when possible) for later comparison with the results of the synthesized implementation.
The next step, which is often difficult, consists of manually partitioning the model into a form suitable for final implementation on Splash-2. The model is first mapped onto processor boards and then partitioned more finely into individual PEs. The three main factors that drive a partition are time, area, and communication complexity.
The time and area factors are familiar problems discussed in the high-level synthesis and silicon compiler literature.² Time refers to how much computation is desired per clock cycle. Area refers to how much of the reconfigurable resources should be allocated to a given computation, to the total available reconfigurable resources within each processor board, and to each of the 17 PEs on each board. Even though Splash-2 contains ample hardware support to aid signal propagation between PEs, not all communications are equal in cost and in bandwidth (communication complexity). Splash-2 imposes limitations on available communication resources. Some of these are
• a total of 108 signals split equally between the left neighbor, right neighbor, and crossbar network;
• a 16-bit data path between a PE and its 0.5-megabyte RAM;
• several 1-bit signals for global communications and broadcasts; and
• a 36-bit data path between processing boards, along with several 1-bit global signals.
(These numbers are simplified somewhat for the sake of
tion annotation. Actual propagation delays in the Xilinx FPGAs are sensitive
Table 1. A representative list of image-processing categories and example tasks.
ing techniques are very common in image processing. The most common methods process small neighborhoods in an input image to generate new pixels in an output image. Neighborhood-based filtering is characterized by the repeated application of identical operations and often serves as a preprocessing step followed by higher-level image analysis.
Neighborhood operations typically use a 2D template, usually rectangular, which is applied at every pixel in the input image. (The template is often called an operator or filter.) In the linear case, the hardware applies a template by centering it at a given pixel of the input image, multiplying each template pixel by the associated underlying image pixel, and summing the resulting products. The sum becomes the pixel value (for this template position) in the output image. Each new template position generates a single new output pixel value. (Special rules may be needed for pixels near the image borders.) Algebraically, this is represented as
$$I_{out}(r, c) = \sum_{i} \sum_{j} h(i, j)\, I_{in}(r + i,\; c + j)$$
where $I_{in}$ is the input image, $I_{out}$ is the output image, $h$ is the filter, and $r$ and $c$ refer to the row and column location in the images. The summation is typically performed over a small window, often 3 x 3 or 7 x 7, as determined by $h$. Figure 4 shows examples.
Figure 4. Example filtering operations: (a) original image; (b) smoothed image created by applying a low-pass filter to the original image; and (c) edge image created by applying a simple high-pass filter. All images are processed as 512 x 512 pixels in size. The output images were obtained by using 8 x 8 templates on VTSplash.
Template operations can also be nonlinear. For example, designers can implement a median filter by using a template. For every position of the template, the hardware system chooses the median value from the image pixels covered by the template and uses it as the new pixel value for the output image. In this case, the template simply serves as a window and has no cell values. Another form of nonlinear image processing is based on mathematical morphology. This algebra uses multiplication, addition (subtraction), and maximum (minimum) operations to produce output pixels. The filtering operations, known as erosion and dilation, can be used to perform such tasks as low- or high-pass filtering and feature detection. This approach provides less blurring than linear filtering.
Image-combination operations
After an image has been appropriately low-pass filtered, it can be subsampled without fear of violating the Nyquist criterion. If an image is recursively filtered and subsampled, the resulting set of images can be considered a single unit called a Gaussian pyramid. An image-processing system can use this data structure to reduce computational requirements by employing the lower-resolution portion of the pyramid to guide processing at higher-resolution levels. For some tasks (such as surveillance and road following), this approach can greatly reduce the overall amount of processing. (Burt and Adelson¹⁰ provide a popular technique for generating these pyramids.)
In addition to low-pass pyramids, a system can generate band-pass (or Laplacian) pyramids, in which each level of the pyramid contains information from a single frequency band. VTSplash can process either type of pyramid by dynamically reconfiguring data paths through the crossbar. Gaussian and Laplacian pyramids are produced at 30 per second and 15 per second, respectively.
Measurement computations
Unlike the other processing classes, measurement operations typically do not produce a new output image. Instead, the goal is to extract descriptive statistics of the input image. For example, the mean and standard deviation of pixel values in the image are often of interest. These and similar statistics can be computed by using simple multiply-accumulate processing, where one such operation is required for each input pixel.
Real-time histogram generation, another useful operation, often constitutes an initial step for other applications, such as region detection and region labeling. In generating a histogram, the processor must maintain and update a 1D array that records the number of occurrences of particular pixel values. Histograms are often further analyzed and used to adjust parameters for image enhancement.
Image-conversion operations
The 2D discrete Fourier transform (DFT) is a useful operation in signal-processing applications but is often avoided because of its large computational requirements. Although linear, it differs from the neighborhood operations described above, since every transformed output pixel depends on every pixel of the input image. The problem can be simplified somewhat, since the 2D Fourier
transform can be decomposed into multiple 1D fast Fourier transforms (FFTs). For example, a 512 x 512 DFT can be implemented as 512 1D FFT computations (one for each row) followed by 512 additional 1D FFTs (one per column). We have implemented this on VTSplash using floating-point arithmetic.
Figure 5. Examples of the communications structure and partitioning for four examples that use only one Splash-2 processor array: (a) median filter, (b) region detection and labeling, (c) FFT (forward transform), and (d) Hough transform. Solid squares at PE sites denote unused PEs.
The Hough transform, another 2D transformation, can be used after an edge-detection task to determine if a set of points lie on a straight line.¹¹ (The generalized Hough transform can search for other shapes in an image.) Each boundary point in the original image specifies a curve in the transform space. The coordinates of high-intensity points in the transform domain correspond to the position and orientation of best-fit lines in the original image.
Region detection and labeling is a common operation for automated visual-inspection tasks. The purpose is to identify connected regions in an image and assign a unique label to each region. This is not trivial when objects in the
image are nonconvex (such as gears, blood cells, or alphanumeric characters). When receiving image data in raster-scan order, design-
Figure 6. Approximate performance of image-processing tasks. [Bar chart; tasks shown: histogram, region labeling, median filter, pyramid generation, morphological operations, 8 x 8 convolution, and 2D FFT (floating point). Horizontal axis: millions of operations per second, 0 to 700.]
nect later at the bottom of the image. Typical solutions to region labeling often require a second pass over the image to merge connected regions. Once this is done, unique labels can be assigned to each region.
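The two-pass idea described above can be sketched in software as follows. This is a generic textbook-style formulation (raster scan with a union-find table of label equivalences, 4-connectivity), offered only as an illustration; it is not the VTSplash hardware design, and the function name and the list-of-rows image format are assumptions.

```python
def label_regions(image):
    """Two-pass, 4-connected labeling of a binary image.

    `image` is a list of equal-length rows containing 0 (background) or
    1 (object). Returns a same-sized grid of labels: 0 for background and
    1, 2, 3, ... for the connected regions.
    """
    parent = [0]  # parent[k]: union-find parent of provisional label k

    def find(k):
        # Follow parent links to the representative label (path halving).
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    def union(a, b):
        # Record that provisional labels a and b belong to one region.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]

    # Pass 1: raster scan. Assign provisional labels from the pixel above
    # and the pixel to the left, noting equivalences when both are set.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up and left:
                labels[r][c] = min(up, left)
                union(up, left)
            elif up or left:
                labels[r][c] = up or left
            else:
                parent.append(len(parent))      # create a new provisional label
                labels[r][c] = len(parent) - 1

    # Pass 2: replace every provisional label with a compact final label.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels


if __name__ == "__main__":
    u_shape = [[1, 0, 1],
               [1, 0, 1],
               [1, 1, 1]]
    for row in label_regions(u_shape):
        print(row)
```

In the U-shaped example the two arms receive different provisional labels during the first pass because they only join in the last row; the second pass merges them so that every object pixel carries the same final label.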
PERFORMANCE RESULTS
Computational properties, communications architectures, and required resources vary significantly from one application to the next. All the examples described here operate at the pixel clock rate of 10 MHz with 512 x 512 images. Many of the applications presented here have been implemented using a pipeline architecture. The pipeline accepts digitized image data in raster order, often directly from a camera, and produces output data at the same rate, with some latency. Many of these applications can be chained together to form higher level image-processing functions.
Prototype architectures
Figure 5 shows simplified block diagrams that illustrate the partitioning and communication architecture for representative tasks. For example, Figure 5a shows the architecture for the 3 x 3 median filter. This pipelined implementation produces output pixels at the same rate that input pixels are received, with a latency of less than the time required to receive three image rows. This requires the simultaneous access of nine neighboring pixels (produced by the gray shaded blocks labeled Rowstack), which are presented to a parallel sorter in the gray shaded block labeled Parallel sort. The median value of the sorted list is then presented to the Format output block, which assembles the data for subsequent display on the monitor.
Performance evaluation
Conventional performance-benchmarking techniques are at best awkward when applied to custom computing machinery. Figure 6 graphically illustrates the computational performance of each of these tasks executing on the VTSplash platform. In this figure, the application name appears vertically to the left of the graph. The performance bar associated with each task consists of two or three components. The first component (arithmetic/logical) is an appraisal of the number of general-purpose operations performed on average per second. (These operations are likely to be found in the repertory of common RISC processors, such as Multiply, Xor, or Compare.) This number, when divided by the pixel clock frequency of 10 MHz, indicates the average number of the easily discernible arithmetic and logic function units (word parallel) active in each task. The second component of the performance bar estimates the number of storage references (memory accesses) performed by the task per second. The third component represents the number of floating-point operations per second. All tasks, except for the 2D FFT application, use fixed-point operations. The pixel calculations for the 2D FFT task use custom-designed floating-point arithmetic. When combined, these three components provide a basis for quantifying the computational load of each task, as well as a rough estimate of the number of operations performed each second.
With VTSplash, the operating speed for an application is under the designer's control and depends upon critical path delays in the implementation. The Splash-2 processor features a programmable system clock that can be varied under software control from zero to 40 MHz. We developed the tasks to satisfy the minimum criterion of operating at the pixel data rate of 10 MHz. The designs were verified at this rate only, although some of these tasks may operate well beyond this clock frequency.
Processing rates
In addition to quantifying the number of operations per second, it is useful to consider how fast computations are performed relative to the 30-Hz frame rate of the input image. Some tasks (histogramming, median filtering, region labeling, Gaussian pyramid generation, and gray-scale morphological operations) are completed during one frame time. Others (8 x 8 convolution and Laplacian pyramid generation) require two image frames. The floating-point FFT implementation can completely process two 512 x 512 images per second. The time necessary to complete the Hough transform depends on the complexity of the image; the implementation shown in Figure 5d distributes equal portions of an input image to separate PEs that process in parallel.
Comparisons
Another method of benchmarking is to compare VTSplash operation with that of contemporary machines. We chose a general-purpose workstation (the Sun SparcStation-10). VTSplash applications run between 10
to 100 times faster than the same application written in C and executed on the SparcStation. Numerous commercial machines have been designed specifically for image processing. The Datacube MaxVideo 200¹² consists of several functional units carefully tuned to perform common image-processing tasks. For applications that are suited for its specific architecture, the MaxVideo 200 outperforms the VTSplash system. For example, the MaxVideo 200 can perform 8 x 8 convolution four times faster than our current VTSplash design. The motivation of the custom computing approach, therefore, is not to provide the fastest possible performance for a given task. As illustrated by VTSplash, the strength of this approach is a system that can be rapidly reconfigured to provide high performance for a wide range of tasks. The performance of application-specific systems diminishes quickly for tasks not directly supported in hardware.
RECONFIGURABLE COMPUTING PLATFORMS, such as Splash-2, can readily adapt to meet the communication and computational requirements of a wide variety of applications. By adding I/O hardware, we have demonstrated that general-purpose custom computing machines are well suited for many meaningful image-processing tasks. Such platforms are excellent testbeds for prototyping high-performance algorithms. The custom computing platform can also serve as
• a medium for hardware/software codesign, and
• a VHDL accelerator.
3. The Programmable Gate Array Data Book, Xilinx Inc., San Jose, Calif., 1994.
4. M. Gokhale and R. Minnich, “FPGA Computing in a Data-Parallel C,” in Proc. IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los Alamitos, Calif., Order No. 93TH0535-5, 1993, pp. 94-101.
5. P. Athanas and H. Silverman, “Processor Reconfiguration through Instruction-Set Metamorphosis: Architecture and Compiler,” Computer, Vol. 26, No. 3, Mar. 1993, pp. 11-18.
6. L. Agarwal and M. Wazlowski, “An Asynchronous Approach to Efficient Execution of Programs on Adaptive Architectures Utilizing FPGAs,” Proc. IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los Alamitos, Calif., Order No. 5490-02U, 1994, pp. 111-119.
7. R.C. Vogt, Automatic Morphological Set Recognition Algorithms, Springer-Verlag, New York, 1989.
8. B. Jähne, Digital Image Processing, Springer-Verlag, New York, 1991.
9. A.L. Abbott, R.M. Haralick, and X. Zhuang, “Pipeline Architectures for Morphologic Image Analysis,” Machine Vision and Applications, Vol. 1, No. 1, 1988, pp. 23-40.
10. P.J. Burt and E.H. Adelson, “The Laplacian Pyramid as a Compact Image Code,” IEEE Trans. Comm., Vol. COM-31, No. 4, Apr. 1983, pp. 532-540.
11. L. Abbott et al., “Finding Lines and Building Pyramids with Splash-2,” Proc. IEEE Workshop on FPGAs for Custom Computing, IEEE CS Press, Los Alamitos, Calif., Order No. 5490-02U, 1994, pp. 155-163.
12. The MaxVideo 200 Reference Manual, Datacube Inc., Danvers, Mass., 1994.