
A Dynamic Instruction Set Computer

Michael J. Wirthlin and Brad L. Hutchings


Dept. of Electrical and Computer Eng.
Brigham Young University
Provo, UT 84602

Abstract

A Dynamic Instruction Set Computer (DISC) has been developed that supports demand-driven modification of its instruction set. Implemented with partially reconfigurable FPGAs, DISC treats instructions as removable modules paged in and out through partial reconfiguration as demanded by the executing program. Instructions occupy FPGA resources only when needed and FPGA resources can be reused to implement an arbitrary number of performance-enhancing application-specific instructions. DISC further enhances the functional density of FPGAs by physically relocating instruction modules to available FPGA space.

1 Introduction

Developing customized stored-program processors is a convenient design technique that combines the enhanced performance of application-specific circuits with the flexibility of general-purpose programmable processors. Application-specific instruction sets, customized I/O and optimized control can substantially improve the performance of even the simplest programmable processors. FPGAs provide an excellent implementation platform for application-specific processors because of the quick development time and simplified design process. In addition, SRAM-based FPGAs provide the ability to reconfigure more than one distinct application-specific processor on a single device.

A number of general purpose processors have been developed to show the feasibility of implementing a processor architecture on an FPGA[5, 7, 17]. Several custom processors have successfully demonstrated the advantages of adding specialized hardware to general purpose processor cores. Application areas for these processors include digital audio processing[16], systems of linear equations[17], and statistical physics[12].

One limitation of building customized processors on FPGAs is the lack of hardware resources available for specialized instruction sets. A few hardware-intensive instruction modules can quickly consume all the resources of even the largest FPGAs available today. Reconfiguring an FPGA to replace idle circuitry during application execution can provide more hardware resources than are available on a one-time configured FPGA. This technique, known as run-time reconfiguration (RTR), has been shown to increase the functional density of reconfigurable FPGAs[6]. The DISC processor uses RTR to ameliorate FPGA hardware limitations and provide an essentially limitless application-specific instruction set.

Early attempts at modifying a processor instruction set involved a writable control store and generating custom micro-code for each application[14]. The PRISM project extended this idea by augmenting the instruction set of a standard RISC processor with application-specific instructions on a tightly coupled FPGA. Hardware images of these instructions are extracted and compiled from the source code transparent to the user[2]. The WASMII project discusses a more dynamic approach that involves swapping hardware compute configurations in and out of the FPGA resource as demanded by the data-flow token[9].

The DISC processor implements each instruction in the instruction set as an independent circuit module. The individual instruction modules are paged onto the hardware in a demand-driven manner as dictated by the application program. Hardware limitations are eliminated by replacing unused instruction modules with usable instructions at run-time. An application running on DISC contains source code, indicating instruction ordering, and a library of application-specific instruction circuit modules.

This paper will begin by describing the techniques used to implement DISC. These include partial reconfiguration, relocatable hardware, and the linear hardware model. The architecture of the DISC processor will be presented along with several example custom instructions. The DISC processing system, including software and hardware platform, will be described. The paper will conclude by presenting results from an algorithm implemented on DISC.

2 Partial FPGA Reconfiguration

DISC takes advantage of partial FPGA configuration to implement dynamic instruction paging. Partial reconfiguration provides the ability to configure a subsection of an FPGA while the remaining logic operates unaffected. Although all SRAM-based FPGAs can be reconfigured in-circuit, only the CAL[1], Atmel[3], and National Semiconductor[13] FPGAs support the ability to partially reconfigure hardware resources.

This work was supported by ARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor.
IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April 19-21, 1995.
Although few partially reconfigurable systems have actually been implemented, several have been proposed such as hardware multi-tasking[10], a multi-phase serial communication algorithm[11], a data acquisition system[4], and a self-reconfiguring processor[8]. In addition, caching logic to increase hardware efficiency in standard digital systems has been proposed using partially reconfigurable FPGAs[15].

DISC uses partial configuration to implement custom-instruction caching. Instruction modules are implemented as partial configurations and individually configured on DISC as demanded by the application program. Before initiating execution of a custom-instruction, DISC queries the FPGA for the presence of the custom-instruction configuration. If the custom-instruction is on the FPGA, execution is initiated. Otherwise, program execution pauses while the custom-instruction is configured on the FPGA.

As a typical program executes, custom-instructions are configured onto the FPGA until all available hardware is consumed. When all hardware is used by the custom-instructions, new custom-instruction modules may not be configured on the FPGA until enough existing hardware is removed. By replacing the oldest custom-instruction modules on the FPGA with newer modules, the FPGA serves as a cache of the most-recently used custom-instruction modules.

2.1 Example

The following assembly language source code exemplifies the use of partial configuration on DISC:

    begin:
        ;instruction INSTA operates on
        ;memory location mem1
        INSTA mem1
        INSTA mem2
        ;instruction INSTB operates on
        ;mem3 and mem2
        INSTB mem3,mem2
        ;"loopback" label defined
    loopback:
        INSTC mem3
        ;instruction CMP compares
        ;mem1 with mem3
        CMP mem1,mem3
        ;instruction JNE jumps
        ;to loopback if not equal
        JNE loopback
    continue:
        INSTD mem3
        INSTB mem2
        INSTE mem3
    end:

Once each instruction in the previous program (INSTA, INSTB, INSTC, CMP, JNE, INSTD, and INSTE) has been designed as an independent partial configuration, the source code representing the program is loaded into DISC and the processor begins execution. The sequencing of instructions on a small FPGA may execute and configure as follows:

    Operation   Instruction   Description
    Configure   INSTA         Configure INSTA on FPGA
    Execute     INSTA         Execute first INSTA
    Execute     INSTA         Execute second INSTA
    Configure   INSTB         Configure INSTB on FPGA
    Execute     INSTB         Execute first INSTB
    Configure   INSTC         Configure INSTC on FPGA
    Execute     INSTC         Execute first INSTC
    Execute     CMP           Execute CMP (always available)
    Execute     JNE           Execute JNE (always available)
    (continue looping to INSTC until JNE fails)
    Remove      INSTA         FPGA full, remove oldest module
    Configure   INSTD         Configure INSTD
    Execute     INSTD         Execute INSTD
    Execute     INSTB         Execute second INSTB
    Remove      INSTC         FPGA full, remove oldest module
    Configure   INSTE         Configure INSTE
    Execute     INSTE         Execute INSTE

In the previous example, it is assumed that the first five instructions (INSTA, INSTB, INSTC, CMP, and JNE) consume all available space on a single FPGA. Partially configuring the FPGA allows two additional instructions (INSTD and INSTE) to execute on an otherwise full FPGA.

2.2 Advantages

Partial configuration provides a number of advantages for DISC over conventional configuration methods. First, idle instruction modules can be removed to make room for other usable modules. The ability to replace instruction modules in the system at run-time allows the implementation of an instruction set much larger than is possible on a single one-time configured FPGA.

Second, configuration time is substantially reduced. Although the DISC FPGA could be completely configured every time a new instruction is needed, configuration overhead can be dramatically reduced by configuring only the requested instruction. Reducing the size of hardware to configure significantly reduces the configuration bit-stream. Configuration bit-streams for DISC instruction modules fall between 1/60 and 1/3 of the size of a complete FPGA configuration. With a significantly smaller bit-stream, the corresponding configuration time is reduced. In an environment of run-time configuration, reducing the configuration time will limit the reconfiguration overhead.

Third, system state can be saved on the FPGA during configuration. Conventional configuration techniques prevent the preservation of system state during configuration by destroying the contents of all flip-flops. Implementing DISC with conventional configuration methods would require the saving and restoring of system state (program counter, register values, etc.) every time a configuration occurs. To prevent the time-consuming process of saving and restoring
state, DISC implements a global controller that remains on the FPGA at all times.

In summary, partial configuration allows DISC to implement an essentially infinite instruction set in hardware with limited configuration and state-preserving overhead.

3 Relocatable Hardware

The ability to partially configure custom-instruction modules allows DISC to implement an important strategy: relocatable hardware. Relocatable hardware, implemented only in partially configurable FPGAs, provides the ability to relocate or make placement decisions of partial configurations at run-time. Although not essential for a general purpose processor, it is used on DISC to substantially improve run-time hardware utilization.

Sub-modules in traditional digital systems require a single fixed location in hardware because of strict global and local physical constraints. Because sub-modules in traditional systems are not paged in and out of hardware, a fixed location does not pose any problems and global optimizations can be made on the static circuitry to improve hardware utilization. In a run-time partially reconfigurable system, however, fixed locations for partial configurations can pose serious performance problems.

If DISC modules are designed for a single physical location, instructions in the library will inevitably overlap each other on the hardware. Two overlapping instructions can never operate properly on the FPGA at the same time. If two overlapping instructions are used frequently together in an application program, the configuration overhead needed to replace the instructions quickly becomes the system bottleneck. DISC removes these problems by designing each custom-instruction module for multiple locations on the FPGA.

The flexibility of multiple locations for DISC custom-instructions significantly improves run-time utilization. Instruction modules are initially configured on the FPGA as close together as possible to avoid wasted hardware between modules. Once the hardware space is full, additional instruction modules are placed in locations where older unneeded instruction modules currently lie. Relocatable hardware allows run-time constraints and conditions to dictate instruction module placement for optimal hardware utilization.

Relocatable hardware is implemented by designing custom-instruction modules around a firmly defined global context. A global context provides physical placement positions and a communication network necessary for these modules to operate correctly. The global context partitions the available hardware into an array of potential placement locations for the relocatable instruction modules. The communication network is provided at each placement location to ensure adequate communication between the global controller and the instruction modules at any location.

In order to design instruction modules that fit within the global context, all instruction modules must be physically independent from each other. The physical layout of any instruction module must have no effect on the physical layout or placement of any other module in the library.

4 Linear Hardware Space

DISC implements relocatable hardware in the form of a linear hardware model. As the name suggests, the model is based on a linear, one-dimensional hardware space. The two-dimensional grid of configurable logic cells is organized as an array of rows: location is specified by vertical location and module size is specified by module height (in rows).

The global context for the linear hardware model consists of a uniform communication network and a global controller. The communication network is constructed by running each global signal vertically across the die and spreading the global signals across the width of the die parallel to each other (see Figure 1).

[Figure 1 diagram labels: Global Controller, Linear Hardware Space, Communication Network, I/O Disabled]
Figure 1: Linear Hardware Space.

The communication network provides access to global resources for all instruction modules and performs intermodule communication. The global controller specifies the communication protocol, controls global resources (such as I/O and global state) and monitors circuit execution. The global controller and the communication network remain in the same location throughout application execution to preserve the global context.

To gain access to all global signals, sub-modules within a linear hardware space are designed horizontally, across the width of the FPGA. The modules lie perpendicular to the global communication signals for full access to all global signals regardless of their vertical placement (see Figure 2). Although all sub-modules must span the entire width of the FPGA, each module may consume an arbitrary amount of hardware by varying its height.
[Figure 2 diagram labels: Global Signals, Module placed in any vertical location, Width of FPGA]
Figure 2: Simplified Custom Instruction Module.

Relocatable circuit modules communicate as established by the global protocol and thus operate properly at any vertical location. In a run-time environment, these circuit modules can be relocated as needed to optimize the available hardware space.

5 DISC Architecture

The DISC architecture implements relocatable hardware with the linear hardware model on a single National Semiconductor CLAy31 FPGA coupled to an external RAM. The CLAy31 provides a 56 x 56 array of fine-grain logic cells allowing 56 complete rows in the linear hardware space. A complete processor is made by coupling a global controller to a library of custom-instruction circuit modules (see Figure 3).

[Figure 3 diagram labels: Instruction Module Library (Add, Subtract, Multiply, AND, a+b-c^d, Edge Detection, FFT); Processor with Global Control, Instruction Module A, Instruction Module B, Custom Module 1, Custom Module 2; Memory; Control, Address, Data]
Figure 3: DISC Linear Hardware Space.

5.1 Global Controller

The global controller provides the circuitry for operating and monitoring global resources such as the external RAM, I/O, the internal communication network and global state. The global controller consumes ten complete rows (approximately 1/6 of the chip) leaving 46 rows available for custom-instruction modules. The physical layout of the global controller, estimated at 1007 gates, along with the communication network is seen in Figure 4.

Figure 4: DISC Global Controller Layout.

The architecture of the global controller is seen in Figure 5 and comprises the following sub-modules:

- Data Register (DR): stores intermediate results, provides inter-module communication buffering and assists in complex address generation (8 bits),
- Address Register (AR): provides standard addressing modes for memory access (16 bits),
- Program Counter (PC): provides the sequencing capability of the processor (16 bits),
- Status Register (SR): stores internal state of the processor (4 bits),
- Instruction Register (IR): stores the opcode of the current instruction (8 bits),
- Global Control Unit (GCU): contains the circuitry necessary to preserve communication protocol, sequence through processor states, and interface with I/O.

[Figure 5 diagram labels: Status Register, Global Control Unit, Memory Control, Instruction Register, Program Counter, Address Register, Data Register; signals Status, Opcode, Memory Address, Memory Data, Data Register Feedback, Data Register Value; To External Memory, To Custom Instructions]
Figure 5: DISC Global Controller Architecture.

The global controller provides a consistent communication interface and standard protocol for all custom-instructions at every vertical location. The global signals available to the custom-instructions include the following:

- Data Register Value: accesses contents of Data Register (8 bits),
- Data Register Feedback: provides new values for Data Register (8 bits),
- Memory Address: allows address generation control by custom-instructions (16 bits),
- Memory Data: allows bi-directional access of memory data by custom-instructions (8 bits),
- Status Signals: provides control capability for custom-instructions (4 bits),
- Instruction Register: provides opcode of current instruction (8 bits).

The global controller is also responsible for sequencing through the instruction cycles for the custom-instruction modules. The following instruction cycles are implemented by the global controller:

- Instruction Fetch (IF),
- Operand Fetch (OF),
- Halt Processor (HP),
- Custom Cycle (CC),
- Instruction Execution (EX).

The IF cycle stores the current program memory into the instruction register and increments the program counter. The OF cycle stores the current program byte into the address register and also increments the program counter. The HP cycle causes all processor resources to remain idle and is used during configuration. The CC cycle is used by complex custom-instruction modules for adding additional cycles and has no effect on global resources. The EX cycle loads the value of the data register with the contents of the data register feedback path.

Each instruction in the library operates in one of two possible instruction cycle sequences: standard and custom. The standard instruction sequence follows a simple three-cycle execution: IF, OF, and EX. Any instruction that completes its computation or function in a single clock cycle, such as basic arithmetic and logic operations, will operate with this sequence.

The custom-instruction sequence offers additional cycles for complex custom-instructions. The custom sequence begins with the following two cycles: IF followed by OF. The sequence then varies by inserting as many CC cycles as necessary to complete a complex application-specific operation. The custom-instruction sequence completes with the EX instruction cycle. The custom-instruction module has complete control over the number of CC cycles needed for a particular function. Some instructions add as few as one cycle, while others require thousands of cycles for a single operation. Figure 6 displays the two instruction sequences.

    Standard Instruction Sequence:  IF  OF  EX
    Custom Instruction Sequence:    IF  OF  CC ... CC  EX
Figure 6: DISC Instruction Sequences.

The global control unit contains a number of default instructions necessary for controlling global resources. These instructions are used for sequencing, status control, and memory transfer and include the following:

- set carry: sets carry bit in status register,
- clear carry: clears carry bit in status register,
- store data register: store data register in memory,
- load data register: load data register from memory,
- conditional jump: jump with carry not set.

Each of these instructions follows the standard instruction sequence of three cycles. These instructions, coupled with the custom-instruction library designed for a particular application, provide the complete instruction set of the processor. An application can implement an instruction set of any size by paging instruction modules in a demand-driven manner from the instruction library.

5.2 Custom-instruction Modules

Custom-instruction modules vary in size and complexity, but each is designed to fit within the global context described above. Specifically, each module contains a decode and a data-path unit. Complex modules contain additional control structures.

The decode unit assigns a specific op-code to the custom instruction and is responsible for acknowledging its presence to the global controller. The decode unit compares the contents of the IR for a match against its own opcode during the OF cycle. On a positive match the module signals the global controller that the hardware is present and instruction sequencing continues.

The data-path is responsible for providing the proper connections to the global communication network and adhering to the established communication protocol. Instruction modules not executing refrain from sending any signals on the communication channel to prevent the corruption of other operating instructions. The data-path unit provides a new value for the data register during the EX stage. Most instructions perform their function by modifying the DR.

Several custom-instruction modules of varying size have been implemented on DISC. These vary from a simple single row shifter to a complex edge-detection module of 34 rows. Table 1 shows the current instructions available for DISC. The circuit layout for the Adder/Subtracter module is seen in Figure 7.

6 System Operation

The DISC processor was implemented on a PC-ISA custom board made exclusively for the study. The board includes static bus interface circuitry, two CLAy31 FPGAs, and memory. A configuration controller is implemented on the first FPGA to monitor
processor execution and request instructions from the host. DISC is implemented on the second FPGA and the application program memory is stored in the adjacent memory (see Figure 8). The board operates under a UNIX-based operating system and is controlled by a host device driver.

    Module                 Rows   Gates
    Shifter                1      50
    Comparator             3      155
    Add/Subtract           3      153
    Addressing Modes       4      447
    Masking Operations     5      193
    Logical Operators      9      232
    Bit-Level Operations   9      296
    Mean Filter            31     2156
    Edge Detector          33     2221
Table 1: Sample Custom Instruction Modules.

Figure 7: DISC Adder/Subtracter Custom Module Layout.

[Figure 8 diagram labels: DISC Processor (CLAy 31), Configuration Controller (CLAy 31), RAM, Bus Interface, ISA Bus, PC Host]
Figure 8: DISC System.

Performance has not been a main consideration as DISC was implemented primarily to study dynamic instruction set modification through partial reconfiguration. As a research tool, the processor is 8 bits and operates at the host bus speed of 7.5 MHz (maximum operating speed calculated at 12 MHz). Processor widths and operating speeds can be increased as device densities increase and tool enhancements become available.

A DISC application is initiated by first loading the program memory with the target application and second configuring the DISC FPGA with the global controller. During execution, the processor validates the presence of each instruction in the hardware. If the instruction requested by the application program does not exist on the hardware, the processor enters a halting state and requests the instruction module from the host.

Upon receiving a request for an instruction module, the host evaluates the current state of the DISC FPGA hardware and chooses a physical location for the requested module. The physical location is chosen based on available FPGA resources and the existence of idle instruction modules. If possible, the instruction module is loaded in an FPGA location not currently occupied by any other instruction module. If no empty hardware locations are available, a simple least-recently-used (LRU) algorithm is used to remove idle hardware. The host modifies the bit-stream of the requested hardware module to reflect the placement changes. The hardware module is then configured on the DISC platform by sending the new configuration to the system. Figure 9 provides a simplified flow chart of DISC instruction execution.

[Figure 9 flow chart: Fetch Instruction; Instruction Present? (YES: Execute Instruction; NO: Hardware Available?); Hardware Available? (NO: Remove Old Instruction(s)); Compute New Location; Configure Instruction Module; Execute Instruction]
Figure 9: DISC Instruction Execution.

One drawback of partially configuring the device during run-time is the overhead caused by continually reconfiguring instruction modules. The current board configures the DISC processor by sending the configuration bit-stream one bit per bus transfer over the PC-ISA bus. Operating at a maximum transfer rate of 1.5 Mb/sec, the PC-host is capable of configuring one row in 600 us. This represents 4511 processor cycles or 1500 simple instruction executions for each row configured. By removing the current system board and bus limitations, configuration speeds improve by a factor of 64 and operate at the device maximum of 12 MB/sec.

Custom instruction modules should remain resident in the processor for long periods of time to decrease the reconfiguration overhead. In addition, custom instruction modules should provide enough performance improvement over a sequence of general purpose ALU instructions to justify the cost of reconfiguration at run-time. The following application example will demonstrate this tradeoff.

7 Application Example

A simple image mean filter was developed as both a sequence of general purpose instructions and as an application-specific hardware module to demonstrate the performance improvements gained by tailoring the hardware to the application. Both demonstrations calculate the mean value of each pixel in an image, g(x, y), by obtaining an average over a 3x3 neighborhood as follows:

\[ g(x, y) = \frac{1}{8} \sum_{m=-1}^{1} \sum_{n=-1}^{1} g(x + m, y + n) \]

A coefficient of 1/8 was used to simplify the design. The 128 x 64 grey scale image in Figure 10 was used as the test image for both cases.

Figure 10: Original Test Image.

7.1 General Purpose Approach

The general purpose approach required four instructions not found in the processor core: add, subtract, shift, and enhanced addressing modes. These additional modules comprised a total of 8 rows, leaving 38 rows free for other custom instruction modules.

Execution of the algorithm centered on the inner loop calculation of the 3x3 neighborhood mean value. Calculating each pixel value involved individually adding each pixel of the neighborhood. Many of the instructions used for this summing operation involved address calculation and pointer manipulations. Computation of each pixel finishes with three shifts for the division by eight.

Complete processing of a pixel required an average of 160 instructions or 560 clock cycles. Processing the complete image, including overhead, required 4.59 Mclocks or 610 ms (7.5 MHz).

7.2 Application Specific Approach

The application-specific approach significantly improves performance of the algorithm by assuming control of address generation, buffering pixel values, and pipelining the arithmetic. With 31 rows of hardware, the extra registers, arithmetic operators and control logic consume significantly more hardware than the simple instructions used in the general purpose approach.

The MEAN instruction module calculates the average of a 3x3 neighborhood through the use of a sliding window as seen in Figure 11. Each numbered element of the sliding window represents a pixel register in the custom module. Instead of loading the entire window from memory at each pixel, register values are shifted to represent a sliding window (see Figure 12). Only registers 3, 6, and 9 are loaded at each new pixel.

    1 2 3
    4 5 6
    7 8 9
Figure 11: Sliding Pixel Window.

With the window registers loaded, the custom instruction module adds all nine pixel values in parallel with eight custom adders as seen in Figure 12. The division by eight is achieved by shifting the results three bit positions.

[Figure 12 diagram: pixel registers in shift order 3 2 1, 6 5 4, 9 8 7 feeding eight adders, followed by a Shift stage]
Figure 12: Data flow of MEAN Instruction Module.

The MEAN instruction requires only 7 clock cycles to evaluate each pixel of the image. The clock cycles are scheduled as follows:

1. Load register 3
2. Load register 6
3. Load register 9
4. Wait (add delay to parallel add)
5. Write results to image memory
6. Calculate new address
7. Shift register window

Reducing the pixel calculation to seven clock cycles and eliminating much of the address calculation overhead reduces the clock count from 4.59M in the general purpose case to 57k for an 80 times speedup. Operating at 7.5 MHz, the image is filtered in 7.6 ms. Figure 13 displays the image filtered with the MEAN custom instruction.
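The two approaches of Sections 7.1 and 7.2 can be compared in a short Python sketch. It is a software model under stated assumptions, not the DISC implementation: the function names are hypothetical, border pixels are simply left at zero (the paper does not describe border handling), and 8-bit overflow of the shifted sum is not modeled. The first function evaluates the 3x3 sum directly at every pixel. The second keeps nine window registers and loads only the right-hand column (registers 3, 6, and 9) per pixel, as the MEAN module does. The cycle counts at the end use only figures quoted above (128 x 64 pixels, 560 and 7 cycles per pixel).

```python
def mean_filter_direct(img):
    """Re-read the full 3x3 neighborhood at every interior pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + n][x + m]
                        for n in (-1, 0, 1) for m in (-1, 0, 1))
            out[y][x] = total >> 3      # divide by eight with three shifts
    return out


def mean_filter_sliding(img):
    """Sliding window: shift the registers, load only one new column."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        # Registers laid out as in Figure 11: [1 2 3], [4 5 6], [7 8 9].
        # Preload so that the first shift leaves columns 0 and 1 in place.
        win = [[0, img[y + n][0], img[y + n][1]] for n in (-1, 0, 1)]
        for x in range(1, w - 1):
            for r, n in enumerate((-1, 0, 1)):
                win[r][0], win[r][1] = win[r][1], win[r][2]  # shift window
                win[r][2] = img[y + n][x + 1]   # load register 3, 6, or 9
            out[y][x] = sum(map(sum, win)) >> 3
    return out


# Cycle counts from the figures quoted in the text.
PIXELS = 128 * 64
GENERAL_CYCLES = PIXELS * 560   # about 4.59M
CUSTOM_CYCLES = PIXELS * 7      # about 57k
RAW_SPEEDUP = GENERAL_CYCLES / CUSTOM_CYCLES

if __name__ == "__main__":
    # Small deterministic test image (an assumption for illustration).
    image = [[(7 * x + 13 * y) % 256 for x in range(8)] for y in range(6)]
    assert mean_filter_direct(image) == mean_filter_sliding(image)
    print(GENERAL_CYCLES, CUSTOM_CYCLES, RAW_SPEEDUP)
```

The sliding version performs three memory reads per pixel instead of nine, which mirrors the three load cycles in the seven-cycle schedule above.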
Figure 13: Test Image Filtered Through MEAN Custom Instruction.

7.3 Configuration Overhead

Because the cost of reconfiguring the application-specific instruction module is so high, configuration overhead must be considered when comparing the two approaches. The 31 row MEAN instruction requires an additional 140 kcycles for configuration, raising the total cycle count to 197 kcycles. The MEAN configuration overhead represents 71% of the total operating time. If device configuration speeds are maximized, this configuration overhead is reduced to 16% of the total operating time.

The extra four modules needed for the general purpose approach require only 36 kcycles for configuration. This represents less than 1% of the total operating time. When considering the high cost of configuration in total operating time, the MEAN filter custom instruction provides a 23 times speedup over the general purpose approach (see Table 2).

                           General    Application
                           Purpose    Specific
    Rows                   8          31
    Operation Cycles       4.59M      57k
    Raw Speedup            1          80
    Area x Time            36.7M      1.8M
    Configuration Cycles   36k        140k
    Total Cycles           4.63M      197k
    Actual Speedup         1          23.5
Table 2: Performance Comparison between General Purpose and Application Specific Approaches.

Although the techniques of partial configuration, relocatable hardware, and the linear hardware model were implemented as a general purpose processor, they offer similar advantages to other digital architectures. They may enhance the usefulness of FPGA co-processors by providing demand-driven computation. In addition, these techniques may allow FPGA-based computing machines to operate in more dynamic environments such as multi-tasking operating systems. Any digital architecture that could benefit from demand-driven hardware may find these techniques useful.

References

[1] Algotronix, Edinburgh, UK. CAL1024 Preliminary Data Sheet, 1988.

[2] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set metamorphosis. Computer, 26(3):11-18, March 1993.

[3] Atmel, San Jose, CA. Configurable Logic: Design & Application Book, 1993-1994.

[4] R. Camerota and J. Rosenberg. Data acquisition systems using Cache Logic FPGAs. In Configurable Logic: Design & Application Book, pages 7.15-7-18. Atmel, San Jose, CA, 1993-1994.

[5] J. Davidson. FPGA implementation of a reconfigurable microprocessor. In Proceedings of the IEEE 1993 Custom Integrated Circuits Conference, pages 3.2.1-3.2.4, 1993.

[6] J. G. Eldredge and B. L. Hutchings. Density enhancement of a neural network using FPGAs and run-time reconfiguration. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 180-188, Napa, CA, April 1994.

[7] B. S. Fagin. Quantitative measurements of FPGA utility in special and general purpose processors. Journal of VLSI Signal Processing, 6(2):129-137, August 1993.

[8] P. C. French and R. W. Taylor. A self-reconfiguring processor. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 50-59, Napa, CA, April 1993.
[9] X. P. Ling and H. Amano. WASMII: a data driven
8 Conclusions computer on a virtual hardware. In D. A. Buell
The DISC processor successfully demonstrates that and K. L. Pocek, editors, Proceedings of IEEE
application speci c processors with arbitrarily large Workshop on FPGAs for Custom Computing Ma-
instruction sets can be be constructed on partially chines, pages 33{42, Napa, CA, April 1993.
recon gurable FPGAs. The relocatable hardware
model improved run-time utilization of FPGA re- [10] P. Lysaght. Dynamically recon gurable logic
sources and the linear hardware model provided a con- in undergraduate projects. In W. Moore and
venient framework for relocating custom instruction W. Luk, editors, FPGAs: Proceedings of the 1991
modules. DISC demonstrates the general concept of International workshop on eld-programmable
alleviating density constraints of FPGAs by partially logic and applications, Oxford, England, Septem-
recon guring a device at run-time. ber 1991. Abingdon EE and CS Books.
[11] P. Lysaght and J. Dunlop. Dynamic reconfiguration of FPGAs. In
W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993
International workshop on field-programmable logic and applications,
pages 82-94, Oxford, England, September 1993.
[12] S. Monaghan and C. P. Cowen. Reconfigurable multi-bit processor
for DSP applications in statistical physics. In D. A. Buell and K. L.
Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom
Computing Machines, pages 103-110, Napa, CA, April 1993.
[13] National Semiconductor. Configurable Logic Array (CLAy) Data
Sheet, December 1993.
[14] T. G. Rauscher and A. K. Agrawala. Dynamic problem-oriented
redefinition of computer architecture via microprogramming. IEEE
Transactions on Computers, C-27(11):1006-1014, November 1978.
[15] J. Rosenberg. Implementing Cache Logic(TM) with FPGAs. In
Configurable Logic: Design & Application Book, pages 7.11-7.14. Atmel,
San Jose, CA, 1993-1994.
[16] M. J. Wirthlin, B. L. Hutchings, and K. L. Gilson. The Nano
Processor: A low resource reconfigurable processor. In D. A. Buell and
K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom
Computing Machines, pages 23-30, Napa, CA, April 1994.
[17] A. Wolfe and J. P. Shen. Flexible processors: a promising
application-specific processor design approach. In Proceedings of the
21st Annual Workshop on Microprogramming and Microarchitecture -
MICRO '21, pages 30-39, San Diego, CA, November 1988.
