New Methodology for Asynchronous Digital Circuit Design

A New Methodology for the Design of Asynchronous Digital Circuits
S. K. Roy*
S. K. Desai
K. Nanda
Department of Electrical Engineering

Indian Institute of Technology, Kanpur-208016.
skroy@iitk.ernet.in
10th International Conference on VLSI Design, January 1997
0-8186-7755-4/96 $05.00 © 1996 IEEE

Abstract

This paper discusses a new design methodology for asynchronous digital circuits. The methodology is based on an event driven scheme and follows the double-rail logic handshake protocol. A new logic gate, called the Universal Gate, is designed; this is the basic building block of the methodology. It is shown that the methodology is completely delay insensitive. As an example, the Shift Multiplier [7] is implemented.

1. Introduction

As the number of active devices on a chip and the device speed increase and the minimum feature size goes down, the clock becomes a major limiting factor in synchronous circuits [1]. Theoretically, an asynchronous design has several advantages over the synchronous design. Due to the absence of a clocking signal, there is no clock skew problem and there is a decrease in the power consumption. Also, asynchronous circuits give average case instead of worst case performance and ease global timing issues. They also have a better potential for technology migration. However, due to the absence of a clocking signal, asynchronous circuits are, in general, more difficult to design. Though many methodologies exist for asynchronous design [3], automation of design has still not been fully achieved.

In the sections that follow, a novel method for designing asynchronous circuits is outlined. It is based on a new basic gate whose design is given here. This, it is hoped, will make the design methodology easy to automate.

The next section overviews a general technique for communicating asynchronously. A new gate design, called the Universal Gate, on which this methodology is based is presented in Section 3. Logic minimization techniques are discussed in Section 4. These are illustrated in detail by considering the design of an ADDER. Memory interface circuits are presented in Section 5 and we conclude in Section 6 by integrating the principles developed and applying them to the design of a shift multiplier [7].

2. Data Signal Representations and Control Protocols

In synchronous circuits, the global clock provides the communication link between the functional units. In the absence of the clock, alternative schemes (referred to as handshaking) need to be devised to achieve the communication [1]. This section briefly describes the double-rail logic handshake protocol.

In double-rail logic, each input and output requires a pair of wires carrying both the data and the handshaking information. One double-rail coding scheme that has been used in a number of designs is the three-state coding [4]. A completely asynchronous microprocessor has been built using this coding scheme [5]. The code has three detectable states: logical 1, logical 0 and a null state. After transmitting each bit of data (a 1 or a 0), a null N is transmitted as a separator. The states can be represented in a number of ways; the TITAC [5] representation is
00 = acknowledge-null
01 = data value-logical 0
10 = data value-logical 1
11 = not allowed.
This particular coding prevents both wires from switching at the same time. Also, since a null has to be present between two data bits, only one transition occurs at a time, thus preventing hazardous operation.

The use of a null to delimit data simplifies the detecting logic, but is inefficient. It limits the throughput because two signals, data and a null, must be propagated serially for every bit of useful information transmitted. Transmitting a null after every bit also increases the power dissipation. Increasing the throughput requires that the null not be transmitted after every data bit, while taking care that a race condition does not develop. A way to ensure this is to require only one transition to signal a data bit. One way to achieve this is the Non-Return-to-Zero (NRZ) event driven scheme.

In an event driven scheme [6], any transition, either rising or falling, signifies a data bit. Either kind of transition is called an event. All responses are edge triggered and are triggered by both the falling and the rising edge. Transition signaling is, for example, used for controlling data (bundled data) transfer in Micropipelines [6]. In the methodology proposed here, an event driven scheme is used both for data transfer and for control signaling.

In the two-rail representation, adopted here for data bits, a 1 bit is indicated by a transition on one of the two rails (called, say, rail r1) and a 0 bit by a transition on the other rail (r0). At any particular instant an event can occur on only one of the two rails. Events occurring simultaneously on both rails would imply invalid data and are therefore not allowed.
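The difference between the three-state coding and the two-rail NRZ event coding can be illustrated with a short Python sketch. It is only a behavioural illustration: the function names, the all-low starting state and the example bit stream are assumptions made here, not part of the original design.

```python
def three_state_encode(bits):
    """Successive (wire1, wire0) levels: a data codeword, then a null, per bit."""
    states = []
    for b in bits:
        states.append((1, 0) if b else (0, 1))  # 10 = logical 1, 01 = logical 0
        states.append((0, 0))                   # 00 = null separator
    return states


def nrz_encode(bits, start=(0, 0)):
    """Successive (r1, r0) levels: each bit toggles exactly one rail."""
    r1, r0 = start
    states = []
    for b in bits:
        if b:
            r1 ^= 1   # a 1 bit is an event on rail r1
        else:
            r0 ^= 1   # a 0 bit is an event on rail r0
        states.append((r1, r0))
    return states


def nrz_decode(states, start=(0, 0)):
    """Recover the bits by looking at which rail changed at each step."""
    prev, bits = start, []
    for s in states:
        on_r1 = s[0] != prev[0]
        on_r0 = s[1] != prev[1]
        if on_r1 == on_r0:
            raise ValueError("exactly one rail must change per data bit")
        bits.append(1 if on_r1 else 0)
        prev = s
    return bits


def count_transitions(states, start=(0, 0)):
    """Total number of wire transitions needed to send the sequence."""
    prev, total = start, 0
    for s in states:
        total += (s[0] != prev[0]) + (s[1] != prev[1])
        prev = s
    return total


bits = [1, 0, 0, 1]
three_state_cost = count_transitions(three_state_encode(bits))
nrz_cost = count_transitions(nrz_encode(bits))
```

For this four-bit stream the three-state coding needs two transitions per bit (data, then null) while the NRZ coding needs one, which is the throughput and power argument made above.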
3. The Universal (U) Gate
It is not possible to use normal level-sensitive logic elements such as AND, OR and INVERT to realise the equivalent logic blocks of their asynchronous counterparts under our way of representing data values. This is the motivation for the Universal gate proposed here. We show how to configure this gate to give the AND, OR and INVERT elements in the NRZ asynchronous paradigm. Furthermore, we do not separate the data representation from the control signalling as in the Micropipeline case. This, we conjecture, will enable a uniform and simpler way to synthesize asynchronous designs.

A two-input AND gate actually has four input lines and two output lines. The AND gate generates an event on the output1 line if and only if all the input1 lines show transitions. If any of the input0 lines shows a transition then the output0 line generates an event. It must be noted here that an output event occurs only after all the input events have been received.

It can be seen that the two-rail AND and OR gates are essentially the same circuit. If we interchange the two input rails of every input pair, and also interchange the output rails, then the AND circuit becomes an OR circuit and the OR becomes the AND.

The two-rail NOT gate is realised simply by interchanging the input1 and the input0 rails. A transition on the input0 (input1) rail then results in a transition on the output1 (output0) rail, which realises the two-rail NOT gate.
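The rail-swapping argument can be checked with a small Python sketch. It abstracts each two-rail signal to the index of the rail (1 or 0) on which the event arrives; the function names are illustrative assumptions, and timing is not modelled.

```python
def swap(rail):
    """Interchange rail 1 and rail 0 of a two-rail signal."""
    return 1 - rail


def two_rail_and(a_rail, b_rail):
    """Rail that receives the output event once both input events have arrived."""
    return 1 if (a_rail == 1 and b_rail == 1) else 0


def two_rail_not(a_rail):
    """NOT is only a wiring change: the rails are interchanged."""
    return swap(a_rail)


def two_rail_or(a_rail, b_rail):
    """The AND circuit with every input pair and the output pair interchanged."""
    return swap(two_rail_and(swap(a_rail), swap(b_rail)))


or_table = {(a, b): two_rail_or(a, b) for a in (0, 1) for b in (0, 1)}
```

The resulting `or_table` agrees with the ordinary OR truth table, so no separate OR circuit is needed in this model.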
As seen in the above discussion, only one gate needs to be realised to implement any Boolean function in the two-rail non-return-to-zero protocol. We describe the functionality of such a gate. Since all logic functions of a given set of inputs can be obtained from the same gate, we call it a Universal gate (U-gate).

The two-input U-gate discussed here has two two-rail inputs and four single-rail outputs. The four possible input combinations are 00, 01, 10 and 11; depending on the input, exactly one of the four output lines shows a transition. Therefore, to realise the two-rail AND gate, we identify the output line corresponding to the 11 input as the output1 rail, and the remaining outputs are disjunctively combined using XOR gates to form the output0 rail. The two-rail OR gate can be realised similarly.
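A behavioural Python sketch of this description follows. It assumes each of the four output lines simply toggles when its input combination arrives, and it builds the AND and OR rails by XOR-combining those lines. The class and function names are hypothetical.

```python
class BehavioralUGate:
    """Two-input U-gate abstracted to four toggling output lines."""

    def __init__(self):
        # One output line per input combination (a, b), all initially low.
        self.m = {(a, b): 0 for a in (0, 1) for b in (0, 1)}

    def apply(self, a, b):
        """A complete two-rail input (a, b) toggles exactly one output line."""
        self.m[(a, b)] ^= 1
        return dict(self.m)


def and_rails(m):
    """(output1, output0) rails of the two-rail AND built from the U-gate."""
    return m[(1, 1)], m[(0, 0)] ^ m[(0, 1)] ^ m[(1, 0)]


def or_rails(m):
    """(output1, output0) rails of the two-rail OR built from the U-gate."""
    return m[(0, 1)] ^ m[(1, 0)] ^ m[(1, 1)], m[(0, 0)]


def run(rails_fn, inputs):
    """Feed a sequence of inputs and decode which output rail toggled each time."""
    gate = BehavioralUGate()
    prev = rails_fn(gate.m)
    decoded = []
    for a, b in inputs:
        cur = rails_fn(gate.apply(a, b))
        toggled1 = cur[0] != prev[0]
        toggled0 = cur[1] != prev[1]
        assert toggled1 != toggled0, "exactly one output rail must show an event"
        decoded.append(1 if toggled1 else 0)
        prev = cur
    return decoded


inputs = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1)]
and_bits = run(and_rails, inputs)
or_bits = run(or_rails, inputs)
```

Because only one output line toggles per input, the XOR of the lines assigned to a rail toggles exactly when one of them does, which is why XOR can stand in for a disjunction of events here.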
Figure 1. The Two-input Universal (U) Gate
The two-input U-gate is shown in Figure 1. It is designed using eight XOR gates and four Muller-C elements [1,6]. Two XORs and one Muller-C element form a block. Four instances of the block with appropriate feedback give a two-input U-gate. Each rail input is fed to two blocks. Each block receives two rail inputs and two feedback signals from the U-gate outputs, arranged so that each XOR gate in a block receives exactly one input rail signal and one feedback signal. The outputs of the four Muller-C elements constitute the four output lines of the U-gate.

Assume that transitions occur on input rails A1 and B1. The XOR gates will pass the transitions to the inputs of the Muller-C elements. As a result both inputs of Muller-C element 3 see a transition and the element fires, resulting in an output event on M3. C elements 2 and 1 will also see a transition on one of their inputs. Although valid inputs were received and a valid output generated, this state of these two C elements is not acceptable as an initial state for the next input. This is because, in this state, C elements 2 and 1 have each seen a transition on one of their inputs and are therefore waiting for a transition on the other input. So if a transition now occurs on B0 then C element 2 will fire even if no input is received on the A input line. This is obviously an error. Rectifying it requires that the transitions on the inputs of C elements 2 and 1 be negated by forcing another transition on those lines. This is done by feeding back M3 to the relevant inputs of C elements 2 and 1 via the XOR gates. So now when M3 fires another transition occurs at the inputs of the XOR gates which have A1 or B1 as inputs. These transitions cancel the earlier transitions. Note that M3 is not fed back to the Block 3 XOR gates as this would again result in an error. This can be easily generalised for any of the four input combinations.

Truth tables for the CarryOut and the Out operations (columns: AB; rows: C).

CarryOut:
AB:   00  01  11  10
C=0:   0   0   1   0
C=1:   0   1   1   1

Out:
AB:   00  01  11  10
C=0:   0   1   0   1
C=1:   1   0   1   0

CarryOut = A.(B + C) + B.C
Out = (A'B + B'A).C' + (A'B' + A.B).C

Figure 2. Full Adder Implementation: Approach 1 (note the realisation of the NOT gate)

Figure 3. Full Adder Implementation: Approach 2
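The feedback argument can be exercised with a small structural model in Python. This is a sketch under stated assumptions, not the circuit of Figure 1: the text fixes only that each XOR receives one rail and one feedback signal, so the specific wiring below (the XOR on rail A of block (a, b) is fed from the output of the other block sharing that rail, and likewise for B) is an assumption, as are the class names and the zero-delay settling loop. Blocks are indexed by the input combination (a, b) they detect, so block (1, 1) corresponds to M3.

```python
class CElement:
    """Muller C-element: follows its inputs when they agree, otherwise holds."""

    def __init__(self):
        self.out = 0

    def update(self, x, y):
        if x == y:
            self.out = x
        return self.out


class UGate2:
    """Two-input U-gate built from four blocks of two XORs and one C-element."""

    def __init__(self, feedback=True):
        self.feedback = feedback
        self.rails = {"A0": 0, "A1": 0, "B0": 0, "B1": 0}
        self.blocks = {(a, b): CElement() for a in (0, 1) for b in (0, 1)}

    def outputs(self):
        return {key: c.out for key, c in self.blocks.items()}

    def settle(self):
        """Re-evaluate every block until no C-element output changes."""
        while True:
            before = self.outputs()
            for (a, b), c in self.blocks.items():
                # Assumed wiring: feedback comes from the block sharing the rail.
                fb_a = before[(a, 1 - b)] if self.feedback else 0
                fb_b = before[(1 - a, b)] if self.feedback else 0
                x = self.rails[f"A{a}"] ^ fb_a   # first XOR of the block
                y = self.rails[f"B{b}"] ^ fb_b   # second XOR of the block
                c.update(x, y)
            if self.outputs() == before:
                return before

    def event(self, rail):
        """Apply one transition on the named input rail and let the gate settle."""
        self.rails[rail] ^= 1
        return self.settle()


# With feedback: input 11 fires only M(1,1); a lone B0 event fires nothing.
good = UGate2()
good.event("A1")
after_11 = good.event("B1")
after_lone_b0 = good.event("B0")
after_10 = good.event("A1")

# Without feedback: the stale A1 transition lets block (1,0) fire on B0 alone.
bad = UGate2(feedback=False)
bad.event("A1")
bad.event("B1")
bad_after_lone_b0 = bad.event("B0")
```

In this model the gate with feedback waits for the A event before firing the 10 output, while the gate without feedback reproduces the erroneous firing described above.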

The circuit level implementation of the U-gate was verified through SPICE simulations for a minimum feature size of 1 µm. A delay of approximately three nanoseconds (typical) was observed.

The U-gate would be of limited use if it served only to build two-rail AND and OR circuits, which were then used to implement Boolean functions. U-gates can, however, be used to implement any Boolean function directly, however complex. The implementation complexity is the same as that encountered in implementing the two-rail AND gate; more XOR gates will, however, be required. Apart from being used to realise Boolean functions, U-gates are also used as registers, MUXes and DEMUXes.
4. Minimization of Boolean Expressions

In this section we present three ways in which asynchronous combinational logic blocks can be realised using U-gates.

In the first approach, the given Boolean expression is minimized and the circuit is thereafter expressed using AND, OR and NOT gates. In the implementation stage, we use the two-rail AND, OR and NOT gates. The resulting circuit is the two-rail implementation of the given Boolean expression.

The second approach uses the fact that any n-input U-gate generates all the $2^n$ minterms. By disjunctively combining the required minterms, a Boolean function can be realised. The major stumbling block in this approach is the complexity of the U-gates. The two-input U-gate has four blocks, and the outputs are fed back as eight feedbacks. The three-input U-gate is composed of a two-input U-gate plus eight blocks, and the total number of feedbacks is 32. In a four-input U-gate, the total number of feedbacks is 96. This makes U-gates with a large number of inputs very complex and large. As a result this approach can be used only if the number of inputs is small.

The third approach is a mixture of the above two methods. It is well known that any n-input Boolean function can be decomposed and represented in terms of Boolean functions with a smaller number of inputs. A function f can be expressed in terms of two functions f1 and f2 as under.

$f(x_1, x_2, \ldots, x_n) = f_1[f_2(x_1, x_2, \ldots, x_k), \{x_{k+1}, \ldots, x_n\}]$

where f1 is a function of (n-k+1) variables. We realise f1 and f2 to implement f using appropriate U-gates.
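The decomposition used in the third approach, f(x1, ..., xn) = f1[f2(x1, ..., xk), x(k+1), ..., xn], can be checked by exhaustive enumeration for small n. The Python sketch below is purely illustrative: the helper name `decomposes` and the example functions are assumptions chosen here, not taken from the paper.

```python
from itertools import product


def decomposes(f, f1, f2, n, k):
    """True if f(x1..xn) == f1(f2(x1..xk), x(k+1)..xn) for every input."""
    return all(
        f(*x) == f1(f2(*x[:k]), *x[k:])
        for x in product((0, 1), repeat=n)
    )


def parity3(a, b, c):
    return a ^ b ^ c


def majority3(a, b, c):
    return (a & b) | (b & c) | (a & c)


def xor2(a, b):
    return a ^ b


def and2(a, b):
    return a & b


def or2(a, b):
    return a | b


# Parity splits into two 2-input functions: f2 = XOR(a, b), f1 = XOR(f2, c).
parity_ok = decomposes(parity3, xor2, xor2, n=3, k=2)

# Majority is not equal to OR(AND(a, b), c), so this particular split fails.
majority_ok = decomposes(majority3, or2, and2, n=3, k=2)
```

Here f1 takes n-k+1 = 2 inputs, matching the variable count stated above. A function for which a single f2 is not enough needs several intermediate functions of the first k variables.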
4.1. Full Adder in the Two-Rail Representation

The above approaches are demonstrated here using the example of the Full Adder. All three implementations of the Adder are presented and compared. A and B are the two one-bit inputs to the Adder and C is the CarryIn to the Full Adder. All are in the two-rail representation.

Figure 2 demonstrates the first approach. Two-rail AND (&) and OR (+) circuits are utilised to implement the circuit. Since it is assumed that only two-input U-gates are available, only two-input AND and OR operations are possible. From the Boolean expressions for Out and CarryOut it seems that the total number of two-input U-gates required to implement the Full Adder is nine. This number can, however, be reduced. Note that the two two-rail XOR gates have the same set of inputs. Therefore both can be realised using one U-gate. Similarly, one U-gate can be saved in the CarryOut implementation. Thus the number of two-input U-gates required to implement the one-bit Full Adder is seven. The total number of blocks (each consisting of 2 XORs and 1 C-element) is 28.

The second approach uses the three-input U-gate. This is shown in Figure 3. Both the Output and the CarryOut can be realised using just one three-input U-gate. The number of blocks required in this implementation reduces to 12, far fewer than the 28 of the previous case. The total delay will also be less in this approach than in the previous one.

The implementation of the one-bit Full Adder based on the third approach is shown in Figure 4. The Truth Tables of the CarryOut and the Output signal are shown. Two-input U-gates are used to realise each of the two rows of the Truth Tables separately. These are then combined to give the required output by using two other two-input U-gates and two XOR gates. The four initial U-gates all have A and B as the inputs and therefore can be replaced by just one U-gate. Therefore the total number of U-gates equals five and the total number of blocks 20. This approach is clearly better than approach 1 in terms of both area and delay.
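The second and third approaches, and the block counts quoted above, can be cross-checked with a short Python sketch. It works at the level of Boolean values only; the function names are hypothetical, and the minterm sets are derived from the standard full-adder truth table.

```python
from itertools import product

# Approach 2: one three-input U-gate fires exactly one of its 8 minterm lines;
# each output is formed by combining the lines listed for it.
SUM_MINTERMS = {(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
CARRY_MINTERMS = {(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)}


def full_adder_minterms(a, b, c):
    fired = (a, b, c)  # the single minterm line that shows an event
    return int(fired in SUM_MINTERMS), int(fired in CARRY_MINTERMS)


# Approach 3: each truth-table row is a function of A and B only,
# and C selects the row (Row 1 for C=0, Row 2 for C=1).
def full_adder_rows(a, b, c):
    out_rows = (a ^ b, 1 - (a ^ b))   # Out: XOR row, then XNOR row
    carry_rows = (a & b, a | b)       # CarryOut: AND row, then OR row
    return out_rows[c], carry_rows[c]


# Block counts, using the figures given in the text.
BLOCKS_PER_TWO_INPUT_GATE = 4
approach1_blocks = 7 * BLOCKS_PER_TWO_INPUT_GATE   # seven two-input U-gates
approach2_blocks = 4 + 8                           # two-input U-gate plus eight blocks
approach3_blocks = 5 * BLOCKS_PER_TWO_INPUT_GATE   # five two-input U-gates

all_inputs = list(product((0, 1), repeat=3))
reference = [((a + b + c) % 2, (a + b + c) // 2) for a, b, c in all_inputs]
via_minterms = [full_adder_minterms(*x) for x in all_inputs]
via_rows = [full_adder_rows(*x) for x in all_inputs]
```

Both formulations agree with binary addition on all eight inputs, and the block counts reproduce the 28, 12 and 20 stated for the three approaches.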
Figure 4. Full Adder Implementation: Approach 3 (panels: OUTPUT and CARRYOUT). The first and second rows of the Out Truth Table are realised separately using two-input U-gates. When C=0, Row 1 determines the output and when C=1, Row 2 determines the output.
5. Memory Interface

The memory is assumed to be of the conventional kind used in synchronous designs. This implies that the inputs to the memory and the outputs from it are normal logic levels. Thus any system designed in the two-rail non-return-to-zero protocol discussed here will be unable to communicate with the memory in its present form. Interfacing circuits are therefore required.

5.1. The 2to1 Converter

Before discussing the memory interface circuits it is important to discuss two components of these circuits. These are the 1to2 and the 2to1 converters, that is, the conventional to two-rail logic and the two-rail to conventional logic conversion circuits. Consider the 2to1 converter first. The input is a single bit in the two-rail format and the output is a single line. If an event occurs on the 1-bit rail of the input then the output should be high and if an event occurs on the 0-bit rail then the output should be low. A low output is the default value, that is, when no input is present the output remains low. Also, when an input arrives, the output generated remains valid for a certain fixed amount of time, after which it returns to its default value. It is important that no new inputs arrive during this fixed interval, since otherwise an error would result. The circuit implementation is shown in Figure 5. The output generated can be latched using an asynchronous register [5].

Figure 5. The 2to1 Converter. A1 and A0 are the two rails of the input A. The output is single rail. It is in the form of a single pulse of duration D. The figure also labels a signal indicating completion.
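The behaviour required of the 2to1 converter can be summarised in a few lines of Python. This is a behavioural sketch only, not the circuit of Figure 5: the function name, the representation of events as (time, rail) pairs and the `pulse_width` parameter (standing for the fixed output interval) are assumptions.

```python
def two_to_one(events, pulse_width):
    """Convert two-rail events to a single-rail pulsed output.

    events: list of (time, rail) pairs sorted by time, rail being 1 or 0.
    Returns the (start, end) intervals during which the output is high.
    """
    high_intervals = []
    busy_until = float("-inf")
    for t, rail in events:
        if t < busy_until:
            # The text requires that no new input arrives during the interval.
            raise ValueError("new input arrived during the output interval")
        busy_until = t + pulse_width
        if rail == 1:
            high_intervals.append((t, t + pulse_width))
        # An event on the 0-bit rail leaves the output at its default low.
    return high_intervals


pulses = two_to_one([(0, 1), (10, 0), (20, 1)], pulse_width=5)
```

Events on the 1-bit rail produce a high pulse of the fixed duration, events on the 0-bit rail leave the output low, and an input that arrives too early is rejected.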
5.2. The 1to2 Converter

In the 1to2 converter the input is single rail and the output generated is two-rail. This is achieved in two steps. First the single-rail data is converted into two-rail return-to-zero data, that is, if the single-rail data is high then a high-going pulse is generated on the 1-bit rail and if the single-rail data is low then a high-going pulse is generated on the 0-bit rail. This two-rail data is then fed to the second stage where the pulse is converted into an event. The single-rail input is A. The line labeled C is a control signal which determines when the conversion should start. The input on C is a high-going pulse, the duration of which determines the duration of the output two-rail pulse of the first stage. The duration of the pulse should be such that an event occurs at the output Z of the second stage. An implementation is shown in Figure 6.

Figure 6. The 1to2 Converter. Stage I converts the single-rail input into a two-rail output in the return-to-zero protocol; Stage II converts it into a two-rail output in the non-return-to-zero protocol.

5.3. Memory Interface

The memory supports two operations: reading from memory and writing to memory. For the Read operation the address of the location is the input and the data in the location is the output. While writing, address and data are the inputs and there are no outputs. Here only the Read operation is discussed. The Write operation will be similar.

The first step is to convert the address into the single-rail, bundled data format. This is achieved through the 2to1 converter discussed above. The next step is to inform the memory when valid data is available. A Read Enable signal must be provided to the memory for the amount of time the memory needs to output valid data. Once this is over the 1to2 converter must be fed a pulse so that the conversion takes place. One way of implementing this is shown in Figure 7. The delay elements D1 and D2 are used to generate the Read Enable signal. The delay of D1 is fixed by the memory access time while the delay of D2 accounts for the delay in the 1to2 converter.

Figure 7. Memory Read: Circuit Implementation

6. An Example: The Shift Multiplier
To integrate and demonstrate all the ideas presented above, an example is considered here: the Shift Multiplier [7]. An entirely asynchronous design for the multiplier is implemented in the two-rail non-return-to-zero protocol. Figure 8 shows the data path section of the Shift Multiplier and Figure 9 the control path. The universal character of the U-gate can be easily observed here. All the data path and the control path elements are implemented using either the U-gate or the simpler Muller-C and XOR elements. The design implemented here was derived from the synthesized synchronous design [7] by a point-to-point translation followed by removal of the redundant elements.

The basic U-gate, the Muller-C element and the Shift Multiplier circuit were simulated in Verilog. Structural as well as behavioral models were developed for the Muller-C element and the U-gate. Different modules, such as the ADDER and the SHIFTER, were defined and connected to realise the data path of the Multiplier. The control path mainly consists of Muller-C elements, XOR gates and a Select module, itself implemented using Muller-C elements and XOR gates.

Simulation results for the 4-bit Multiplier considered here showed that the circuit designed using the methodology presented here works correctly when tested for a variety of input sets. A delay of 1 ns for XOR gates, 6 ns for the U-gate, 3 ns for the Muller-C element and 6 ns for the buffer, select and 1to2 converter was assigned. The worst case delay (multiplying decimal 15 by 15) was found to be 452 ns for the structural case and 500 ns for the behavioural simulation. Note that no separate delays were assigned to the wires; however, slightly larger delays were assigned to the circuit blocks to account for interconnect delays. A test for complete delay-insensitivity was conducted by assigning arbitrarily large and random delays to different modules and interconnections. In every case the circuit worked correctly, thus demonstrating that this methodology is truly delay-insensitive.
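For orientation, the arithmetic performed by a shift multiplier can be sketched in Python. The paper does not spell out the algorithm, so the conventional shift-and-add scheme is assumed here, and the function name and the `width` parameter are hypothetical. The sketch says nothing about the asynchronous control or timing of the actual design.

```python
def shift_multiply(a, b, width=4):
    """Multiply two unsigned width-bit numbers by shifting and adding."""
    product = 0
    multiplicand = a
    multiplier = b
    for _ in range(width):
        if multiplier & 1:        # lowest multiplier bit selects an addition
            product += multiplicand
        multiplicand <<= 1        # shift the multiplicand left
        multiplier >>= 1          # move on to the next multiplier bit
    return product


worst_case = shift_multiply(15, 15)
```

With both operands at 15 every iteration performs an addition, which is consistent with 15 by 15 being quoted as the worst case for the 4-bit multiplier.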
Figure 8. Data Path of the Shift Multiplier

Figure 9. Control Structure of the Shift Multiplier

7. Conclusions and Future Work

In this paper a design methodology for asynchronous circuits is introduced and demonstrated. It is shown with the aid of an example that the proposed methodology is completely delay-insensitive. This is because of the unique design of the U-gate. Though the methodology is much inferior to synchronous methodologies in terms of area, it is our claim that as the complexity of the circuit increases, the difference in area will decrease. This is based on the fact that the U-gate allows the designer more freedom as far as minimization of Boolean expressions is concerned.

One very important question is that of automation of design. One possibility is, as demonstrated in the Shift Multiplier example, a point-to-point translation from the available synchronous designs. However, note that the data path and the control path can be easily integrated in this methodology. Whereas in the synchronous Shift Multiplier the control path is implemented as a Finite State Machine, in the asynchronous design it is an extension of the data path. This can be exploited in the design of circuits. Point-to-point translation cannot take full advantage of this. An automation tool is thus required which gives a circuit implementation from a given behavioral description. Work is now in progress on developing such a tool.

References

[1] C. L. Seitz. System Timing. Introduction to VLSI Systems, C. Mead and L. Conway, eds., Addison-Wesley, Reading, Massachusetts, 1980, ch. 7.
[2] Stephen H. Unger. Hazards, Critical Races, and Metastability. IEEE Transactions on Computers, Vol. 44, No. 6, June 1995, pp. 754-768.
[3] Scott Hauck. Asynchronous Design Methodologies: an Overview. Technical Report 93-05-07, Department of Computer Science and Engineering, University of Washington.
[4] A. J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, Jan. 1986, pp. 226-234.
[5] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, A. Takamura. TITAC: Design of a Quasi-Delay-Insensitive Microprocessor. IEEE Design and Test of Computers, Summer 1994.
[6] Ivan E. Sutherland. Micropipelines. Communications of the ACM, Vol. 32, No. 6, June 1989.
[7] D. Gajski, N. Dutt, A. Wu, and S. Lin. High Level Synthesis: Introduction to Chip and Systems Design. Kluwer Academic Publishers, 1992.
[8] K. Nanda. Design of Asynchronous Digital Circuits. B. Tech Project Report, Department of Electrical Engineering, Indian Institute of Technology, Kanpur, April 1996.
