
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 1, MARCH 1998, p. 173

A Novel Design of a Two Operand Normalization Circuit


Elisardo Antelo, Montserrat Bóo, Javier D. Bruguera, and Emilio L. Zapata

Abstract: This paper presents a new design for two operand normalization. The two operand normalization operation involves the normalization of at least one of two operands by left shifting both by the same amount. Our design computes the shift by ORing the bits of both operands in a tree network, encoding the position of the first nonzero bit. The encoded position is obtained most significant bit first, so it overlaps with the shifting operation. The design we propose replaces the two leading zero detector circuits and the comparator that are present in the conventional approach, and proves to be more area efficient than the conventional scheme. The circuit we propose is useful in floating point complex multiplication and in COordinate Rotation DIgital Computer (CORDIC) processors.

Index Terms: Digital VLSI design, floating point operations, leading zero detector circuit, normalization.

I. INTRODUCTION

There are many applications where a two operand normalization step is needed [3], [5], [7]. This implies shifting the two operands until one of them is normalized into a certain range (i.e., [1, 2) for IEEE floating point arithmetic). Typical applications that require this operation are floating point operations over complex numbers, like multiplication [7] or addition/subtraction; fixed-point CORDIC (COordinate Rotation DIgital Computer) arithmetic for angle calculation in the SVD (singular value decomposition) algorithm [4], [5]; and the normalization of one operand expressed in carry-save redundant arithmetic, useful in arithmetic coders [1] and for detecting the sign of a carry-save operand. Conventional two operand normalization requires leading zero detection, comparison, and shifting.

In this paper, we present an area-efficient design for the two operand normalization problem based on encoding the position of the first nonzero bit from the most significant side of both operands by means of a tree network of OR operations. As the position of the first nonzero bit is obtained most significant bit first, it overlaps with the shift of the operands. Our scheme avoids the comparison of the numbers of leading zeros of the two operands, resulting in a more area-efficient implementation that maintains the speed.

The structure of the paper is as follows. Section II reviews conventional two operand normalization. In Section III, we propose the new scheme. In Section IV, we compare the conventional and the new design. Section V presents two applications, and finally Section VI reports some conclusions.

II. CONVENTIONAL TWO OPERAND NORMALIZATION

Two operand normalization implies left shifting both operands in such a way that at least one of them has a one (or a zero if the
Manuscript received April 15, 1995; revised April 15, 1996. This work was supported in part by the Ministry of Education and Science (CICYT) of Spain under Contract TIC96-1125. E. Antelo, M. Bóo, and J. D. Bruguera are with the Department of Electrónica e Computación, University of Santiago de Compostela, Santiago de Compostela, Spain. E. L. Zapata is with the Department of Arquitectura de Computadores, University of Málaga, Málaga, Spain. Publisher Item Identifier S 1063-8210(98)01314-6.

Fig. 1. Conventional two operand normalization.

numbers are negative two's complement) in the most significant bit position. For a low-complexity implementation, the left shift can be done over several cycles, making a single left shift in each cycle; it is then only necessary to check the most significant bit of both operands in each cycle. This scheme has very low hardware complexity, but it requires O(n) cycles for normalization (where n is the wordlength of the operands), and the number of cycles is variable (data-dependent processing), leading to more complex control of the whole system. In this paper we consider a more efficient architecture that produces the result in one cycle.

Fig. 1 shows a fast architecture for the two operand normalization, in one cycle, of two floating point operands with 24-bit significands and 8 bits for the common exponent (this scheme is used, for example, in floating point multiplication of complex numbers [7]). Although this is a particular implementation, it is easy to generalize to any number of bits. Two leading zero detector (LZD) circuits [6] encode the position of the first nonzero bit from the most significant side of each operand. Then, by means of a comparator, the smaller of the two encoded values is determined. The position of the first nonzero bit is exactly the shift that has to be performed to normalize at least one of the operands. The encoded position has the binary expression S = d4 d3 d2 d1 d0. This

1063-8210/98$10.00 © 1998 IEEE


Fig. 2. New algorithm for two operand normalization.

number is used to left shift both operands, and it is also subtracted from the exponent. The binary value of S has log2(p) bits in the general case, where p is defined as p = 2^⌈log2(n)⌉ for n-bit operands. Observe that p is the smallest integer power of two greater than or equal to n, and therefore p is always even. The left shift is usually performed by means of a barrel shifter, to normalize the operands in one cycle [9]. The barrel shifter that we consider in this paper has to allow hardwired left shifts by p/2, p/4, ..., 1 in different stages (multiplexers). In Fig. 1, we have n = 24 and p = 32.

After the normalization process, a rounding of the operands must be carried out for floating point arithmetic. Moreover, hardware for overflow/underflow detection in floating point arithmetic is needed (not shown in the figure). As shown in Fig. 1, if the significands are expressed in two's complement, a module is necessary to perform the one's complement of the operands when the sign bit is one, so that the position of the first zero is encoded instead. This module is not necessary with a sign-and-magnitude representation of the significands, as in the IEEE standard for floating point arithmetic. The scheme shown in Fig. 1 can easily be adapted to deal with integer values, both unsigned (eliminating the module for the one's complement) and signed two's complement. In these cases the hardware for the exponent and for rounding is eliminated.

III. A NEW ALGORITHM FOR TWO OPERAND NORMALIZATION

In this section we describe the new algorithm for two operand normalization. We present a scheme that replaces the two LZD circuits and the comparator present in the conventional architecture. For the sake of simplicity we develop the algorithm for the two operand normalization of unsigned integers, but it can be generalized to two's complement signed integers and floating point numbers, simply by replacing the two LZD circuits and the comparator in Fig. 1 with the new scheme.
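As a behavioral reference, the conventional scheme of Fig. 1 can be sketched in software. The following is a hypothetical Python model (the function names and bit-level style are ours, not the paper's gate-level design) for unsigned n-bit integers:

```python
def lzd(x, n):
    """Software model of one leading zero detector (LZD) circuit [6]:
    count of leading zeros of an n-bit unsigned value (n for x == 0)."""
    count = 0
    for i in range(n - 1, -1, -1):
        if (x >> i) & 1:
            return count
        count += 1
    return count

def normalize_conventional(a, b, n):
    """Conventional two operand normalization: two LZDs, a comparator
    (min of the two counts), and an equal left shift of both operands."""
    shift = min(lzd(a, n), lzd(b, n))
    mask = (1 << n) - 1          # keep results within n bits
    return (a << shift) & mask, (b << shift) & mask, shift
```

For example, with n = 8, the operands 0b00010110 and 0b00000011 have 3 and 6 leading zeros respectively, so both are shifted left by 3 and the first operand becomes normalized.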
The basic operation of our algorithm is to encode the position of the first nonzero bit from the most significant side of both

operands by performing an OR operation over the two operands in a tree network. The encoded position is produced msb (most significant bit) first, so the shifting operation begins as soon as the most significant bit of the encoded position is available; there is therefore an overlap between the shift computation and the shifting operation. The algorithm has a binary tree structure in which a new di (beginning with the msb of S) is obtained at each decision of the tree.

Fig. 2 shows the binary tree that represents our algorithm for two operand normalization. In this figure we present the algorithm for 16-bit operands, but the structure of the algorithm can easily be extended to any number of bits (not necessarily a power of two). Note that for a non-power-of-two wordlength the tree is not symmetrical. We also show in this figure an example of the normalization of two operands; note the path followed in the tree to find the suitable shift. For each decision in the tree we show the result of the OR operation over certain bits and the resulting value of di.

The encoding of the position of the first nonzero bit is as follows. First we check the p/2 most significant bits of both operands (an OR operation over 8 bits for p = 16). If all these bits are zero, then the position of the first nonzero bit in either operand is beyond the p/2 most significant bits, and d3 = 1 (that is, a left shift by p/2 has to be done). Otherwise d3 = 0, since the position of the first nonzero bit is within the p/2 most significant bits. To determine the value of d2, p/4 bits of each operand have to be checked; the position of these p/4 bits is a function of d3. If d3 = 0, the p/4 msbs have to be inspected; otherwise, the p/4 bits following the p/2 msbs have to be checked. The remaining bits of the encoded position are determined by a similar procedure until d0 is obtained. In the example shown in Fig. 2 we have drawn a square over the bits inspected in each case.
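The decision procedure above can be modeled in software. The following is a hypothetical Python sketch (ours, not the gate network described later), assuming p-bit operands with p a power of two so the tree is symmetrical:

```python
def msb_first_shift(a, b, p):
    """Encode the normalization shift S most significant bit first.

    At each tree level, a window of `chunk` bits of the OR of the two
    operands is inspected; an all-zero window sets the current shift
    bit to 1 and moves the window `chunk` bits to the right.
    """
    x = a | b            # bitwise OR of the two operands
    shift = 0            # accumulates the bits d_i, msb first
    pos = 0              # offset of the inspected window from the msb
    chunk = p // 2
    while chunk >= 1:
        window = (x >> (p - pos - chunk)) & ((1 << chunk) - 1)
        if window == 0:
            shift += chunk   # d_i = 1: first nonzero bit is further right
            pos += chunk
        chunk //= 2
    return shift
```

For two all-zero operands the loop returns p - 1, the maximum shift the barrel shifter stages (p/2, p/4, ..., 1) can produce. In the hardware, each decision also drives a shifter stage immediately, so the shift computation and the shifting overlap.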
The algorithm we propose can be mapped efficiently onto a combinational network. The normalization architecture based on the new algorithm is depicted in Fig. 3 for n = 16 (p = 16). We distinguish two parts, one corresponding to the shift calculation and another to the


TABLE I
COMPARISON BETWEEN THE CONVENTIONAL AND THE NEW SCHEME

Fig. 3. Hardware implementation of the new algorithm.

barrel shifters that perform the shift. For the calculation of the shift we implement the OR of all possible groups of bits shown in Fig. 2, and the correct path in the tree is selected by means of multiplexers controlled by the value obtained for di at each step. To reduce the hardware complexity of the algorithm, bit d0 is computed by checking the most significant bit of both operands after the log2(p) - 1 initial shifts in the barrel shifter. In the general case, the circuit is made up of a tree of OR gates with log2(p) levels and log2(p) - 2 levels of multiplexers.

IV. COMPARISON

In this section we compare our design with the conventional design described in Section II. We have made the comparison based on two different approaches. On one hand, we have performed a VLSI implementation of both schemes in a 0.7 um CMOS double-metal technology. On the other hand, we have evaluated the architectures with a model independent of the technology; in this model the hardware complexity is evaluated by counting the gates needed for the implementation, and the delay is computed in gate levels. For these comparisons we have considered the two operand normalization of unsigned integer numbers with n = 32.

We now give the components of the critical path of both designs. We have introduced buffers in the lines that control the multiplexers of the barrel shifters because of their large fanout. The critical path in the conventional design is

  T_conv = t_lzd + t_comp + t_buffer + t_b-sh    (1)

where t_lzd is the delay of an LZD circuit, t_comp is the delay of a comparator, t_buffer is the delay of a buffer, and t_b-sh is the delay of a barrel shifter. For our architecture we have

  T_new = t_l-or + t_l-mux + t_buffer + t_mux + t_or + t_buffer + t_mux    (2)

where t_l-or is the delay of log2(p) OR-gate levels, t_l-mux is the delay of log2(p) - 2 levels of 2-to-1 multiplexers, t_mux is the delay of a 2-to-1 multiplexer, and t_or is the delay of a two-input OR gate. Note that t_b-sh (the delay of the barrel shifter) does not appear in the critical path because the shifting overlaps with the computation of the di bits.

Table I shows the area and delay obtained for the VLSI implementations of the conventional and new designs. We have performed a standard-cell implementation (ES2 ECPD07 [2]) with the DFWII (Cadence) design tool. As we can see, both schemes have similar speed, but the area devoted to the shift computation in the conventional design is 2.91 times the area devoted to the same function in our design, and the total area, including the barrel shifters, has been reduced by a factor of 1.17 with respect to the conventional approach.

For the technology-independent evaluation we have assumed that an x-input gate (NAND, NOR) has a delay of x/2 gate delays and a complexity of x/2 gates [8]. For inverters we have considered 0.5 gate delays and 0.5 gates of complexity. For buffers we have assumed a delay of 1 + m/8 gate delays, where m is the number of cells driven by the buffer, and a complexity of 5 gates. These are reasonable assumptions for many technologies, although there are many other alternatives when considering a particular technology. Table I shows the result of this evaluation. The numbers of gate levels are similar, but the gate count for the shift computation is reduced by a factor of 2.8 in favor of our design.

Although one of the comparisons is technology dependent and the other one is very rough, the results obtained indicate that the new design reduces the area requirements (or gate count) while maintaining the same speed as the conventional one. Observe the significant saving in area (or gates) in the shift-computation part.

V. APPLICATIONS

In this section we briefly discuss two interesting applications of the scheme we propose, although many others are possible.

1) Floating Point Multiplication of Complex Numbers: In digital signal processing (DSP), many signals are efficiently represented as complex numbers. To increase the dynamic range of


these signals, a floating point representation is used, leading to floating point operations in digital signal processors. In [7], a floating point multiplier for complex numbers is presented. The floating point representation of the complex numbers consists of two significands with their own sign bits and one common exponent for both the real and imaginary parts. This representation proves to be efficient from the point of view of memory requirements, and the processing is more efficient since only one normalization and rounding process is necessary at the end. After the multiplication, both the real and imaginary parts of the result can be denormalized, due to cancellation in the additions, and a normalization and rounding process is needed. Since the real and imaginary parts have a common exponent, the numbers are assumed to be normalized to the greater of the two, so a two operand normalization is necessary. The scheme we propose in this work can be used efficiently in this structure, reducing the hardware requirements for normalization while maintaining the speed of the circuit. Since the complex number multiplier is a key hardware element in DSP, the proposed circuit for two operand normalization can be widely used in this field. Our scheme for normalization can also be used in floating point addition/subtraction of complex numbers.

2) Angle Calculation with CORDIC: CORDIC is an iterative algorithm to perform plane rotations and to evaluate trigonometric functions [4], [5]. The basic iteration is based on shift and addition operations. Many CORDIC-based algorithms have been proposed for signal processing, image processing, matrix algebra, and robotics [4]. In [5], it has been shown that if the input vector is not normalized, there can be large errors in the computation of the inverse tangent function (vectoring mode). This situation occurs in the evaluation of the SVD of a matrix based on fixed-point CORDIC arithmetic [5].
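The requirement that both coordinates be shifted by the same amount can be checked numerically: an equal left shift scales the vector without changing its angle, so vectoring-mode CORDIC computes the same inverse tangent. A hypothetical Python check (illustrative only; `normalize_pair` is our name, and the bit-serial loop models the low-complexity multi-cycle scheme of Section II rather than the one-cycle circuit):

```python
import math

def normalize_pair(x, y, n):
    """Left shift both n-bit magnitudes equally until one is normalized
    (msb set), or both are zero."""
    mask = (1 << n) - 1
    shift = 0
    while x | y and not ((x | y) >> (n - 1)) & 1:
        x, y, shift = (x << 1) & mask, (y << 1) & mask, shift + 1
    return x, y, shift

x, y = 0b0000010110, 0b0000000111      # 10-bit fixed-point magnitudes
xn, yn, s = normalize_pair(x, y, 10)
# the equal shift preserves the ratio y/x, hence the angle atan2(y, x)
assert math.isclose(math.atan2(y, x), math.atan2(yn, xn))
```

Shifting each coordinate by its own leading-zero count would instead change the ratio and corrupt the computed angle, which is why a two operand (equal-shift) normalization circuit is needed inside the processor.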
For the computation of the SVD, fixed-point format can be used without loss of accuracy. In this way the overhead of floating-point processing is avoided. However, as the processing is in fixed-point format, the processors that evaluate the inverse tangent function have to incorporate, internally, a two operand normalization circuit to avoid errors.

VI. CONCLUSIONS

We have presented a novel design for two operand normalization. The new architecture is based on computing the shift from the most significant bit to the least significant one. This technique allows significant savings in area because it is not necessary to detect the number of leading zeros of the two operands and then perform a comparison, as in the conventional design. The main advantage of our architecture is its low area cost compared to the conventional one, while maintaining time performance. The proposed circuit can be used efficiently in floating point operations over complex numbers and in CORDIC processors, although other applications are possible.

ACKNOWLEDGMENT

The authors thank the referees for their valuable comments.

REFERENCES
[1] D. Chevion, E. D. Karnin, and E. Walach, "High efficiency, multiplication free approximation of arithmetic coding," in Proc. Data Compression Conf., 1991, pp. 43-52.

[2] European Silicon Structures, ES2 ECPD07 Library Databook, July 1993.
[3] D. Goldberg, "What every computer scientist should know about floating-point arithmetic," ACM Computing Surveys, vol. 23, no. 1, pp. 5-47, Mar. 1991.
[4] Y. H. Hu, "CORDIC-based VLSI architecture for digital signal processing," IEEE Signal Processing Mag., pp. 16-35, July 1992.
[5] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special purpose processors," IEEE Trans. Comput., vol. 42, no. 7, pp. 769-779, July 1993.
[6] V. G. Oklobdzija, "An algorithmic and novel design of a leading zero detector circuit: Comparison with logic synthesis," IEEE Trans. VLSI Syst., vol. 2, pp. 124-128, Mar. 1994.
[7] V. G. Oklobdzija, D. Villeger, and T. Soulas, "An integrated multiplier for complex numbers," J. VLSI Signal Processing, no. 7, pp. 213-222, 1994.
[8] H. R. Srinivas and K. Parhi, "A fast radix-4 division algorithm and its architecture," IEEE Trans. Comput., vol. 44, no. 6, pp. 826-831.
[9] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993.

Dynamic Fault Dictionaries and Two-Stage Fault Isolation


Paul G. Ryan and W. Kent Fuchs

Abstract: This paper presents dynamic two-stage fault isolation for sequential random-logic very large scale integrated (VLSI) circuits, and introduces limited and dynamic fault dictionaries. In the first stage of the dynamic process, a limited fault dictionary identifies candidate faults, which are further distinguished in the second stage by a dictionary generated dynamically for the candidate faults and a subset of the test vectors. This provides high resolution but avoids the costs of full static dictionaries. Two-stage fault isolation is evaluated on benchmark circuits and on defects in industrial circuits.

Index Terms: CAD, diagnosis, dynamic fault dictionaries, fault isolation, testing.

I. INTRODUCTION

Fault dictionaries have effectively located real defects in industrial circuits [1], [2], but their use for diagnosis of large, sequential, random-logic very large scale integrated (VLSI) circuits has been limited by size and cost-related problems. A full dictionary [1]-[3] contains, for each test vector, a record of the errors a circuit's modeled faults caused in simulation. Matching algorithms isolate defects by comparing errors seen on a tester to those recorded in the dictionary, identifying the most likely modeled faults [4]. Typically, this is followed by a physical examination to verify and analyze the defect. Integrated computer-aided design (CAD) tools have made it possible to investigate a set of most likely defects.
Manuscript received September 15, 1995; revised September 15, 1997. This work was supported in part by the Semiconductor Research Corporation (SRC) under Grant 93-DP-109, by the Joint Services Electronics Program (JSEP) under Grant N00014-90-J-1270, and by the Intel Corporation. This paper was presented in part at the IEEE Asian Test Symposium, 1993. P. G. Ryan is with Intel Corporation SC9-09, Santa Clara, CA 95052 USA. W. K. Fuchs is with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA. Publisher Item Identifier S 1063-8210(98)01316-X.

