
Finite Word Length Effects

(Number representation in registers)

Sugumar D

FINITE Word Length Effect

Digital signal processors have a data bus of finite width.
If the word length of the result of a mathematical operation exceeds
the bus width, the excess bits will have to be omitted.
This is a source of serious errors.
We now discuss the attributes that cause such errors.
First we will discuss
number representation and
quantization error.

1. Number Representation in Registers

The Binary Number System

In conventional digital computers, integers are represented as binary
numbers of fixed length n:
an ordered sequence of binary digits x_{n-1} x_{n-2} ... x_1 x_0,
where each digit x_i (bit) is 0 or 1.
The above sequence represents the integer value X = Σ x_i · 2^i.
Upper-case letters represent numerical values or sequences of digits.
Lower-case letters, usually indexed, represent individual digits.

Radix of a Number System


The weight of the digit x_i is the i-th power of 2: 2^i.
2 is the radix of the binary number system.
Binary numbers are radix-2 numbers; allowed digits are 0, 1.
Decimal numbers are radix-10 numbers; allowed digits are 0, 1, 2, ..., 9.
The radix is indicated in a subscript as a decimal number.
Example:
(101)_10 - decimal value 101
(101)_2  - decimal value 5

Range of Representations
Operands and results are stored in registers of fixed length n, so
there is a finite number of distinct values that can be represented
within an arithmetic unit.
Xmin, Xmax: smallest and largest representable values.
[Xmin, Xmax]: range of the representable numbers.
A result larger than Xmax or smaller than Xmin is incorrectly
represented.
The arithmetic unit should indicate that the generated result is in
error: an overflow indication.

Signed-magnitude Representation
Uses the high-order bit to indicate the sign:
0 for positive,
1 for negative.
The remaining low-order bits indicate the magnitude of the value.

Signed magnitude representation of +41 and -41:

+41 = 0 0101001   (sign 0; magnitude 32 + 8 + 1 = 41)
-41 = 1 0101001   (sign 1; magnitude 32 + 8 + 1 = 41)
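The encoding above can be reproduced with a small Python helper (a sketch; the function name and the 8-bit width are illustrative choices):

```python
def sign_magnitude(x, n=8):
    """Encode integer x as an n-bit sign-magnitude string: sign bit + magnitude."""
    sign = '1' if x < 0 else '0'
    magnitude = abs(x)
    assert magnitude < 2 ** (n - 1), "magnitude needs more than n-1 bits"
    return sign + format(magnitude, f'0{n - 1}b')

print(sign_magnitude(+41))  # 00101001
print(sign_magnitude(-41))  # 10101001
```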

Disadvantage of the Signed-Magnitude Representation
An operation may depend on the signs of the operands.
Example: adding a positive number X and a negative number -Y:
X + (-Y)
If Y > X, the final result is -(Y-X).
The calculation must
switch the order of the operands,
perform subtraction rather than addition,
and attach the minus sign.
A sequence of decisions must be made, costing excess control logic and
execution time.
This is avoided in the complement representation methods.

Complement Representations of
Negative Numbers
Two alternatives -

Radix complement (called two's complement in the binary system)
Diminished-radix complement (called one's complement in the binary system)

In both complement methods, positive numbers are represented as in the
signed-magnitude method.

Advantage of Complement Representation

No decisions are made before executing addition or subtraction.
No need to interchange the order of the two operands.

Ones Complement
Ones complement replaced signed magnitude because the signed-magnitude
circuitry was too complicated.
Negative numbers are represented in ones-complement form by
complementing each bit; even the sign bit is reversed.

+41 = 0 0 1 0 1 0 0 1
-41 = 1 1 0 1 0 1 1 0   (each 1 is replaced with a 0, each 0 with a 1)

Twos Complement
The twos-complement form of a negative integer is created by adding one
to the ones-complement representation.

+41              = 0 0 1 0 1 0 0 1
ones complement  = 1 1 0 1 0 1 1 0
add one:     -41 = 1 1 0 1 0 1 1 1

Twos-complement representation has a single (positive) value for zero.
The sign is represented by the most significant bit.
The notation for positive integers is identical to their
signed-magnitude representations.
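Both complement forms of the running -41 example can be generated with Python's integer bit operations; a minimal sketch (helper names are illustrative):

```python
def ones_complement(x, n=8):
    """n-bit ones-complement pattern: negatives flip every bit of |x|."""
    if x >= 0:
        return format(x, f'0{n}b')
    return format((2 ** n - 1) - abs(x), f'0{n}b')  # all-ones minus magnitude

def twos_complement(x, n=8):
    """n-bit twos-complement pattern: ones complement plus one."""
    return format(x & (2 ** n - 1), f'0{n}b')

print(ones_complement(-41))  # 11010110
print(twos_complement(-41))  # 11010111
```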

The Twos Complement Representation

Representation of Mixed Numbers

A sequence of n digits in a register does not necessarily represent an
integer.
It can represent a mixed number with a fractional part and an integral
part.
The n digits are partitioned into two: k in the integral part and m in
the fractional part (k+m=n).
The value of an n-tuple with a radix point between the k most
significant digits and the m least significant digits is

X = Σ x_i · 2^i   for i = -m, ..., k-1

Fractional Binary Numbers

b_i b_{i-1} ... b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} ... b_{-j}

Bits to the left of the binary point have weights 2^i, 2^{i-1}, ..., 4, 2, 1.
Bits to the right of the binary point represent fractional powers of 2:
1/2, 1/4, 1/8, ..., 2^{-j}.

Representation: the string represents the rational number

X = Σ b_k · 2^k   for k = -j, ..., i

Fractional Binary Number Examples

Value    Representation
5 3/4    101.11_2
2 7/8    10.111_2
63/64    0.111111_2

Observations:
Divide by 2 by shifting right.
Numbers of the form 0.111111..._2 are just below 1.0; use the notation
1.0 - ε.

Limitation

Can only exactly represent numbers of the form x / 2^k.
Other numbers have repeating bit representations:

Value   Representation
1/3     0.0101010101[01]..._2
1/5     0.001100110011[0011]..._2
1/10    0.0001100110011[0011]..._2
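The repeating expansions in this table come from long division in base 2; a short sketch (the helper name is illustrative):

```python
def binary_fraction(num, den, bits=16):
    """First `bits` base-2 digits of the fraction num/den (0 <= num < den)."""
    digits = []
    for _ in range(bits):
        num *= 2                       # shift left one binary place
        digits.append('1' if num >= den else '0')
        if num >= den:
            num -= den                 # keep the remainder
    return '0.' + ''.join(digits)

print(binary_fraction(1, 3))   # 0.0101010101010101
print(binary_fraction(1, 10))  # 0.0001100110011001
```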

Fixed Point Representations

The radix point is not stored in the register; it is understood to be
in a fixed position between the k most significant digits and the m
least significant digits.
These are called fixed-point representations.
One bit is used for the sign and the remaining bits for the magnitude.
Clearly there is a restriction on the numbers which can be represented.
With 7 bits reserved for the magnitude, the largest and smallest
numbers represented are +127 and -127:

+127 = 0 1111111   (sign bit 0: positive number)
-127 = 1 1111111   (sign bit 1: negative number)

Fixed Point Representations


Things to note:
1. Fixed-point numbers are represented exactly.
2. Arithmetic between fixed-point numbers is also exact provided the
answer is within range.
3. Division is also exact if interpreted as producing an integer and
discarding any remainder.
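Points 1-3 can be checked with plain integer arithmetic. The sketch below uses an 8-bit twos-complement wrap (a slightly different range, [-128, 127], than the slide's sign-magnitude +/-127) to show that results are exact inside the range and wrong outside it:

```python
def wrap8(x):
    """Reduce x to the 8-bit twos-complement range [-128, 127]."""
    return ((x + 128) % 256) - 128

print(wrap8(100 + 27))   # 127  -> in range, exact
print(wrap8(100 + 28))   # -128 -> overflow: the result wraps around
print(17 // 5)           # 3    -> integer division discards the remainder
```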

Floating Point
A floating point representation consists of:
a sign bit s,
an exponent e,
a mantissa (fraction) M or F.

In floating point representation, numbers are represented by a sign
bit s, an integer exponent e, and a positive integer mantissa M:

X = (-1)^s · M · B^e   (equivalently, (-1)^s · F · B^e)

Layout: S (1 bit) | Exponent e (3 bits) | Fraction M or F (4 bits)

e: the exact exponent.
B: the base, usually 2 or 16.
E: the bias, a fixed integer, machine dependent.
If the mantissa is assumed to be 1.xxxxx (thus, one bit of the mantissa
is implied as 1), this is called a normalized representation.

8-bit floating point format (2)

sign    exponent   significand   number         number
1 bit   3 bits     4 bits        base 2         base 10
0       001        1001          1.001 x 2^1      2.25
0       011        1100          1.1   x 2^3     12.0
0       111        1110          1.11  x 2^7    224.0
0       001        1110          1.11  x 2^-1     0.875
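The first three rows can be decoded with a few lines of Python: read the 4-bit significand field as 1.fff and the 3-bit exponent field as an unsigned integer (an assumed coding; the 2^-1 row implies a different exponent interpretation, so it is left out of this sketch):

```python
def decode8(exp, sig):
    """Toy 8-bit float with sign bit 0: value = (sig read as 1.fff) * 2^exp."""
    e = int(exp, 2)           # exponent field as an unsigned integer
    m = int(sig, 2) / 8       # '1100' -> 12/8 = 1.5, i.e. 1.100 in base 2
    return m * 2 ** e

print(decode8('001', '1001'))  # 2.25
print(decode8('011', '1100'))  # 12.0
print(decode8('111', '1110'))  # 224.0
```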

Distribution of Floating Point Numbers

3-bit mantissa 1.xx, 2-bit exponent in {-1, 0, 1}:

e = -1:  1.00 x 2^-1 = 1/2   1.01 x 2^-1 = 5/8   1.10 x 2^-1 = 3/4   1.11 x 2^-1 = 7/8
e =  0:  1.00 x 2^0  = 1     1.01 x 2^0  = 5/4   1.10 x 2^0  = 3/2   1.11 x 2^0  = 7/4
e =  1:  1.00 x 2^1  = 2     1.01 x 2^1  = 5/2   1.10 x 2^1  = 3     1.11 x 2^1  = 7/2
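The grid above can be enumerated exactly with Python's fractions module; the sketch shows that the representable values are densest near zero, with the spacing doubling at each exponent step:

```python
from fractions import Fraction

# Toy system from the slide: significand 1.xx (two fraction bits),
# exponent in {-1, 0, 1}.
values = sorted(Fraction(4 + f, 4) * Fraction(2) ** e
                for e in (-1, 0, 1) for f in range(4))
print([float(v) for v in values])
# Spacing is 1/8 between the e=-1 values, 1/4 for e=0, and 1/2 for e=1.
```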

Father of the Floating Point Standard

IEEE Standard 754 for Binary Floating-Point Arithmetic.
Prof. William Kahan, 1989 ACM Turing Award winner!

www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

IEEE Floating Point

Defined by IEEE Std 754-1985.
Developed in response to divergence of representations
(established in 1985 as a uniform standard for floating point
arithmetic; before that, many idiosyncratic formats)
and portability issues for scientific code.
Supported by all major CPUs; now almost universally adopted.

Two representations:
Single precision (32-bit)
Double precision (64-bit)

Driven by numerical concerns:
Nice standards for rounding, overflow, underflow.
Hard to make go fast:
numerical analysts predominated over hardware types in defining the
standard.

IEEE 754 Floating-Point Format

Layout: S | Exponent | Fraction
single: exponent 8 bits, fraction 23 bits
double: exponent 11 bits, fraction 52 bits

x = (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)

Single precision: bit 31 = sign, bits 30..23 = biased exponent,
bits 22..0 = normalized mantissa (hidden leading bit = 1):

x = (-1)^s x F x 2^(E - 127)

S: sign bit (0 means non-negative, 1 means negative).
Normalized significand: 1.0 <= |significand| < 2.0.
It always has a leading pre-binary-point 1 bit, so there is no need to
represent it explicitly (hidden bit); the significand is the Fraction
with the leading "1." restored.
Exponent: excess representation: stored value = actual exponent + Bias.
This ensures the stored exponent is unsigned.
Single: Bias = 127; Double: Bias = 1023.
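The field layout can be verified directly by reinterpreting a float's bits with Python's struct module (a sketch; -6.5 = -1.625 x 2^2 is an arbitrary test value):

```python
import struct

def ieee754_fields(x):
    """Return (sign, biased exponent, fraction) of a single-precision float."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

s, e, f = ieee754_fields(-6.5)
print(s, e, f)   # 1 129 5242880  (biased exponent 129 = 2 + 127)
# Reassemble with x = (-1)^s * (1 + F/2^23) * 2^(E - 127):
print((-1) ** s * (1 + f / 2 ** 23) * 2 ** (e - 127))   # -6.5
```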

2. Quantization Error in Number Representation

Quantization
1. Fixed-point: truncation
To truncate a fixed-point number from (β+1) bits to (b+1) bits, we just
discard the least significant (β-b) bits. The truncation error is
denoted by

ε_t = Q(X) - X

Here Q(X) is the truncated version of the number X. For a positive X,
the error is equal to zero if all bits being discarded are zeros and is
largest if all discarded bits are ones:

-(2^-b - 2^-β) ≤ ε_t ≤ 0

Quantization
For a negative X, the truncation error will be different for three
different formats:

1) Sign-magnitude:    0 ≤ ε_t ≤ (2^-b - 2^-β)
2) Ones-complement:   0 ≤ ε_t ≤ (2^-b - 2^-β)
3) Twos-complement:   -(2^-b - 2^-β) ≤ ε_t ≤ 0
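The sign of the error in each format follows from how truncation rounds: discarding low bits of a twos-complement pattern rounds toward minus infinity, while truncating a sign-magnitude (or ones-complement) pattern shrinks the magnitude, i.e. rounds toward zero. A sketch with b = 3 fractional bits (Python floats stand in for registers, and the 2^-β term is taken as negligible):

```python
import math

def trunc_2c(x, b):
    """Twos-complement truncation to b fractional bits: round toward -infinity."""
    return math.floor(x * 2 ** b) / 2 ** b

def trunc_sm(x, b):
    """Sign-magnitude truncation: the magnitude is truncated (round toward zero)."""
    return math.trunc(x * 2 ** b) / 2 ** b

x = -0.40625                     # -0.01101 in base 2
print(trunc_2c(x, 3) - x)        # -0.09375 : negative error, in (-2^-3, 0]
print(trunc_sm(x, 3) - x)        #  0.03125 : positive error for negative x
```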

Quantization
2. Fixed-point: rounding
In the case of rounding, the number is quantized to the nearest
quantization level. The rounding error does not depend on the format
used to represent negative numbers:

-(1/2)(2^-b - 2^-β) ≤ ε_r ≤ (1/2)(2^-b - 2^-β)

In practice, β >> b; therefore, 2^-β ≈ 0 in all the expressions
considered.
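Rounding to the nearest level is symmetric, so the error bound is half a quantization step regardless of the sign format; a one-function sketch:

```python
def round_fix(x, b):
    """Round x to the nearest multiple of 2^-b."""
    return round(x * 2 ** b) / 2 ** b

x = 0.40625                       # 0.01101 in base 2
print(round_fix(x, 3))            # 0.375
print(abs(round_fix(x, 3) - x))   # 0.03125 <= 2^-4, i.e. half a step of 2^-3
```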

Quantization Noise
Quantization mechanisms (fixed point):
Rounding: the error is uniform on (-2^-b/2, 2^-b/2], with probability
density 2^b.
Truncation, 2s complement (all positive numbers): the error lies in
(-2^-b, 0], with density 2^b.
Truncation, sign magnitude / 1s complement: the error lies in
(-2^-b, 2^-b), with density 2^b/2.
(input-output staircase plots and error probability densities omitted)

Quantization Noise
Quantization mechanisms (floating point):
Rounding: the relative error is uniform on (-2^-b, 2^-b], with density
2^b/2.
Truncation, sign magnitude / 1s complement: the relative error lies in
(-2·2^-b, 2·2^-b), with density 2^b/4.
Truncation, 2s complement (all positive numbers): the relative error
lies in (-2·2^-b, 0], with density 2^b/2.
(input-output plots and error probability densities omitted)

Quantization
3. Floating-point
Consider a floating-point representation of a number X:

X = 2^E · M,   Q(X) = 2^E · Q(M)

Quantization is carried out on the mantissa only in the case of
floating-point numbers. Therefore, it is more reasonable to consider
the relative error:

ε = (Q(X) - X) / X = (Q(M) - M) / M

In practice, a rounding quantizer can be modeled as follows:

Q(X) = 2^-B · round(X · 2^B)
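The rounding-quantizer model is one line of Python; the sketch below applies it to a sample value and reports the relative error:

```python
def quantize(x, B):
    """Rounding quantizer model: Q(X) = 2^-B * round(X * 2^B)."""
    return 2.0 ** -B * round(x * 2 ** B)

x = 0.3
q = quantize(x, 8)               # 77/256 = 0.30078125
print(q, abs(q - x) / abs(x))    # relative error on the order of 2^-8
```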
