
Finite Word Length Effects

(Number representation in registers)

Sugumar D

FINITE Word Length Effect

Digital signal processors have a data bus of finite width.
If the word length of the result of a mathematical operation exceeds
the bus width, the excess bits will have to be omitted.
This is a source of serious errors.
We now discuss the attributes that cause such errors.
First we will discuss
number representation and
quantization error.

1. Number Representation in Registers

The Binary Number System

In conventional digital computers, integers are represented as binary
numbers of fixed length n:
an ordered sequence of binary digits x_{n-1} x_{n-2} ... x_1 x_0,
where each digit x_i (bit) is 0 or 1.
The above sequence represents the integer value X = Σ x_i · 2^i.
Upper-case letters represent numerical values or sequences of digits.
Lower-case letters, usually indexed, represent individual digits.

Radix of a Number System


The weight of the digit x_i is the i-th power of 2: 2^i.
2 is the radix of the binary number system.
Binary numbers are radix-2 numbers; allowed digits are 0, 1.
Decimal numbers are radix-10 numbers; allowed digits are 0, 1, 2, ..., 9.
The radix is indicated in a subscript as a decimal number.
Example:
(101)_10 - decimal value 101
(101)_2  - decimal value 5

Range of Representations
Operands and results are stored in registers of fixed length n, so
there is a finite number of distinct values that can be represented
within an arithmetic unit.
Xmin, Xmax: smallest and largest representable values.
[Xmin, Xmax]: range of the representable numbers.
A result larger than Xmax or smaller than Xmin is incorrectly
represented.
The arithmetic unit should indicate that the generated result is in
error: an overflow indication.

Signed-magnitude Representation
Uses the high-order bit to indicate the sign:
0 for positive,
1 for negative.
The remaining low-order bits indicate the magnitude of the value.

Signed magnitude representation of +41 and -41:

+41 = 0 0101001   (sign 0; magnitude 32 + 8 + 1 = 41)
-41 = 1 0101001   (sign 1; magnitude 32 + 8 + 1 = 41)
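The encoding above can be reproduced with a small Python helper (a sketch; the function name and the 8-bit width are illustrative choices):

```python
def sign_magnitude(x, n=8):
    """Encode integer x as an n-bit sign-magnitude string: sign bit + magnitude."""
    sign = '1' if x < 0 else '0'
    magnitude = abs(x)
    assert magnitude < 2 ** (n - 1), "magnitude needs more than n-1 bits"
    return sign + format(magnitude, f'0{n - 1}b')

print(sign_magnitude(+41))  # 00101001
print(sign_magnitude(-41))  # 10101001
```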

Disadvantage of the Signed-Magnitude Representation
An operation may depend on the signs of the operands.
Example: adding a positive number X and a negative number -Y:
X + (-Y)
If Y > X, the final result is -(Y-X).
The calculation must
switch the order of the operands,
perform subtraction rather than addition,
and attach the minus sign.
A sequence of decisions must be made, costing excess control logic and
execution time.
This is avoided in the complement representation methods.

Complement Representations of
Negative Numbers
Two alternatives -

Radix complement (called two's complement in the binary system)
Diminished-radix complement (called one's complement in the binary system)

In both complement methods, positive numbers are represented as in the
signed-magnitude method.

Advantage of Complement Representation

No decisions are made before executing addition or subtraction.
No need to interchange the order of the two operands.

Ones Complement
Ones complement replaced signed magnitude because the signed-magnitude
circuitry was too complicated.
Negative numbers are represented in ones-complement form by
complementing each bit; even the sign bit is reversed.

+41 = 0 0 1 0 1 0 0 1
-41 = 1 1 0 1 0 1 1 0   (each 1 is replaced with a 0, each 0 with a 1)

Twos Complement
The twos-complement form of a negative integer is created by adding one
to the ones-complement representation.

+41              = 0 0 1 0 1 0 0 1
ones complement  = 1 1 0 1 0 1 1 0
add one:     -41 = 1 1 0 1 0 1 1 1

Twos-complement representation has a single (positive) value for zero.
The sign is represented by the most significant bit.
The notation for positive integers is identical to their
signed-magnitude representations.
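Both complement forms of the running -41 example can be generated with Python's integer bit operations; a minimal sketch (helper names are illustrative):

```python
def ones_complement(x, n=8):
    """n-bit ones-complement pattern: negatives flip every bit of |x|."""
    if x >= 0:
        return format(x, f'0{n}b')
    return format((2 ** n - 1) - abs(x), f'0{n}b')  # all-ones minus magnitude

def twos_complement(x, n=8):
    """n-bit twos-complement pattern: ones complement plus one."""
    return format(x & (2 ** n - 1), f'0{n}b')

print(ones_complement(-41))  # 11010110
print(twos_complement(-41))  # 11010111
```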

The Twos Complement Representation

Representation of Mixed Numbers

A sequence of n digits in a register does not necessarily represent an
integer.
It can represent a mixed number with a fractional part and an integral
part.
The n digits are partitioned into two: k in the integral part and m in
the fractional part (k+m=n).
The value of an n-tuple with a radix point between the k most
significant digits and the m least significant digits is

X = Σ x_i · 2^i   for i = -m, ..., k-1

Fractional Binary Numbers

b_i b_{i-1} ... b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} ... b_{-j}

Bits to the left of the binary point have weights 2^i, 2^{i-1}, ..., 4, 2, 1.
Bits to the right of the binary point represent fractional powers of 2:
1/2, 1/4, 1/8, ..., 2^{-j}.

Representation: the string represents the rational number

X = Σ b_k · 2^k   for k = -j, ..., i

Fractional Binary Number Examples

Value    Representation
5 3/4    101.11_2
2 7/8    10.111_2
63/64    0.111111_2

Observations:
Divide by 2 by shifting right.
Numbers of the form 0.111111..._2 are just below 1.0; use the notation
1.0 - ε.

Limitation

Can only exactly represent numbers of the form x / 2^k.
Other numbers have repeating bit representations:

Value   Representation
1/3     0.0101010101[01]..._2
1/5     0.001100110011[0011]..._2
1/10    0.0001100110011[0011]..._2
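The repeating expansions in this table come from long division in base 2; a short sketch (the helper name is illustrative):

```python
def binary_fraction(num, den, bits=16):
    """First `bits` base-2 digits of the fraction num/den (0 <= num < den)."""
    digits = []
    for _ in range(bits):
        num *= 2                       # shift left one binary place
        digits.append('1' if num >= den else '0')
        if num >= den:
            num -= den                 # keep the remainder
    return '0.' + ''.join(digits)

print(binary_fraction(1, 3))   # 0.0101010101010101
print(binary_fraction(1, 10))  # 0.0001100110011001
```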

Fixed Point Representations

The radix point is not stored in the register; it is understood to be
in a fixed position between the k most significant digits and the m
least significant digits.
These are called fixed-point representations.
One bit is used for the sign and the remaining bits for the magnitude.
Clearly there is a restriction on the numbers which can be represented.
With 7 bits reserved for the magnitude, the largest and smallest
numbers represented are +127 and -127:

+127 = 0 1111111   (sign bit 0: positive number)
-127 = 1 1111111   (sign bit 1: negative number)

Fixed Point Representations


Things to note:
1. Fixed-point numbers are represented exactly.
2. Arithmetic between fixed-point numbers is also exact provided the
answer is within range.
3. Division is also exact if interpreted as producing an integer and
discarding any remainder.
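Points 1-3 can be checked with plain integer arithmetic. The sketch below uses an 8-bit twos-complement wrap (a slightly different range, [-128, 127], than the slide's sign-magnitude +/-127) to show that results are exact inside the range and wrong outside it:

```python
def wrap8(x):
    """Reduce x to the 8-bit twos-complement range [-128, 127]."""
    return ((x + 128) % 256) - 128

print(wrap8(100 + 27))   # 127  -> in range, exact
print(wrap8(100 + 28))   # -128 -> overflow: the result wraps around
print(17 // 5)           # 3    -> integer division discards the remainder
```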

Floating Point
A floating point representation consists of:
a sign bit s,
an exponent e,
a mantissa (fraction) M or F.

In floating point representation, numbers are represented by a sign
bit s, an integer exponent e, and a positive integer mantissa M:

X = (-1)^s · M · B^e   (equivalently, (-1)^s · F · B^e)

Layout: S (1 bit) | Exponent e (3 bits) | Fraction M or F (4 bits)

e: the exact exponent.
B: the base, usually 2 or 16.
E: the bias, a fixed integer, machine dependent.
If the mantissa is assumed to be 1.xxxxx (thus, one bit of the mantissa
is implied as 1), this is called a normalized representation.

8-bit floating point format (2)

sign    exponent   significand   number         number
1 bit   3 bits     4 bits        base 2         base 10
0       001        1001          1.001 x 2^1      2.25
0       011        1100          1.1   x 2^3     12.0
0       111        1110          1.11  x 2^7    224.0
0       001        1110          1.11  x 2^-1     0.875
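The first three rows can be decoded with a few lines of Python: read the 4-bit significand field as 1.fff and the 3-bit exponent field as an unsigned integer (an assumed coding; the 2^-1 row implies a different exponent interpretation, so it is left out of this sketch):

```python
def decode8(exp, sig):
    """Toy 8-bit float with sign bit 0: value = (sig read as 1.fff) * 2^exp."""
    e = int(exp, 2)           # exponent field as an unsigned integer
    m = int(sig, 2) / 8       # '1100' -> 12/8 = 1.5, i.e. 1.100 in base 2
    return m * 2 ** e

print(decode8('001', '1001'))  # 2.25
print(decode8('011', '1100'))  # 12.0
print(decode8('111', '1110'))  # 224.0
```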

Distribution of Floating Point Numbers

3-bit mantissa 1.xx, 2-bit exponent in {-1, 0, 1}:

e = -1:  1.00 x 2^-1 = 1/2   1.01 x 2^-1 = 5/8   1.10 x 2^-1 = 3/4   1.11 x 2^-1 = 7/8
e =  0:  1.00 x 2^0  = 1     1.01 x 2^0  = 5/4   1.10 x 2^0  = 3/2   1.11 x 2^0  = 7/4
e =  1:  1.00 x 2^1  = 2     1.01 x 2^1  = 5/2   1.10 x 2^1  = 3     1.11 x 2^1  = 7/2
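The grid above can be enumerated exactly with Python's fractions module; the sketch shows that the representable values are densest near zero, with the spacing doubling at each exponent step:

```python
from fractions import Fraction

# Toy system from the slide: significand 1.xx (two fraction bits),
# exponent in {-1, 0, 1}.
values = sorted(Fraction(4 + f, 4) * Fraction(2) ** e
                for e in (-1, 0, 1) for f in range(4))
print([float(v) for v in values])
# Spacing is 1/8 between the e=-1 values, 1/4 for e=0, and 1/2 for e=1.
```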

Father of the Floating Point Standard

IEEE Standard 754 for Binary Floating-Point Arithmetic.
Prof. William Kahan, 1989 ACM Turing Award winner!

www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

IEEE Floating Point

Defined by IEEE Std 754-1985.
Developed in response to divergence of representations
(established in 1985 as a uniform standard for floating point
arithmetic; before that, many idiosyncratic formats)
and portability issues for scientific code.
Supported by all major CPUs; now almost universally adopted.

Two representations:
Single precision (32-bit)
Double precision (64-bit)

Driven by numerical concerns:
Nice standards for rounding, overflow, underflow.
Hard to make go fast:
numerical analysts predominated over hardware types in defining the
standard.

IEEE 754 Floating-Point Format

Layout: S | Exponent | Fraction
single: exponent 8 bits, fraction 23 bits
double: exponent 11 bits, fraction 52 bits

x = (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)

Single precision: bit 31 = sign, bits 30..23 = biased exponent,
bits 22..0 = normalized mantissa (hidden leading bit = 1):

x = (-1)^s x F x 2^(E - 127)

S: sign bit (0 means non-negative, 1 means negative).
Normalized significand: 1.0 <= |significand| < 2.0.
It always has a leading pre-binary-point 1 bit, so there is no need to
represent it explicitly (hidden bit); the significand is the Fraction
with the leading "1." restored.
Exponent: excess representation: stored value = actual exponent + Bias.
This ensures the stored exponent is unsigned.
Single: Bias = 127; Double: Bias = 1023.
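The field layout can be verified directly by reinterpreting a float's bits with Python's struct module (a sketch; -6.5 = -1.625 x 2^2 is an arbitrary test value):

```python
import struct

def ieee754_fields(x):
    """Return (sign, biased exponent, fraction) of a single-precision float."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

s, e, f = ieee754_fields(-6.5)
print(s, e, f)   # 1 129 5242880  (biased exponent 129 = 2 + 127)
# Reassemble with x = (-1)^s * (1 + F/2^23) * 2^(E - 127):
print((-1) ** s * (1 + f / 2 ** 23) * 2 ** (e - 127))   # -6.5
```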

2. Quantization Error in Number Representation

Quantization
1. Fixed-point: truncation
To truncate a fixed-point number from (β+1) bits to (b+1) bits, we just
discard the least significant (β-b) bits. The truncation error is
denoted by

ε_t = Q(X) - X

Here Q(X) is the truncated version of the number X. For a positive X,
the error is equal to zero if all bits being discarded are zeros and is
largest if all discarded bits are ones:

-(2^-b - 2^-β) ≤ ε_t ≤ 0

Quantization
For a negative X, the truncation error will be different for three
different formats:

1) Sign-magnitude:    0 ≤ ε_t ≤ (2^-b - 2^-β)
2) Ones-complement:   0 ≤ ε_t ≤ (2^-b - 2^-β)
3) Twos-complement:   -(2^-b - 2^-β) ≤ ε_t ≤ 0
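The sign of the error in each format follows from how truncation rounds: discarding low bits of a twos-complement pattern rounds toward minus infinity, while truncating a sign-magnitude (or ones-complement) pattern shrinks the magnitude, i.e. rounds toward zero. A sketch with b = 3 fractional bits (Python floats stand in for registers, and the 2^-β term is taken as negligible):

```python
import math

def trunc_2c(x, b):
    """Twos-complement truncation to b fractional bits: round toward -infinity."""
    return math.floor(x * 2 ** b) / 2 ** b

def trunc_sm(x, b):
    """Sign-magnitude truncation: the magnitude is truncated (round toward zero)."""
    return math.trunc(x * 2 ** b) / 2 ** b

x = -0.40625                     # -0.01101 in base 2
print(trunc_2c(x, 3) - x)        # -0.09375 : negative error, in (-2^-3, 0]
print(trunc_sm(x, 3) - x)        #  0.03125 : positive error for negative x
```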

Quantization
2. Fixed-point: rounding
In the case of rounding, the number is quantized to the nearest
quantization level. The rounding error does not depend on the format
used to represent negative numbers:

-(1/2)(2^-b - 2^-β) ≤ ε_r ≤ (1/2)(2^-b - 2^-β)

In practice, β >> b; therefore, 2^-β ≈ 0 in all the expressions
considered.
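Rounding to the nearest level is symmetric, so the error bound is half a quantization step regardless of the sign format; a one-function sketch:

```python
def round_fix(x, b):
    """Round x to the nearest multiple of 2^-b."""
    return round(x * 2 ** b) / 2 ** b

x = 0.40625                       # 0.01101 in base 2
print(round_fix(x, 3))            # 0.375
print(abs(round_fix(x, 3) - x))   # 0.03125 <= 2^-4, i.e. half a step of 2^-3
```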

Quantization Noise
Quantization mechanisms (fixed point):
Rounding: the error is uniform on (-2^-b/2, 2^-b/2], with probability
density 2^b.
Truncation, 2s complement (all positive numbers): the error lies in
(-2^-b, 0], with density 2^b.
Truncation, sign magnitude / 1s complement: the error lies in
(-2^-b, 2^-b), with density 2^b/2.
(input-output staircase plots and error probability densities omitted)

Quantization Noise
Quantization mechanisms (floating point):
Rounding: the relative error is uniform on (-2^-b, 2^-b], with density
2^b/2.
Truncation, sign magnitude / 1s complement: the relative error lies in
(-2·2^-b, 2·2^-b), with density 2^b/4.
Truncation, 2s complement (all positive numbers): the relative error
lies in (-2·2^-b, 0], with density 2^b/2.
(input-output plots and error probability densities omitted)

Quantization
3. Floating-point
Consider a floating-point representation of a number X:

X = 2^E · M,   Q(X) = 2^E · Q(M)

Quantization is carried out on the mantissa only in the case of
floating-point numbers. Therefore, it is more reasonable to consider
the relative error:

ε = (Q(X) - X) / X = (Q(M) - M) / M

In practice, a rounding quantizer can be modeled as follows:

Q(X) = 2^-B · round(X · 2^B)
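The rounding-quantizer model is one line of Python; the sketch below applies it to a sample value and reports the relative error:

```python
def quantize(x, B):
    """Rounding quantizer model: Q(X) = 2^-B * round(X * 2^B)."""
    return 2.0 ** -B * round(x * 2 ** B)

x = 0.3
q = quantize(x, 8)               # 77/256 = 0.30078125
print(q, abs(q - x) / abs(x))    # relative error on the order of 2^-8
```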
