Sugumar D
Range of Representations
Operands and results are stored in registers of
fixed length n - finite number of distinct
values that can be represented within an
arithmetic unit
Xmin ; Xmax - smallest and largest
representable values
[Xmin,Xmax] - range of the representable
numbers
A result larger than Xmax or smaller than Xmin
- incorrectly represented
The arithmetic unit should indicate that the
generated result is in error: an overflow indication
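A minimal sketch of the range and the overflow indication, assuming an n-bit unsigned register (the function names `unsigned_range` and `add_with_overflow` are illustrative, not from the slides):

```python
def unsigned_range(n):
    """Return (Xmin, Xmax) for an n-bit unsigned register."""
    return 0, 2**n - 1

def add_with_overflow(a, b, n):
    """Add two n-bit unsigned values; return (stored_result, overflow_flag)."""
    xmin, xmax = unsigned_range(n)
    true_sum = a + b
    overflow = not (xmin <= true_sum <= xmax)  # result falls outside [Xmin, Xmax]
    stored = true_sum % 2**n                   # what an n-bit register would hold
    return stored, overflow

print(unsigned_range(8))               # (0, 255)
print(add_with_overflow(200, 100, 8))  # (44, True)
```

The stored value 44 is the incorrectly represented result; the flag is what tells the user that it cannot be trusted.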
Signed-magnitude Representation
Uses the high-order bit to indicate the sign
0 for positive
1 for negative
Example (8 bits): 41 = 32 + 8 + 1
+41 = 0 0 1 0 1 0 0 1
-41 = 1 0 1 0 1 0 0 1
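The sign-bit convention can be sketched as follows, assuming an 8-bit word by default (`to_sign_magnitude` and `from_sign_magnitude` are hypothetical helper names):

```python
def to_sign_magnitude(value, n=8):
    """Encode an integer as an n-bit signed-magnitude bit string."""
    magnitude = abs(value)
    if magnitude > 2**(n - 1) - 1:
        raise OverflowError("magnitude does not fit in n-1 bits")
    sign = "1" if value < 0 else "0"           # high-order bit carries the sign
    return sign + format(magnitude, f"0{n - 1}b")

def from_sign_magnitude(bits):
    """Decode a signed-magnitude bit string back to an integer."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude

print(to_sign_magnitude(41))    # 00101001
print(to_sign_magnitude(-41))   # 10101001
```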
Complement Representations of
Negative Numbers
Two alternatives -
Radix complement (called two's complement in the
binary system)
Diminished-radix complement (called one's complement
in the binary system)
Ones Complement
Ones complement replaced signed magnitude
because the signed-magnitude circuitry was too complicated.
Negative numbers are represented in ones
complement form by complementing each bit
0 0 1 0 1 0 0 1 (+41)
1 1 0 1 0 1 1 0 (-41)
each 1 is replaced with a 0
each 0 is replaced with a 1
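Bitwise complementing is short enough to sketch directly; the helper name `ones_complement` is an illustrative choice:

```python
def ones_complement(bits):
    """Complement each bit: every 1 becomes 0 and every 0 becomes 1."""
    return "".join("1" if b == "0" else "0" for b in bits)

print(ones_complement("00101001"))  # 11010110
```

Applying the function twice returns the original pattern, so the same operation negates in both directions.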
Twos Complement
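The twos complement form of a negative integer
is obtained by adding 1 to its ones complement form
0 0 1 0 1 0 0 1 (+41)
1 1 0 1 0 1 1 0 (ones complement)
1 1 0 1 0 1 1 0 + 1 = 1 1 0 1 0 1 1 1 (-41, twos complement)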
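The twos complement of a bit pattern can be formed by inverting every bit and adding 1. A small sketch, assuming the word length is the length of the input string (`twos_complement` is a hypothetical name):

```python
def twos_complement(bits):
    """Negate an n-bit pattern: invert all bits, then add 1 (modulo 2**n)."""
    n = len(bits)
    inverted = int(bits, 2) ^ (2**n - 1)      # ones complement
    return format((inverted + 1) % 2**n, f"0{n}b")

print(twos_complement("00101001"))  # 11010111
```

The result agrees with masking a negative Python integer to 8 bits: `format(-41 & 0xFF, "08b")` gives the same pattern.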
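Representation
Bit string: $b_i \, b_{i-1} \cdots b_2 \, b_1 \, b_0 \, . \, b_{-1} \, b_{-2} \, b_{-3} \cdots b_{-j}$
Bits to the right of the binary point represent fractional powers of 2 (1/2, 1/4, 1/8, ..., $2^{-j}$)
Value represented: $\sum_{k=-j}^{i} b_k \cdot 2^k$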
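The weighted sum of bits can be evaluated exactly with rational arithmetic. This is a sketch only; `binary_to_fraction` is an illustrative name and the input is assumed to be a string of 0s and 1s with an optional binary point:

```python
from fractions import Fraction

def binary_to_fraction(s):
    """Evaluate a binary string such as '101.11' as the sum of b_k * 2**k."""
    whole, _, frac = s.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    for k, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2**k)   # bit b_(-k) has weight 2**(-k)
    return value

print(binary_to_fraction("101.11"))    # 23/4
print(binary_to_fraction("10.111"))    # 23/8
print(binary_to_fraction("0.111111"))  # 63/64
```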
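Representation
$101.11_2$ = 5 3/4
$10.111_2$ = 2 7/8
$0.111111_2$ = 63/64
Observation
Numbers of the form $0.111\ldots1_2$ lie just below 1.0
Limitation
Only numbers of the form $x/2^k$ have a finite binary expansion; other values repeat:
1/3 = $0.0101010101[01]\ldots_2$
1/5 = $0.001100110011[0011]\ldots_2$
1/10 = $0.0001100110011[0011]\ldots_2$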
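The repeating expansions can be reproduced by repeated doubling. A minimal sketch, assuming a value in [0, 1) given as an exact fraction (`binary_fraction_bits` is a hypothetical helper):

```python
from fractions import Fraction

def binary_fraction_bits(x, max_bits=16):
    """Return up to max_bits bits after the binary point of 0 <= x < 1."""
    bits = []
    for _ in range(max_bits):
        x *= 2
        bit = 1 if x >= 1 else 0   # the integer part is the next bit
        bits.append(str(bit))
        x -= bit
        if x == 0:                 # expansion terminated exactly
            break
    return "0." + "".join(bits)

print(binary_fraction_bits(Fraction(63, 64)))      # 0.111111
print(binary_fraction_bits(Fraction(1, 3), 10))    # 0.0101010101
print(binary_fraction_bits(Fraction(1, 10), 13))   # 0.0001100110011
```

Only the first call terminates on its own; the other two stop because `max_bits` is reached, which is the limitation in practice.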
+127 = 0 1 1 1 1 1 1 1
-127 = 1 1 1 1 1 1 1 1
The leading 1 is the sign bit (negative number)
Floating Point
A floating point representation consists of
A sign bit s
An exponent e
A mantissa (fraction) M or F
e - exact exponent
B - base, usually 2 or 16
E - bias: a fixed integer, machine dependent
If the mantissa is assumed to be of the form 1.xxxxx, one bit of the
mantissa is implied as 1
This is called a normalized representation
Exponent bits   Mantissa bits   Value
011             1100            1.1 x 2^3 = 12.0
111             1110            1.11 x 2^7 = 224.0
001             1110            1.11 x 2^1 = 3.5
3 bit mantissa, exponent 2 bit {-1,0,1}
e=-1
1.00 X 2^-1 = 1/2
1.01 X 2^-1 = 5/8
1.10 X 2^-1 = 3/4
1.11 X 2^-1 = 7/8
e=0
1.00 X 2^0 = 1
1.01 X 2^0 = 5/4
1.10 X 2^0 = 3/2
1.11 X 2^0 = 7/4
e=1
1.00 X 2^1 = 2
1.01 X 2^1 = 5/2
1.10 X 2^1 = 3
1.11 X 2^1 = 7/2
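The twelve values of this toy format can be enumerated mechanically. A sketch under the assumption that the 3-bit mantissa is 1.xx (one leading 1 plus two fraction bits); `toy_floats` is an illustrative name:

```python
from fractions import Fraction

def toy_floats(fraction_bits=2, exponents=(-1, 0, 1)):
    """List every value 1.xx * 2**e of the toy format, in increasing order."""
    values = []
    for e in exponents:
        for f in range(2**fraction_bits):
            significand = 1 + Fraction(f, 2**fraction_bits)  # 1.00, 1.01, 1.10, 1.11
            values.append(significand * Fraction(2)**e)
    return values

print([str(v) for v in toy_floats()])
# ['1/2', '5/8', '3/4', '7/8', '1', '5/4', '3/2', '7/4', '2', '5/2', '3', '7/2']
```

The spacing between neighbours doubles each time the exponent increases: 1/8 for e=-1, 1/4 for e=0, 1/2 for e=1.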
IEEE Standard 754 for Binary Floating-Point Arithmetic
Prof. Kahan, 1989 ACM Turing Award Winner!
www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
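S | Exponent | Fraction
Exponent - single: 8 bits, double: 11 bits
Fraction - single: 23 bits, double: 52 bits
$x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
single precision
Bit 31: Sign
Bits 30-23: Biased exponent
Bits 22-0: Fraction
$(-1)^s \times F \times 2^{E-127}$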
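The single-precision fields can be inspected with Python's standard `struct` module. This is a sketch for normalized numbers only (biased exponent strictly between 0 and 255); `decode_single` and `value_from_fields` are hypothetical helper names:

```python
import struct

def decode_single(x):
    """Split a number, stored as IEEE 754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23, biased by 127
    fraction = bits & 0x7FFFFF        # bits 22-0
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction):
    """Rebuild the value as (-1)^S * (1 + Fraction) * 2^(Exponent - 127)."""
    return (-1)**sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)

fields = decode_single(12.0)
print(fields)                       # (0, 130, 4194304)
print(value_from_fields(*fields))   # 12.0
```

Here 12.0 = 1.5 x 2^3, so the biased exponent is 3 + 127 = 130 and the fraction field holds 0.5 x 2^23.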
2. Quantization error in number representation
Quantization
1. Fixed-point: truncation
To truncate a fixed-point number from ($\beta$+1) bits to (b+1) bits, we just discard
the least significant ($\beta$-b) bits. The truncation error is denoted by
$\varepsilon_t = Q(X) - X$
Here Q(X) is the truncated version of the number X. For a positive X, the
error is equal to zero if all bits being discarded are zeros and is largest if all
discarded bits are ones.
$-(2^{-b} - 2^{-\beta}) \le \varepsilon_t \le 0$
Quantization
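For a negative X, the truncation error will be different for three different
formats:
1) Sign-Magnitude:
$0 \le \varepsilon_t \le 2^{-b} - 2^{-\beta}$
2) Ones-complement:
$0 \le \varepsilon_t \le 2^{-b} - 2^{-\beta}$
3) Twos-complement:
$-(2^{-b} - 2^{-\beta}) \le \varepsilon_t \le 0$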
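The three cases can be checked by brute force over every negative value. The sketch below assumes X = k / 2^beta with integer k, keeps b fractional bits, and uses illustrative names (`truncate`, `error_range`) and example sizes (beta = 6, b = 3) that are not from the slides:

```python
from fractions import Fraction

def truncate(k, beta, b, fmt):
    """Truncate X = k / 2**beta to b fractional bits in the given format."""
    drop = beta - b
    if fmt == "twos":
        return (k >> drop) << drop           # dropping bits floors the value
    if fmt == "sign-magnitude":
        mag = (abs(k) >> drop) << drop       # dropping bits shrinks the magnitude
        return -mag if k < 0 else mag
    if fmt == "ones":
        if k >= 0:
            return (k >> drop) << drop
        word = (2**beta - 1) - abs(k)        # complement the magnitude bits
        kept = word >> drop                  # discard the low-order bits
        mag = (2**b - 1) - kept              # complement back to a magnitude
        return -(mag << drop)
    raise ValueError(fmt)

def error_range(beta, b, fmt):
    """Smallest and largest Q(X) - X over all negative X."""
    errors = [Fraction(truncate(k, beta, b, fmt) - k, 2**beta)
              for k in range(-(2**beta) + 1, 0)]
    return min(errors), max(errors)

for fmt in ("sign-magnitude", "ones", "twos"):
    print(fmt, error_range(6, 3, fmt))
```

With beta = 6 and b = 3 the bound 2^-b - 2^-beta equals 7/64: the first two formats give errors in [0, 7/64] and twos complement gives errors in [-7/64, 0].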
Quantization
2. Fixed-point: rounding
In case of rounding, the number is quantized to the nearest quantization
level. The rounding error does not depend on the format used to represent
negative numbers:
$-\frac{1}{2}(2^{-b} - 2^{-\beta}) \le \varepsilon_r \le \frac{1}{2}(2^{-b} - 2^{-\beta})$
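Rounding to the nearest level can be swept the same way. This sketch rounds the numeric value itself, so the representation of negative numbers does not enter; the tie rule (ties rounded up) and the names `round_fixed` and `rounding_error_range` are assumptions made for illustration:

```python
from fractions import Fraction

def round_fixed(k, beta, b):
    """Round X = k / 2**beta to b fractional bits (ties rounded up)."""
    drop = beta - b                      # assumes beta > b
    half = 1 << (drop - 1)               # half a quantization step, in units of 2**-beta
    return ((k + half) >> drop) << drop

def rounding_error_range(beta, b):
    """Smallest and largest Q(X) - X over all positive and negative X."""
    errors = [Fraction(round_fixed(k, beta, b) - k, 2**beta)
              for k in range(-(2**beta) + 1, 2**beta)]
    return min(errors), max(errors)

lo, hi = rounding_error_range(beta=6, b=3)
print(lo, hi)   # -3/64 1/16
```

Every error stays within half a quantization step, 2^-b / 2. With this tie rule the upper end reaches 2^-b / 2 exactly, so the symmetric bound quoted on the slide is best read as an approximation that tightens as beta grows.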
Quantization Noise
Quantization mechanisms: (Fixed Point)
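[Figure: output vs. input characteristic and probability vs. error density for each mechanism]
Rounding: error in [$-2^{-b}/2$, $2^{-b}/2$], probability density $2^{b}$
Truncation (2s Complement, All Positive Numbers): error in [$-2^{-b}$, 0], probability density $2^{b}$
Truncation (Sign Magnitude, 1s Complement): error in [$-2^{-b}$, $2^{-b}$], probability density $2^{b}/2$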
Quantization Noise
Quantization mechanisms: (Floating Point)
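[Figure: output vs. input characteristic and probability vs. error density for each mechanism]
Panels: Rounding; Truncation (Sign Magnitude, 1s Complement); Truncation (2s Complement, All Positive Numbers)
Error axis ticks: $\pm 2^{-b}$, $\pm 2 \cdot 2^{-b}$
Probability axis ticks: $2^{b}/2$, $2^{b}/4$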
Quantization
3. Floating-point
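Considering a floating-point representation $X = 2^{E} \cdot M$ of a number,
the quantized value is $Q(X) = 2^{E} \cdot Q(M)$ and the relative error is
$\frac{Q(X) - X}{X} = \frac{Q(M) - M}{M}$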
$Q(X) = 2^{-B} \cdot \mathrm{round}(X \cdot 2^{B})$
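A short sketch of mantissa quantization, assuming the rounding quantizer Q(x) = 2^-B round(x 2^B) is applied to the mantissa only. It uses `math.frexp`, whose mantissa lies in [0.5, 1) rather than the 1.xxxxx form, and the names `quantize_fixed` and `quantize_float` are illustrative:

```python
import math

def quantize_fixed(x, B):
    """Fixed-point rounding quantizer: Q(x) = 2**-B * round(x * 2**B)."""
    return round(x * 2**B) / 2**B

def quantize_float(x, B):
    """Quantize only the mantissa M of x = M * 2**E; the exponent is kept."""
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(quantize_fixed(m, B), e)

x, B = 0.1, 8
m, e = math.frexp(x)
rel_x = (quantize_float(x, B) - x) / x    # relative error of X
rel_m = (quantize_fixed(m, B) - m) / m    # relative error of the mantissa
print(math.isclose(rel_x, rel_m))         # True
print(abs(rel_x) <= 2**-B)                # True
```

Because the factor 2^E cancels, the relative error of X is the relative error of its mantissa, and with a mantissa of at least 0.5 it stays within 2^-B in magnitude.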