Floating-point
· A method of representing real numbers, including fractional values, in a way that supports a wide range of values.
· Can represent very large or very small positive and negative numbers (for example, $\pm 1.23 \times 10^{88}$ to $\pm 1.23 \times 10^{-88}$), as well as zero.
· Floating-point representation:
  - Sign (S): 0 = positive, 1 = negative.
  - Exponent (E): sets the range of values, which can be very large or very small.
  - Fraction (F): the digits after the binary point.
  - Value of a floating-point number = $(-1)^S \times \mathrm{val}(F) \times 2^{\mathrm{val}(E)}$
· IEEE-754 standardizes the computer representation of binary floating-point numbers.
i. The single-precision format uses 32 bits:
Sign (1 bit) | Exponent (8 bits) | Fraction/significand (23 bits)
ii. The double-precision format uses 64 bits:
Sign (1 bit) | Exponent (11 bits) | Fraction/significand (52 bits)
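To see the two layouts on a real machine, here is a minimal Python sketch that splits a number into its three bit fields. The function name `float_fields` is made up for this illustration, and the sketch assumes the platform's `struct` module packs `f` and `d` as IEEE-754 single and double precision, which is the usual case.

```python
import struct

def float_fields(x, double=False):
    """Split x into (sign, biased exponent, fraction) as integers.

    Hypothetical helper for illustration only.
    """
    if double:
        # Reinterpret the 8 bytes of a double as a 64-bit unsigned integer.
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        exp_bits, frac_bits = 11, 52
    else:
        # Reinterpret the 4 bytes of a single as a 32-bit unsigned integer.
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        exp_bits, frac_bits = 8, 23
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

print(float_fields(1.0))
print(float_fields(-2.0, double=True))
```

The exponent returned here is the stored (biased) value, not the actual exponent.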
- Single and double precision ranges:
Property | Single precision | Double precision
Smallest positive (normalized) | $2^{-126}$ or $1.175 \times 10^{-38}$ | $2^{-1022}$ or $2.225 \times 10^{-308}$
Largest positive | $(2 - 2^{-23}) \times 2^{127}$ or $3.403 \times 10^{38}$ | $(2 - 2^{-52}) \times 2^{1023}$ or $1.798 \times 10^{308}$
Actual exponent | -126 to +127 | -1022 to +1023
Decimal precision | about 6 to 7 significant digits | about 15 to 16 significant digits
Bias | 127 | 1023
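The limits in the table can be checked with a few lines of Python. This is only a sketch: it assumes Python floats are IEEE-754 doubles (true on common platforms), and the variable names are chosen just for this example.

```python
import sys

# Double-precision limits as reported by the interpreter.
smallest_double = sys.float_info.min   # smallest positive normalized double
largest_double = sys.float_info.max    # largest finite double

# Single-precision limits computed from the formulas in the table.
smallest_single = 2.0 ** -126
largest_single = (2 - 2.0 ** -23) * 2.0 ** 127

for name, value in [
    ("smallest single", smallest_single),
    ("largest single", largest_single),
    ("smallest double", smallest_double),
    ("largest double", largest_double),
]:
    # Print with 4 significant digits, as in the table.
    print(f"{name}: {value:.3e}")
```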
Conversion from decimal value to binary:
Example 1: Convert 61.25 from decimal to binary.
1. Integer part: $61_{10} = (2^6 - 1) - 2^1 = 11\,1111_2 - 10_2 = 11\,1101_2$
2. Fractional part:
a. Multiply the fraction by 2. The digit to the left of the point in the result is the next binary digit. Repeat until the fractional part is 0.
b. If that digit is 0, multiply the result by 2 as usual.
c. If that digit is 1, drop the 1 and continue multiplying the remaining fraction by 2 until the fractional part is 0.
0.25 x 2 = 0.50
0.50 x 2 = 1.00
The binary value of the fractional part = .01
3. Answer: $61.25_{10} = 11\,1101.01_2$
Example 2: Convert 18.625 from decimal to binary.
Step 1. Integer part: $18_{10} = 2^4 + 2^1 = 1\,0000_2 + 10_2 = 1\,0010_2$
Step 2. Fractional part:
0.625 x 2 = 1.25
0.25 x 2 = 0.50
0.50 x 2 = 1.00
The binary value of the fractional part = .101
Step 3. Answer: $18.625_{10} = 1\,0010.101_2$
However, an infinite binary fraction results when the fractional parts start to repeat, so the product never reaches 0.
Example: Convert 16.1 from decimal to binary.
1. Integer part: $16_{10} = 1\,0000_2$
2. Fractional part:
0.10 x 2 = 0.20
0.20 x 2 = 0.40
0.40 x 2 = 0.80
0.80 x 2 = 1.60
0.60 x 2 = 1.20
0.20 x 2 = 0.40 Repetition occurs!
3. Answer: $16.1_{10} = 1\,0000.0\,0011\,0011\ldots_2$, with the digits 0011 repeating forever.
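The multiply-by-2 procedure from the examples above can be sketched in Python as follows. The function name `to_binary` and its `max_fraction_bits` parameter are assumptions made for this illustration. It uses `fractions.Fraction` so the decimal input is held exactly, and it stops as soon as a fractional part shows up a second time.

```python
from fractions import Fraction

def to_binary(value, max_fraction_bits=32):
    """Convert a non-negative decimal string to binary.

    Returns (binary_string, exact). exact is False when the
    fraction repeats or the bit limit is reached.
    Hypothetical helper for illustration only.
    """
    number = Fraction(value)
    integer_part = int(number)
    fraction = number - integer_part
    digits = []
    seen = set()
    while fraction != 0 and len(digits) < max_fraction_bits:
        if fraction in seen:
            break  # same fraction again: the digits repeat from here on
        seen.add(fraction)
        fraction *= 2
        bit = int(fraction)      # the digit left of the point
        digits.append(str(bit))
        fraction -= bit          # drop the 1 (or 0) and continue
    text = bin(integer_part)[2:]
    if digits:
        text += "." + "".join(digits)
    return text, fraction == 0

print(to_binary("61.25"))
print(to_binary("18.625"))
print(to_binary("16.1"))
```

For 16.1 the function returns the digits produced up to the point where the repetition is detected and reports that the result is not exact.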
Conversion from binary to decimal value:
1. List the powers of two from right to left, one for each digit given.
2. Place each binary digit under its power of two.
3. Multiply each binary digit by its power of two and add up all the products.
Example (positive powers sit to the left of the binary point, negative powers to the right):
Power of two | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ | $2^{-1}$ | $2^{-2}$
a) |  |  |  |  |  |  | 0 | 1 | 1
b) | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
a) $0.11_2 = 0 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} = 0.75_{10}$
b) $101\,0111.1_2 = 1 \times 2^6 + 0 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} = 87.5_{10}$
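The three steps above amount to a weighted sum, which can be sketched in Python like this. The function name `binary_to_decimal` is made up for the example, and it only handles non-negative inputs written with the digits 0 and 1 and an optional point.

```python
def binary_to_decimal(text):
    """Sum digit x power-of-two for a binary string such as '101 0111.1'.

    Hypothetical helper for illustration only.
    """
    text = text.replace(" ", "")
    integer_digits, _, fraction_digits = text.partition(".")
    total = 0.0
    # Integer digits: powers 2^0, 2^1, ... counted from the right.
    for position, digit in enumerate(reversed(integer_digits)):
        total += int(digit) * 2 ** position
    # Fraction digits: powers 2^-1, 2^-2, ... counted from the left.
    for position, digit in enumerate(fraction_digits, start=1):
        total += int(digit) * 2 ** -position
    return total

print(binary_to_decimal("0.11"))
print(binary_to_decimal("101 0111.1"))
```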
Conversion from decimal or binary value to floating-point representation:
If given a decimal value, follow all of the steps below. If given a binary value, skip the first step.
- Convert the decimal value to binary.
- Normalize the binary value so that exactly one digit, a 1 (never 0), is to the left of the binary point.
- Identify the sign.
- Compute the biased exponent = actual exponent + bias.
- Convert the biased exponent to binary. If it has fewer digits than the exponent field, pad with 0s on the left.
- For the fraction, take the digits after the binary point of the normalized value (the leading 1 is not stored). If there are fewer digits than the fraction field, pad with 0s on the right.
Example: -0.3125
Step | Single precision | Double precision
1. Binary | $-0.0101_2$ | $-0.0101_2$
2. Normalized | $-1.01 \times 2^{-2}$ | $-1.01 \times 2^{-2}$
3. Sign | 1 | 1
4. Biased exponent | -2 + 127 = 125 | -2 + 1023 = 1021
5. Exponent | $111\,1101_2$ = 0111 1101 (padded to 8 bits) | $11\,1111\,1101_2$ = 011 1111 1101 (padded to 11 bits)
6. Fraction | 01000… (total 23 bits) | 01000… (total 52 bits)
Answer:
Single precision: 1 | 0111 1101 | 0100 000… (total 23 bits)
Double precision: 1 | 011 1111 1101 | 0100 000… (total 52 bits)
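The manual steps can be followed in code as well. Below is a rough Python sketch. The function name `encode` and its parameters are invented for this example. It only covers nonzero values in the normalized range, and it truncates extra fraction bits instead of rounding them as real hardware would.

```python
from fractions import Fraction

def encode(value, exp_bits=8, frac_bits=23):
    """Return (sign, exponent field, fraction field) as bit strings.

    Hypothetical helper: no zero, subnormal, infinity or rounding support.
    """
    number = Fraction(value)
    if number == 0:
        raise ValueError("zero has no normalized form")
    sign = "1" if number < 0 else "0"
    magnitude = abs(number)
    # Normalize so that 1 <= magnitude < 2, tracking the exponent.
    exponent = 0
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    # Biased exponent, padded with 0s on the left.
    bias = 2 ** (exp_bits - 1) - 1
    exponent_field = format(exponent + bias, f"0{exp_bits}b")
    # Fraction bits after the hidden 1, padded with 0s on the right.
    fraction = magnitude - 1
    fraction_field = ""
    for _ in range(frac_bits):
        fraction *= 2
        bit = int(fraction)
        fraction_field += str(bit)
        fraction -= bit
    return sign, exponent_field, fraction_field

print(encode("-0.3125"))                             # single precision
print(encode("-0.3125", exp_bits=11, frac_bits=52))  # double precision
```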
Conversion from floating-point representation to binary and decimal value:
In general, reverse the steps used to convert a decimal or binary value to floating-point.
Example 1: Given the following single-precision representation,
1 | 1000 0101 | 0011 1001 000…
a. Identify the sign bit. A value of 1 indicates a negative number.
b. Convert the biased exponent to decimal: $1000\,0101_2 = 133_{10}$
c. Actual exponent = biased exponent - bias = 133 - 127 = 6, so the scale factor is $2^6$.
d. For the fraction, write 1, then the binary point, then the fraction bits.
Fraction value = 1.0011 1001 000…
· The 1 is the hidden bit that was dropped during normalization.
· The trailing 0s on the right can be dropped.
e. Answer in binary = $-(1.0011\,1001_2) \times 2^6 = -(100\,1110.01_2)$
f. Answer in decimal = $-(1 \times 2^6 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^{-2}) = -78.25_{10}$
Example 2: Given the following double-precision representation,
0 | 100 0000 0111 | 0010 0001 000…
- It is a positive value.
- $100\,0000\,0111_2 = 1 \times 2^{10} + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 1031_{10}$
- Actual exponent = 1031 - 1023 = 8, so the scale factor is $2^8$.
- Fraction value = 1.0010 0001
- Answer in binary = $1.0010\,0001_2 \times 2^8 = 1\,0010\,0001_2$
- Answer in decimal = $1 \times 2^8 + 1 \times 2^5 + 1 \times 2^0 = 289_{10}$
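Both examples follow the same recipe, which can be sketched in Python as below. The function name `decode` is made up for this illustration. It works out the bias from the width of the exponent field and only handles normalized numbers, so an all-0s or all-1s exponent field is not treated specially.

```python
def decode(sign, exponent_field, fraction_field):
    """Turn sign, exponent and fraction bit strings into a value.

    Hypothetical helper: normalized numbers only. Trailing 0s of
    the fraction may be left out.
    """
    # Bias is 127 for an 8-bit exponent and 1023 for an 11-bit one.
    bias = 2 ** (len(exponent_field) - 1) - 1
    exponent = int(exponent_field, 2) - bias
    # Start from the hidden 1, then add each fraction bit's weight.
    significand = 1.0
    for position, bit in enumerate(fraction_field, start=1):
        significand += int(bit) * 2.0 ** -position
    return (-1) ** int(sign) * significand * 2.0 ** exponent

print(decode("1", "10000101", "00111001"))       # Example 1
print(decode("0", "10000000111", "00100001"))    # Example 2
```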
Addition & subtraction for binary values:
1. Express both operands with the same exponent. Shift the significand of the smaller number to the right until its exponent matches the larger exponent.
2. Add or subtract the significands according to the sign bits. Keep the common exponent.
3. Normalize the result.
4. Round the significand if necessary, and renormalize if rounding generates a carry.
5. Check for overflow or underflow (for single precision, the exponent must satisfy -126 <= exponent <= 127). If the exponent is out of range, raise an exception; otherwise the operation is done.
- Overflow = a result that is too large in magnitude to be represented.
- Underflow = a nonzero result that is too small in magnitude to be represented as a normalized number.
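For binary values, example: $1.1111 \times 2^{-2} \pm 1.0101 \times 2^{-3}$
Step | Addition | Subtraction
1. Align exponents | $1.0101 \times 2^{-3} = 0.10101 \times 2^{-2}$ | $1.0101 \times 2^{-3} = 0.10101 \times 2^{-2}$
2. Add or subtract | 1.11110 + 0.10101 = 10.10011, so result = $10.10011 \times 2^{-2}$ | 1.11110 - 0.10101 = 1.01001, so result = $1.01001 \times 2^{-2}$
3. Normalize | $1.010011 \times 2^{-1}$ | $1.01001 \times 2^{-2}$
4. Round off if necessary | (not needed here) | (not needed here)
Answer | $1.010011 \times 2^{-1}$ | $1.01001 \times 2^{-2}$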
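The align, add and normalize steps of the binary example can be sketched in Python as follows. The names `parse_binary` and `add_aligned` are invented for this illustration. Significands are kept as exact fractions, so the sketch leaves out rounding and the overflow or underflow check.

```python
from fractions import Fraction

def parse_binary(text):
    """Parse a binary string such as '1.1111' into an exact Fraction."""
    integer_digits, _, fraction_digits = text.partition(".")
    digits = integer_digits + fraction_digits
    return Fraction(int(digits, 2), 2 ** len(fraction_digits))

def add_aligned(sig_a, exp_a, sig_b, exp_b, subtract=False):
    """Hypothetical helper: returns (significand, exponent), normalized."""
    # Step 1: shift the operand with the smaller exponent to the right.
    exponent = max(exp_a, exp_b)
    sig_a = sig_a / 2 ** (exponent - exp_a)
    sig_b = sig_b / 2 ** (exponent - exp_b)
    # Step 2: add or subtract the significands, keep the common exponent.
    total = sig_a - sig_b if subtract else sig_a + sig_b
    # Step 3: normalize so that 1 <= |total| < 2.
    while abs(total) >= 2:
        total /= 2
        exponent += 1
    while total != 0 and abs(total) < 1:
        total *= 2
        exponent -= 1
    return total, exponent

a = parse_binary("1.1111")
b = parse_binary("1.0101")
print(add_aligned(a, -2, b, -3))                  # addition
print(add_aligned(a, -2, b, -3, subtract=True))   # subtraction
```

The results are printed as fractions. For instance, 83/64 is the same value as $1.010011_2$.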
For decimal values, example: $9.95 \times 10^2 \pm 0.85 \times 10^1$
Step | Addition | Subtraction
i. Align exponents | $0.85 \times 10^1 = 0.085 \times 10^2$ | $0.85 \times 10^1 = 0.085 \times 10^2$
ii. Add or subtract | 9.950 + 0.085 = 10.035, so result = $10.035 \times 10^2$ | 9.950 - 0.085 = 9.865, so result = $9.865 \times 10^2$
iii. Normalize | $1.0035 \times 10^3$ | $9.865 \times 10^2$
iv. Round off if necessary | (not needed here) | (not needed here)
v. Answer | $1.0035 \times 10^3$ | $9.865 \times 10^2$
Multiplication of decimal values:
- Add the two exponents to find the new exponent.
- Multiply the two mantissas and attach the new exponent to the product.
- Normalize the result.
- Round it if necessary.
Example: $(1.805 \times 10^8) \times (6.22 \times 10^{-4})$
a. New exponent = 8 + (-4) = 4
b. 1.805 x 6.22 = 11.2271, so result = $11.2271 \times 10^4$
c. Normalize = $1.12271 \times 10^5$
d. Round off if necessary = $1.12271 \times 10^5$
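The multiplication steps can be sketched in Python with the standard `decimal` module, which keeps the decimal digits exact. The function name `multiply` is made up for this example, and the rounding step is left out because the result here needs none.

```python
from decimal import Decimal

def multiply(mantissa_a, exp_a, mantissa_b, exp_b):
    """Hypothetical helper: returns (mantissa, exponent) in base 10."""
    # Add the exponents to get the new exponent.
    exponent = exp_a + exp_b
    # Multiply the mantissas (passed as strings to stay exact).
    mantissa = Decimal(mantissa_a) * Decimal(mantissa_b)
    # Normalize so that 1 <= |mantissa| < 10.
    while abs(mantissa) >= 10:
        mantissa /= 10
        exponent += 1
    while mantissa != 0 and abs(mantissa) < 1:
        mantissa *= 10
        exponent -= 1
    return mantissa, exponent

mantissa, exponent = multiply("1.805", 8, "6.22", -4)
print(mantissa, "x 10 ^", exponent)
```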