Sunday, 21 October 2012

Floating Point

Floating-point
      ·        A method of representing real numbers, including those with a fractional part, in a way that can support a wide range of values.
      ·        Can represent very large or very small positive and negative numbers (±1.23 x 10^88 ~ ±1.23 x 10^-88), as well as zero.


    ·        Floating-point representation:
Ø Sign (S) : 0 = positive, 1 = negative.
Ø Exponent (E) : sets the scale of the value, so the number can be very large or very small.
Ø Fraction (F) : the digits after the binary point.
Ø Value of a floating-point number = (-1)^S × val(F) × 2^val(E)

          ·        IEEE-754 standardizes the computer representation of binary floating-point numbers.

  i.            The single-precision format uses 32 bits :
(1-bit) Sign
(8-bit) Exponent
(23-bit) Fraction / significand

 ii.            The double-precision format uses 64 bits :
(1-bit) Sign
(11-bit) Exponent
(52-bit) Fraction / significand
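
To make the field layout concrete, here is a minimal Python sketch that splits a number into its sign, exponent and fraction fields. The helper name `float_fields` is made up for this illustration; it relies only on the standard `struct` module and assumes big-endian packing so the sign bit comes first.

```python
import struct

def float_fields(x, double=False):
    """Return (sign, biased_exponent, fraction) of x as integers."""
    if double:
        # 64 bits: 1 sign + 11 exponent + 52 fraction
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        exp_bits, frac_bits = 11, 52
    else:
        # 32 bits: 1 sign + 8 exponent + 23 fraction
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        exp_bits, frac_bits = 8, 23
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

print(float_fields(1.0))                # (0, 127, 0)
print(float_fields(-2.0, double=True))  # (1, 1024, 0)
```

The exponent printed here is the stored (biased) exponent, not the actual one.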

  • Single & Double Precision Range:

                                 Single Precision                        Double Precision
Smallest positive (normalized)   2^-126 or 1.175 x 10^-38                2^-1022 or 2.225 x 10^-308
Largest positive                 (2 - 2^-23) x 2^127 or 3.403 x 10^38    (2 - 2^-52) x 2^1023 or 1.798 x 10^308
Actual exponent                  -126 ~ +127                             -1022 ~ +1023
Decimal precision                about 6 to 7 significant digits         about 15 to 16 significant digits
Bias                             127                                     1023
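
The range figures can be checked with a few lines of Python. This is only a quick sanity check, assuming Python's `float` is an IEEE-754 double (true on common platforms); the variable names are chosen for this sketch.

```python
import sys

# Single precision: smallest normalized and largest finite value.
single_min = 2.0 ** -126
single_max = (2 - 2.0 ** -23) * 2.0 ** 127

# Double precision: the same formulas with the wider fields.
double_min = 2.0 ** -1022
double_max = (2 - 2.0 ** -52) * 2.0 ** 1023

print(f"{single_min:.3e}  {single_max:.3e}")  # 1.175e-38  3.403e+38
print(f"{double_min:.3e}  {double_max:.3e}")  # 2.225e-308  1.798e+308

# Python reports the limits of its own double type for comparison.
print(double_min == sys.float_info.min)  # True
print(double_max == sys.float_info.max)  # True
```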

Conversion from decimal value to binary :
Example 1 : Convert 61.25 from decimal value to binary.
       1.     Integer part : 61_10 = (2^6 - 1) - 2^1
                                       11 1111
                                    -       10
                                       11 1101

       2.     Fractional part :

a.     Multiply the fraction by 2. The digit to the left of the point in the result (0 or 1) is the next binary digit, reading from left to right.
b.     If that digit is 0, multiply the result by 2 again as usual.
c.     If that digit is 1, drop the 1 and continue multiplying the remaining fraction by 2.
d.     Stop when the remaining fraction is 0.
0.25 x 2 = 0.50   (digit 0)
0.50 x 2 = 1.00   (digit 1)
The binary value for the fraction part = .01

       3.     Answer : 61.25_10 = 11 1101.01_2

Example 2 : Convert 18.625 from decimal value to binary.

       1.     Integer part : 18_10 = 2^4 + 2^1
                                        1 0000
                                    +       10
                                        1 0010

       2.     Fractional part :
0.625 x 2 = 1.25   (digit 1)
0.25 x 2  = 0.50   (digit 0)
0.50 x 2  = 1.00   (digit 1)

The binary value for the fraction part = .101

       3.     Answer : 18.625_10 = 1 0010.101_2
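
The multiply-by-2 procedure from the two examples can be sketched in Python as follows. The function name `to_binary` and the `max_frac_bits` cut-off are assumptions made for this illustration; `Fraction` is used so that the decimal input is held exactly.

```python
from fractions import Fraction

def to_binary(text, max_frac_bits=32):
    """Convert a non-negative decimal string such as '61.25' to a binary string."""
    value = Fraction(text)
    integer = int(value)        # integer part
    frac = value - integer      # fractional part, 0 <= frac < 1
    digits = []
    # Multiply by 2; the digit left of the point is the next binary digit.
    while frac != 0 and len(digits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac -= bit             # drop the 1 (if any) and carry on
        digits.append(str(bit))
    result = bin(integer)[2:]
    if digits:
        result += "." + "".join(digits)
    return result

print(to_binary("61.25"))   # 111101.01
print(to_binary("18.625"))  # 10010.101
```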

However, an infinite binary fraction results when the fractional part keeps repeating and a remainder of 0 is never reached.

Example : Convert 16.1 from decimal to binary value.

1.     Integer part : 16_10 = 1 0000_2
2.     Fractional part :

0.10 x 2 = 0.20   (digit 0)
0.20 x 2 = 0.40   (digit 0)
0.40 x 2 = 0.80   (digit 0)
0.80 x 2 = 1.60   (digit 1)
0.60 x 2 = 1.20   (digit 1)
0.20 x 2 = 0.40   (digit 0)   Repetition occurs!!!

3.     Answer : 16.1_10 = 1 0000.0001 1001 1001..._2 (the digits 0011 repeat forever)
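
The repetition can be detected automatically by remembering every remainder already seen. The sketch below is one possible way to do it; the function name `fraction_bits` is hypothetical, and it looks at the fractional part of the input only.

```python
from fractions import Fraction

def fraction_bits(text):
    """Return (non_repeating_bits, repeating_bits) for the fractional part of text."""
    frac = Fraction(text) % 1
    seen = {}   # remainder -> index of the bit produced from it
    bits = []
    while frac != 0 and frac not in seen:
        seen[frac] = len(bits)
        frac *= 2
        bit = int(frac)
        frac -= bit
        bits.append(str(bit))
    if frac == 0:
        return "".join(bits), ""        # the fraction terminates
    start = seen[frac]                  # the cycle begins here
    return "".join(bits[:start]), "".join(bits[start:])

print(fraction_bits("16.1"))    # ('0', '0011')
print(fraction_bits("18.625"))  # ('101', '')
```

For 16.1 the result reads as .0 followed by 0011 repeated forever, so the value cannot be stored exactly in any finite number of fraction bits.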

Conversion from binary to decimal value :

1.     List the powers of two from right to left, one for each digit given. 2^0 belongs to the digit just left of the binary point, and the negative powers go to its right.
2.     Place each binary digit under its power of two.
3.     Multiply each binary digit by the power of two above it and add up all the products.

Example  :

        Zero and positive exponents                 Negative exponents
        2^6   2^5   2^4   2^3   2^2   2^1   2^0     2^-1   2^-2
a)                                          0       1      1
b)      1     0     1     0     1     1     1       1

a)     0.11_2 = 0 x 2^0 + 1 x 2^-1 + 1 x 2^-2
              = 0.75_10

b)     101 0111.1_2 = 1 x 2^6 + 0 x 2^5 + 1 x 2^4 + 0 x 2^3 + 1 x 2^2 + 1 x 2^1 + 1 x 2^0 + 1 x 2^-1
                    = 87.5_10
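
The three steps can be written as a short Python function. This is only an illustrative sketch (the name `binary_to_decimal` is made up here) and it assumes a non-negative binary string with an optional binary point.

```python
def binary_to_decimal(text):
    """Convert a binary string such as '101 0111.1' to a decimal value."""
    text = text.replace(" ", "")
    integer_part, _, frac_part = text.partition(".")
    total = 0.0
    # Digits left of the point carry the powers 2^0, 2^1, ... from right to left.
    for power, digit in enumerate(reversed(integer_part)):
        total += int(digit) * 2 ** power
    # Digits right of the point carry 2^-1, 2^-2, ... from left to right.
    for position, digit in enumerate(frac_part, start=1):
        total += int(digit) * 2 ** -position
    return total

print(binary_to_decimal("0.11"))        # 0.75
print(binary_to_decimal("101 0111.1"))  # 87.5
```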

Conversion from decimal and binary value to floating-point representation:

If given a decimal value, follow all of the steps below.
If given a binary value, skip the first step.

  1. Change the decimal value to binary.
  2. Normalize the binary value: move the binary point until exactly one digit, a 1, is to its left (never a 0). The number of places moved gives the exponent.
  3. Identify the sign.
  4. The biased exponent = exponent + bias.
  5. Change the biased exponent to binary. If it has fewer digits than the exponent field requires, add 0s to the left.
  6. For the fraction, take the digits after the binary point of the normalized value. If there are fewer digits than the fraction field requires, add 0s to the right.


Example : -0.3125

                       Single precision               Double precision
1. Binary              -0.0101_2                      -0.0101_2
2. Normalized          -1.01 x 2^-2                   -1.01 x 2^-2
3. Sign                1                              1
4. Biased exponent     -2 + 127 = 125                 -2 + 1023 = 1021
5. Exponent            111 1101_2                     11 1111 1101_2
                       = 0111 1101_2                  = 011 1111 1101_2
6. Fraction            01000... (total 23 bits)       01000... (total 52 bits)

Answer:

Single precision :
Sign     Exponent         Fraction
1        0111 1101        0100 000... (total 23 bits)

Double precision :
Sign     Exponent         Fraction
1        011 1111 1101    0100 000... (total 52 bits)
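
The bit patterns for -0.3125 can be checked against what the machine actually stores. The sketch below uses the standard `struct` module; the helper name `bit_fields` and its parameters are assumptions made for this example.

```python
import struct

def bit_fields(x, float_fmt, int_fmt, exp_bits, frac_bits):
    """Return the (sign, exponent, fraction) fields of x as bit strings."""
    (bits,) = struct.unpack(int_fmt, struct.pack(float_fmt, x))
    text = format(bits, "0{}b".format(1 + exp_bits + frac_bits))
    return text[0], text[1:1 + exp_bits], text[1 + exp_bits:]

single = bit_fields(-0.3125, ">f", ">I", 8, 23)
double = bit_fields(-0.3125, ">d", ">Q", 11, 52)

# Only the first 8 fraction bits are shown; the rest are zeros.
print(single[0], single[1], single[2][:8])  # 1 01111101 01000000
print(double[0], double[1], double[2][:8])  # 1 01111111101 01000000
```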


Conversion from floating-point representation to binary and decimal value :

In general, this is the reverse of converting a decimal/binary value to floating-point.

Example 1 : Given the following single-precision representation,

Sign     Exponent      Fraction
1        1000 0101     0011 1001 000...

    a.   Identify the sign bit. A value of 1 indicates a negative number.
    b.   Convert the biased exponent to decimal. 1000 0101_2 = 133_10
    c.   Actual exponent = biased exponent - bias = 133 - 127 = 6, so the scale factor is 2^6.
    d.   For the fraction, write "1." followed by the fraction bits.
    Fraction value = 1.0011 1001 000...
         ·        The leading 1 is the hidden bit left out of the stored fraction during normalization.
         ·        We can drop all the extra 0s on the right.
    e.   Answer in binary = -(1.0011 1001) x 2^6 = -(100 1110.01_2)
    f.   Answer in decimal = -(1 x 2^6 + 1 x 2^3 + 1 x 2^2 + 1 x 2^1 + 1 x 2^-2) = -78.25_10

Example 2 : Given the following double-precision representation,

Sign     Exponent         Fraction
0        100 0000 0111    0010 0001 000...

§  It's a positive value.
§  100 0000 0111_2 = 1 x 2^10 + 1 x 2^2 + 1 x 2^1 + 1 x 2^0 = 1031_10
§  Actual exponent = 1031 - 1023 = 8, so the scale factor is 2^8.
§  Fraction value = 1.0010 0001
§  Answer in binary = 1.0010 0001 x 2^8 = 1 0010 0001_2
§  Answer in decimal = 1 x 2^8 + 1 x 2^5 + 1 x 2^0 = 289_10
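
Both decoding examples follow the formula (-1)^S × 1.F × 2^(E - bias), which can be sketched in Python as below. The function name `decode` is hypothetical, and the sketch assumes a normalized number, so zero, denormals, infinity and NaN are not handled.

```python
def decode(sign, exponent_bits, fraction_bits, bias):
    """Decode a normalized value from its sign and two bit-string fields."""
    exponent = int(exponent_bits.replace(" ", ""), 2) - bias
    significand = 1.0  # the hidden leading 1
    for position, bit in enumerate(fraction_bits.replace(" ", ""), start=1):
        significand += int(bit) * 2.0 ** -position
    return (-1) ** sign * significand * 2.0 ** exponent

print(decode(1, "1000 0101", "0011 1001", 127))       # -78.25
print(decode(0, "100 0000 0111", "0010 0001", 1023))  # 289.0
```

Trailing zeros of the fraction can be left out of the argument because they add nothing to the sum.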

Addition & Subtraction for binary value.

      1.     Express both operands with the same exponent. Shift the significand of the smaller number to the right until its exponent matches the larger exponent.
      2.     Add / subtract the significands according to the sign bits. The exponent stays the same.
      3.     Normalize the result.
      4.     Round the significand if necessary and renormalize if rounding generates a carry.
      5.     Check for overflow / underflow (single precision requires -126 <= exponent <= 127). If the exponent is outside this range, raise an exception; otherwise the result is done.
  §  Overflow   = a result that is too large in magnitude to be represented.
  §  Underflow = a non-zero result that is too small in magnitude to be represented as a normalized number.

For binary values, example : 1.1111 x 2^-2 + 1.0101 x 2^-3 (and the same operands subtracted)

Step 1 (both cases) : 1.0101 x 2^-3 = 0.10101 x 2^-2

            Addition                         Subtraction
Step 2         1.11110                          1.11110
            +  0.10101                       -  0.10101
              10.10011                          1.01001
            Result = 10.10011 x 2^-2         Result = 1.01001 x 2^-2
Step 3      Normalize = 1.010011 x 2^-1      Normalize = 1.01001 x 2^-2
Step 4      Round off if necessary           Round off if necessary
Answer      1.010011 x 2^-1                  1.01001 x 2^-2
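
The align, add and normalize steps can be sketched with plain integers in Python. The names `to_fixed` and `add_binary` are invented for this illustration, and the sketch assumes positive operands, a positive result, and that the first operand has the larger exponent. Rounding and the overflow / underflow check are left out.

```python
def to_fixed(sig_text):
    """Split '1.0101' into (integer holding all the bits, number of fraction bits)."""
    whole, _, frac = sig_text.partition(".")
    return int(whole + frac, 2), len(frac)

def add_binary(sig_a, exp_a, sig_b, exp_b, subtract=False):
    """Compute a x 2^exp_a +/- b x 2^exp_b, assuming exp_a >= exp_b and a result > 0."""
    a, frac_a = to_fixed(sig_a)
    b, frac_b = to_fixed(sig_b)
    # Step 1: shifting b right by (exp_a - exp_b) places is the same as
    # giving it that many extra fraction bits.
    frac_b += exp_a - exp_b
    frac = max(frac_a, frac_b)
    a <<= frac - frac_a
    b <<= frac - frac_b
    # Step 2: add or subtract the significands; the exponent stays exp_a.
    total = a - b if subtract else a + b
    # Step 3: normalize so that exactly one bit sits left of the binary point.
    exponent = exp_a + (total.bit_length() - 1 - frac)
    digits = format(total, "b")
    return digits[0] + "." + (digits[1:].rstrip("0") or "0"), exponent

print(add_binary("1.1111", -2, "1.0101", -3))                 # ('1.010011', -1)
print(add_binary("1.1111", -2, "1.0101", -3, subtract=True))  # ('1.01001', -2)
```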


For decimal values, example : 9.95 x 10^2 + 0.85 x 10^1 (and the same operands subtracted)

   i.     (both cases) 0.85 x 10^1 = 0.085 x 10^2

            Addition                       Subtraction
  ii.          9.950                          9.950
            +  0.085                       -  0.085
              10.035                          9.865
            Result = 10.035 x 10^2         Result = 9.865 x 10^2
 iii.       Normalize = 1.0035 x 10^3      Normalize = 9.865 x 10^2
  iv.       Round off if necessary         Round off if necessary
   v.       Answer = 1.0035 x 10^3         Answer = 9.865 x 10^2
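
The decimal version of the same steps can be tried out with Python's `decimal` module, which keeps the digits exact. The function name `add_sci` is hypothetical, and mantissas are passed as strings so that no binary rounding creeps in. Rounding to a fixed number of digits is not included.

```python
from decimal import Decimal

def add_sci(m_a, e_a, m_b, e_b, subtract=False):
    """Compute m_a x 10^e_a +/- m_b x 10^e_b and return (mantissa, exponent)."""
    exponent = max(e_a, e_b)
    # Step i: rewrite both operands with the larger exponent.
    a = Decimal(m_a).scaleb(e_a - exponent)
    b = Decimal(m_b).scaleb(e_b - exponent)
    # Step ii: add or subtract the mantissas.
    mantissa = a - b if subtract else a + b
    # Step iii: normalize to one non-zero digit before the decimal point.
    while abs(mantissa) >= 10:
        mantissa = mantissa.scaleb(-1)
        exponent += 1
    while mantissa != 0 and abs(mantissa) < 1:
        mantissa = mantissa.scaleb(1)
        exponent -= 1
    return mantissa, exponent

print(add_sci("9.95", 2, "0.85", 1))                 # (Decimal('1.0035'), 3)
print(add_sci("9.95", 2, "0.85", 1, subtract=True))  # (Decimal('9.865'), 2)
```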

Multiplication of decimal value
  1. Add both exponents to find the new exponent.
  2. Multiply both mantissas together and attach the new exponent to the result.
  3. Normalize the result.
  4. Round it if necessary.
Example : (1.805 x 10^8) x (6.22 x 10^-4)

      a.     New exponent = 8 + (-4) = 4
      b.     1.805 x 6.22 = 11.2271, so the result is 11.2271 x 10^4
      c.     Normalize = 1.12271 x 10^5
      d.     Round off if necessary = 1.12271 x 10^5
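
The multiplication steps translate almost line for line into Python. As before, this is only a sketch: `multiply_sci` is a made-up name, mantissas are given as strings for exactness, and the rounding step is omitted.

```python
from decimal import Decimal

def multiply_sci(m_a, e_a, m_b, e_b):
    """Compute (m_a x 10^e_a) x (m_b x 10^e_b) and return (mantissa, exponent)."""
    exponent = e_a + e_b                    # step 1: add the exponents
    mantissa = Decimal(m_a) * Decimal(m_b)  # step 2: multiply the mantissas
    # Step 3: normalize to one non-zero digit before the decimal point.
    while abs(mantissa) >= 10:
        mantissa = mantissa.scaleb(-1)
        exponent += 1
    while mantissa != 0 and abs(mantissa) < 1:
        mantissa = mantissa.scaleb(1)
        exponent -= 1
    return mantissa, exponent

mantissa, exponent = multiply_sci("1.805", 8, "6.22", -4)
print(mantissa == Decimal("1.12271"), exponent)  # True 5
```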





