OSK ASSIGNMENT : Floating Point

Floating-point

· A method of representing real numbers (numbers with a fractional part) in a way
that supports a wide range of values.

· Can represent very large or very small positive and negative
numbers (for example ±1.23 x 10⁸⁸ or ±1.23 x 10^-88), as well as zero.

· Floating-point representation:

Ø Sign : 0 = positive & 1 = negative.

Ø Exponent : the power of two that scales the number, which lets the value be very large or very small.

Ø Fraction : digits after the binary point.

Ø Value of a floating-point number = (-1)^S × val(F) × 2^val(E)

· IEEE-754 standardizes the computer representation of binary floating-point numbers.

i. The single-precision floating-point format uses 32 bits :

(1-bit)Sign

(8-bit)Exponent

(23-bit)Fraction/ significand

ii. The double-precision floating-point format uses 64 bits :

(1-bit)Sign

(11-bit)Exponent

(52-bit)Fraction/ significand
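The field layout above can be checked on a real machine. The following is a minimal Python sketch, assuming a platform whose `struct` module uses the IEEE-754 formats for `f` and `d` (the usual case); the helper names `float32_fields` and `float64_fields` are made up for this illustration.

```python
import struct

def float32_fields(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, fraction) integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored (biased) exponent
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def float64_fields(x):
    """Split x, stored as a 64-bit float, into (sign, exponent, fraction) integers."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored (biased) exponent
    fraction = bits & ((1 << 52) - 1)     # 52 bits
    return sign, exponent, fraction

print(float32_fields(1.0))
print(float64_fields(-2.0))
```

The exponent field printed here is the stored value, so it still includes the bias.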

Single & Double Precision Range:

	Single Precision	Double Precision
Smallest positive (normalized)	2^-126 or 1.175 x 10^-38	2^-1022 or 2.225 x 10^-308
Largest positive	(2 - 2^-23) x 2¹²⁷ or 3.403 x 10³⁸	(2 - 2^-52) x 2¹⁰²³ or 1.798 x 10³⁰⁸
Actual exponent	-126 ~ +127	-1022 ~ +1023
Decimal precision	about 6 to 7 significant digits	about 15 to 16 significant digits
Bias	127	1023
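The limits in the table can be recomputed from the formulas. This is a small Python sketch, assuming Python floats are IEEE-754 doubles (true on common platforms); `fits_single` is a helper invented here to test whether a value is exactly representable in single precision.

```python
import struct
import sys

# Single precision: 23 fraction bits, actual exponent from -126 to +127.
single_smallest = 2.0 ** -126
single_largest = (2.0 - 2.0 ** -23) * 2.0 ** 127

# Double precision: 52 fraction bits, actual exponent from -1022 to +1023.
double_smallest = 2.0 ** -1022
double_largest = (2.0 - 2.0 ** -52) * 2.0 ** 1023

def fits_single(x):
    """True if x survives a round trip through a 32-bit float unchanged."""
    return struct.unpack(">f", struct.pack(">f", x))[0] == x

print(fits_single(single_smallest), fits_single(single_largest))
print(double_smallest == sys.float_info.min)   # smallest positive normalized double
print(double_largest == sys.float_info.max)    # largest finite double
print("%.3e" % single_smallest, "%.3e" % single_largest)
```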

Conversion from decimal value to binary :

Example 1 : convert 61.25 from decimal value to binary.

1. Integer part : 61₁₀ = (2⁶ – 1) – 2¹ = 11 1111₂ – 10₂ = 11 1101₂

2. Decimal part :

a. Multiply the fraction by 2. The digit to the left of the point (0 or 1) is the next binary digit. Repeat until the fractional part becomes 0.

b. If that digit is 0, multiply the result by 2 as usual.

c. If that digit is 1, drop the 1 and continue multiplying the remaining fraction by 2 until the fractional part becomes 0.

0.25 x 2 = 0.50

0.50 x 2 = 1.00

The binary value for the fraction part = .01

3. Answer : 61.25₁₀ = 11 1101.01₂

Example 2 : Convert 18.625 from decimal value to binary.

Step 1. Integer part : 18₁₀ = 2⁴ + 2¹ = 1 0000₂ + 10₂ = 1 0010₂

Step 2. Decimal part :

0.625 x 2 = 1.25

0.25 x 2 = 0.50

0.50 x 2 = 1.00

The binary value for the fraction part = .101

Step 3. Answer : 18.625₁₀ = 1 0010.101₂
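The multiply-by-2 procedure used in both examples can be sketched in Python as follows. This is only an illustration: `decimal_to_binary` and its `max_fraction_bits` cut-off are names chosen here, and `Fraction` is used so that the decimal input is held exactly.

```python
from fractions import Fraction

def decimal_to_binary(text, max_fraction_bits=32):
    """Convert a non-negative decimal string such as "61.25" to a binary string."""
    value = Fraction(text)
    integer_part = int(value)
    fraction = value - integer_part

    bits = []
    # Stop when the fraction reaches 0, or at the cut-off for endless fractions.
    while fraction != 0 and len(bits) < max_fraction_bits:
        fraction *= 2
        bit = int(fraction)      # digit left of the point: 0 or 1
        bits.append(str(bit))
        fraction -= bit          # drop the 1 before the next multiplication

    result = format(integer_part, "b")
    if bits:
        result += "." + "".join(bits)
    return result

print(decimal_to_binary("61.25"))    # 111101.01
print(decimal_to_binary("18.625"))   # 10010.101
```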

However, an infinite binary fraction occurs when the fractional part starts repeating, so an end result of 0 is never reached.

Example : Convert 16.1 from decimal to binary value.

1. Integer part : 16₁₀ = 1 0000₂

2. Decimal part :

0.10 x 2 = 0.20

0.20 x 2 = 0.40

0.40 x 2 = 0.80

0.80 x 2 = 1.60

0.60 x 2 = 1.20

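0.20 x 2 = 0.40 Repetition occurs: the fraction 0.20 has already appeared, so the digits 0011 repeat forever.

3. Answer : 16.1₁₀ = 1 0000.0 0011 0011…₂ (the group 0011 repeats without end)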
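A repeating fraction can be detected by remembering every fractional part seen so far. The sketch below assumes exact arithmetic with `Fraction`; the function name `binary_fraction_with_repeat` is invented for this example.

```python
from fractions import Fraction

def binary_fraction_with_repeat(text):
    """Return (non_repeating_bits, repeating_bits) for the fractional part of a decimal string."""
    fraction = Fraction(text) % 1
    seen = {}    # fractional part -> index of the bit produced from it
    bits = []
    while fraction != 0 and fraction not in seen:
        seen[fraction] = len(bits)
        fraction *= 2
        bit = int(fraction)
        bits.append(str(bit))
        fraction -= bit
    if fraction == 0:
        return "".join(bits), ""          # the fraction terminates
    start = seen[fraction]                # repetition begins at this bit
    return "".join(bits[:start]), "".join(bits[start:])

print(binary_fraction_with_repeat("16.1"))     # ('0', '0011')
print(binary_fraction_with_repeat("18.625"))   # ('101', '')
```

For 16.1 the first 0 is followed by the group 0011, which then repeats.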

Conversion from binary to decimal value :

1. List the powers of two from right to left according to the number of digits given.

2. Place each binary digit under its power of two.

3. Multiply each binary digit by the power of two above it and add up all the products.

Example :

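Positive exponents (2⁶ to 2⁰) belong to the digits left of the binary point; negative exponents (2^-1, 2^-2) belong to the digits right of it.

	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰	2^-1	2^-2
a)							0	1	1
b)	1	0	1	0	1	1	1	1	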

a) 0.11₂= 0 x 2⁰ + 1 x 2^-1+ 1 x 2^-2

=0.75₁₀

b) 101 0111.1₂ = 1 x 2⁶+0 x 2⁵+1 x 2⁴+0 x 2³+1 x 2²+1 x 2¹+1 x 2⁰+1 x 2^-1

= 87.5₁₀
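The same weighted sum can be written as a short Python sketch. The helper name `binary_to_decimal` is an assumption of this example; spaces in the input are ignored so the digit grouping used above can be typed as is.

```python
def binary_to_decimal(text):
    """Convert a binary string such as "101 0111.1" to its decimal value."""
    integer_digits, _, fraction_digits = text.replace(" ", "").partition(".")
    total = 0.0
    # Integer digits: the rightmost digit has weight 2^0, then 2^1, 2^2, ...
    for position, digit in enumerate(reversed(integer_digits)):
        total += int(digit) * 2.0 ** position
    # Fraction digits: weights 2^-1, 2^-2, ... from left to right.
    for position, digit in enumerate(fraction_digits, start=1):
        total += int(digit) * 2.0 ** -position
    return total

print(binary_to_decimal("0.11"))         # 0.75
print(binary_to_decimal("101 0111.1"))   # 87.5
```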

Conversion from decimal and binary value to floating-point representation:

If given a decimal value, follow all of the steps below.

If given a binary value, skip step 1.

1. Change the decimal value to binary.
2. Normalize the binary value (exactly one digit to the left of the binary point, and it must be 1, never 0).
3. Identify the sign.
4. Biased exponent = actual exponent + bias.
5. Change the biased exponent to binary. If it has fewer digits than the exponent field requires, add 0s on the left.
6. For the fraction, if it has fewer digits than the fraction field requires, add 0s on the right.

Example : -0.3125

Single precision	Double precision
1. Binary = -0.0101₂
2. Normalized = -1.01 x 2^-2
3. Sign = 1
4. Biased exponent = -2 + 127 = 125	Biased exponent = -2 + 1023 = 1021
5. Exponent = 111 1101₂ = 0111 1101₂	Exponent = 11 1111 1101₂ = 011 1111 1101₂
6. Fraction = 01000… (23 bits in total)	Fraction = 01000… (52 bits in total)

Answer:

Single precision :

Sign	Exponent	Fraction
1	0111 1101	0100 000… (23 bits in total)

Double precision :

Sign	Exponent	Fraction
1	011 1111 1101	0100 000… (52 bits in total)
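The six steps can be followed mechanically. The Python sketch below is an illustration under stated assumptions: it handles only nonzero values in the normalized range, it truncates extra fraction bits instead of rounding, and `encode_ieee754` is a name invented here.

```python
from fractions import Fraction

def encode_ieee754(text, exponent_bits=8, fraction_bits=23):
    """Return (sign, exponent field, fraction field) as bit strings for a decimal string."""
    value = Fraction(text)
    if value == 0:
        raise ValueError("zero has a special encoding not covered by this sketch")
    sign = "1" if value < 0 else "0"              # step 3
    magnitude = abs(value)

    # Step 2: normalize so that 1 <= magnitude < 2, counting the shifts.
    exponent = 0
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1

    # Steps 4 and 5: biased exponent, padded with 0s on the left.
    bias = 2 ** (exponent_bits - 1) - 1
    exponent_field = format(exponent + bias, "0{}b".format(exponent_bits))

    # Step 6: fraction bits after the hidden 1, padded with 0s on the right.
    fraction = magnitude - 1
    bits = []
    for _ in range(fraction_bits):
        fraction *= 2
        bit = int(fraction)
        bits.append(str(bit))
        fraction -= bit
    return sign, exponent_field, "".join(bits)

print(encode_ieee754("-0.3125"))           # single precision
print(encode_ieee754("-0.3125", 11, 52))   # double precision
```

Passing 8 and 23 gives single precision; 11 and 52 gives double precision.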

Conversion from floating-point representation to binary and decimal value :

In general, this is the reverse of converting a decimal/binary value to floating point.

Example 1 : Given the following single-precision representation,

Sign	Exponent	Fraction
1	1000 0101	0011 1001 000…

a. Identify the sign bit. Value 1 indicates negative sign.

b. Convert the biased exponent to decimal value. 1000 0101₂=133₁₀

c. Actual exponent = biased exponent – bias = 133 – 127 = 6, so the scale factor is 2⁶

d. For the fraction, write "1." followed by the stored fraction bits.

Significand = 1.0011 1001 000…

· The leading 1 is the hidden bit that was dropped during normalization.

· The trailing 0s on the right can be dropped.

e. Answer in binary = -(1.0011 1001) x 2⁶ = -(100 1110.01₂)

f. Answer in decimal = -(1 x 2⁶+1 x 2³+1 x 2²+1 x 2¹+1 x 2^-2)= -78.25₁₀

Example 2 : Given the following double-precision representation,

Sign	Exponent	Fraction
0	100 0000 0111	0010 0001 000…

§ It’s a positive value.

§ 100 0000 0111₂ =1 x 2¹⁰+1 x 2²+1 x 2¹+1 x 2⁰=1031₁₀

§ Actual exponent = 1031 – 1023 = 8, so the scale factor is 2⁸

§ Fraction value = 1.0010 0001

§ Answer in binary = 1.0010 0001 x 2⁸ = 1 0010 0001₂

§ Answer in decimal = 1 x 2⁸+1 x 2⁵+1 x 2⁰=289₁₀
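Both decoding examples can be reproduced with a short Python sketch. It assumes a normalized number (hidden bit 1) and takes the sign as stated in each example (1 for Example 1, 0 for Example 2); `decode_ieee754` is a name chosen for this illustration.

```python
def decode_ieee754(sign, exponent_field, fraction_field, bias=127):
    """Decode bit strings of a normalized number; trailing 0s of the fraction may be omitted."""
    exponent = int(exponent_field.replace(" ", ""), 2) - bias   # actual exponent
    significand = 1.0                                           # hidden leading 1
    for position, digit in enumerate(fraction_field.replace(" ", ""), start=1):
        significand += int(digit) * 2.0 ** -position
    value = significand * 2.0 ** exponent
    return -value if sign == "1" else value

print(decode_ieee754("1", "1000 0101", "0011 1001"))                   # -78.25
print(decode_ieee754("0", "100 0000 0111", "0010 0001", bias=1023))    # 289.0
```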

Addition & Subtraction for binary value.

1. Express both operands with the same exponent. Shift the significand of the operand with
the smaller exponent to the right until its exponent matches the larger one.

2. Add / subtract the significands according to the sign bits. Keep the exponent
unchanged.

3. Normalize the result.

4. Round the significand if necessary and renormalize if rounding generates a carry.

5. Check for overflow / underflow (for single precision, -126 <= exponent <= 127). If the
exponent is out of range, raise an exception; otherwise the operation is done.

§ Overflow = a result that is too large to be represented.

§ Underflow = a nonzero result that is too small in magnitude to be represented as a normalized number.

For binary values, example : 1.1111 x 2^-2 + 1.0101 x 2^-3 (and the matching subtraction)

Addition	Subtraction
i. 1.0101 x 2^-3 = 0.10101 x 2^-2
ii. 1.11110 + 0.10101 = 10.10011, so Result = 10.10011 x 2^-2	1.11110 - 0.10101 = 1.01001, so Result = 1.01001 x 2^-2
iii. Normalize = 1.010011 x 2^-1	Normalize = 1.01001 x 2^-2
iv. You may round off if necessary
v. Answer = 1.010011 x 2^-1	Answer = 1.01001 x 2^-2
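Steps 1 to 3 of the procedure can be sketched in Python. This is an illustration only: it uses exact `Fraction` arithmetic, leaves out rounding and the overflow / underflow check, and the names `parse_binary` and `add_aligned` are invented for this example.

```python
from fractions import Fraction

def parse_binary(text):
    """Parse a binary string such as "1.1111" into an exact Fraction."""
    integer_digits, _, fraction_digits = text.partition(".")
    return Fraction(int(integer_digits + fraction_digits, 2), 2 ** len(fraction_digits))

def add_aligned(sig_a, exp_a, sig_b, exp_b, subtract=False):
    """Return (significand, exponent) of a + b or a - b, normalized."""
    a, b = parse_binary(sig_a), parse_binary(sig_b)
    # Step 1: shift the operand with the smaller exponent to the right.
    exponent = max(exp_a, exp_b)
    a /= 2 ** (exponent - exp_a)
    b /= 2 ** (exponent - exp_b)
    # Step 2: add or subtract the significands; the exponent is kept.
    significand = a - b if subtract else a + b
    # Step 3: normalize so that 1 <= |significand| < 2 (zero is left alone).
    while abs(significand) >= 2:
        significand /= 2
        exponent += 1
    while significand != 0 and abs(significand) < 1:
        significand *= 2
        exponent -= 1
    return significand, exponent

print(add_aligned("1.1111", -2, "1.0101", -3))                  # 1.010011 x 2^-1
print(add_aligned("1.1111", -2, "1.0101", -3, subtract=True))   # 1.01001 x 2^-2
```

The significand is returned as a fraction: 83/64 equals 1.010011₂ and 41/32 equals 1.01001₂.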

For decimal values, example : 9.95 x 10² + 0.85 x 10¹ (and the matching subtraction)

Addition	Subtraction
i. 0.85 x 10¹ = 0.085 x 10²
ii. 9.950 + 0.085 = 10.035, so Result = 10.035 x 10²	9.950 - 0.085 = 9.865, so Result = 9.865 x 10²
iii. Normalize = 1.0035 x 10³	Normalize = 9.865 x 10²
iv. You may round off if necessary
v. Answer = 1.0035 x 10³	Answer = 9.865 x 10²

Multiplication of decimal value

a. Add both exponents to find the new exponent.
b. Multiply both mantissas together and keep the product with the new exponent.
c. Normalize the result.
d. Round it if necessary.

Example : 1.805 x 10⁸ x 6.22 x 10^-4

a. New exponent = 8 + (-4) = 4

b. 1.805 x 6.22 = 11.2271, so the result is 11.2271 x 10⁴

c. Normalize = 1.12271 x 10⁵

d. Round off if necessary. No rounding is needed here, so the answer is 1.12271 x 10⁵
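The multiplication steps can be sketched with Python's `decimal` module, which keeps decimal mantissas exact. The function name `multiply_scientific` is an assumption of this example, and the rounding step is left out.

```python
from decimal import Decimal

def multiply_scientific(m_a, e_a, m_b, e_b):
    """Multiply m_a x 10^e_a by m_b x 10^e_b; mantissas are given as decimal strings."""
    exponent = e_a + e_b                       # add the exponents
    mantissa = Decimal(m_a) * Decimal(m_b)     # multiply the mantissas
    # Normalize so that 1 <= |mantissa| < 10 (zero is left alone).
    while abs(mantissa) >= 10:
        mantissa /= 10
        exponent += 1
    while mantissa != 0 and abs(mantissa) < 1:
        mantissa *= 10
        exponent -= 1
    return mantissa, exponent

mantissa, exponent = multiply_scientific("1.805", 8, "6.22", -4)
print(mantissa, "x 10^", exponent)
```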

Sunday, 21 October 2012
