[SOLVED] CS Floating Point


Floating Point

Agenda
- Fixed point representations
- Big and Small Numbers
- Scientific Notation
- IEEE 754 floating point standard
  - Special symbols
  - Underflow, overflow
- Floating point addition and multiplication
- Material from Section 3.5 of textbook

How to Represent Real Numbers?

Real Numbers
- Positional notation allows for fractions
  $a_n a_{n-1} \dots a_1 a_0 \,.\, a_{-1} a_{-2} \dots a_{-m}$
- Let's start with fixed point representation
  - Choose n and m
  - Radix point is always in the same position
  - Easy to implement
  - Limited range
- $152.3_{10}$
- $1011.01_2$

Binary to Decimal
- Integers scaled by an appropriate factor
- Direct expansion with positional weights
- $0.11001_2$
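
The direct expansion with positional weights can be sketched in a few lines of Java. This is a minimal illustration, not part of the slides; the class and method names (`BinaryFraction`, `toDecimal`) are made up for the example, and the input is assumed to contain only the bits after the binary point.

```java
class BinaryFraction {
    // Expands the bits after the binary point with weights 2^-1, 2^-2, 2^-3, ...
    static double toDecimal(String fractionBits) {
        double value = 0.0;
        double weight = 0.5; // weight of the first bit after the point
        for (char bit : fractionBits.toCharArray()) {
            if (bit == '1') {
                value += weight;
            }
            weight /= 2; // the next bit is worth half as much
        }
        return value;
    }

    public static void main(String[] args) {
        // 0.11001 (binary) = 1/2 + 1/4 + 1/32
        System.out.println(toDecimal("11001")); // prints 0.78125
    }
}
```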

Binary to Hexadecimal
- Use the same trick as before: group the bits in fours, starting at the binary point
- $0.110101001_2$
- $0.2BE_{16}$

Decimal to Binary
- Multiply by 2 and note the integer part
- Subtract integer part and repeat until no fraction left
- $0.625_{10}$
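
The multiply-by-2 procedure above can be sketched as follows. The names (`DecimalFraction`, `toBinary`) and the `maxBits` cut-off are assumptions added for the example; the cut-off is there because the loop would otherwise run for a long time on fractions that never reach zero.

```java
class DecimalFraction {
    // Repeatedly multiply by 2; the integer part of each product is the next bit.
    static String toBinary(double fraction, int maxBits) {
        StringBuilder bits = new StringBuilder();
        while (fraction != 0.0 && bits.length() < maxBits) {
            fraction *= 2;
            if (fraction >= 1.0) {
                bits.append('1');
                fraction -= 1.0; // subtract the integer part and repeat
            } else {
                bits.append('0');
            }
        }
        return "0." + (bits.length() == 0 ? "0" : bits.toString());
    }

    public static void main(String[] args) {
        // 0.625 -> 1.25 (bit 1), 0.25 -> 0.5 (bit 0), 0.5 -> 1.0 (bit 1), no fraction left
        System.out.println(toBinary(0.625, 16)); // prints 0.101
    }
}
```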

Decimal to Binary
- Can all decimal fractions be expressed exactly in Binary?
- $0.1_{10}$
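
One way to explore the question is to ask Java for the exact value it stores for a `float`. The sketch below relies on `new BigDecimal(double)`, which keeps the exact binary value, and on the fact that widening a `float` to a `double` is exact. The class and method names are made up for the example.

```java
import java.math.BigDecimal;

class ExactFloatValue {
    // Returns the exact decimal expansion of the value stored in the float.
    static String exactValue(float f) {
        return new BigDecimal(f).toString();
    }

    public static void main(String[] args) {
        System.out.println(exactValue(0.625f)); // prints 0.625, stored exactly
        System.out.println(exactValue(0.1f));   // prints 0.100000001490116119384765625
    }
}
```

The second result shows that 0.1 has no finite binary expansion, so the nearest representable value is stored instead.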

How to Represent Small and Big Numbers in Decimal?

How big is Coronavirus?
Particle                      Size (meter)
PM10                          0.00001
Red Blood Cell                0.000007
PM2.5                         0.0000025
Bacteria                      0.0000005
Coronavirus                   0.0000001
Particles filtered by masks   0.000000007

What numbers do we need?
$1.0 \times 10^{-9}$           Seconds per nanosecond
$3.15576 \times 10^{9}$        Seconds per century
$1.47 \times 10^{13}$          US National Debt
$2.99792458 \times 10^{10}$    Speed of light in cm/s
$6.67300 \times 10^{-11}$      Gravitational constant
$1.98892 \times 10^{30}$       Mass of sun in kilograms
$2.08 \times 10^{22}$          Distance to Andromeda in m
$1.0 \times 10^{-15}$          Size of a proton in meters

Scientific Notation for Decimal
- We use scientific notation for big and small numbers
  - Use a single digit to the left of the decimal point
  - Multiplied by base (e.g., 10) raised to some exponent
  - Use e or E to denote the exponent part
    $1.0 \times 10^{-15}$ = 1.0e-15 = 1.0E-15
- A normalized number has no leading zero
  - $1.0_{10} \times 10^{-9}$ normalized
  - $0.1_{10} \times 10^{-8}$ not normalized
  - $10.0_{10} \times 10^{-10}$ not normalized

Scientific Notation for Binary
- How do we represent very small and big numbers in Binary?
- Binary numbers can be written in scientific notation too
  $1.0_2 \times 2^{-1}$
  $1.1_2 \times 2^{11}$

How to Represent Floating Points?

Floating Point
- The binary point is not fixed, but instead can move based on the exponent
- Normalized Binary number always has the form:
  $1.xxxxxxx_2 \times 2^{yyyy}$
  - x is the fraction / significand / coefficient / mantissa
  - y is the exponent
  - always has a one to the left of the binary point

Floating Point Standards
- Many options for representing floating point
  - Number of bits for the fraction
  - Number of bits for the exponent
  - How to represent zero?
  - How to represent negative numbers?
- Standards are important for exchanging data

Floating Point Standards
- IEEE 754 used in nearly all computers today
  - Defines two representations
    - single precision (32 bits)
    - double precision (64 bits)
  - In high level languages, data of this type is called
    - float (for single precision)
    - double (for double precision)

Single Precision
Sign     Exponent   Fraction
1 bit    8 bits     23 bits
S        E          F
- A real number can be described as $(-1)^S \times (1+F) \times 2^E$
- IEEE 754 does not use 2's complement
- Clarification:
  - Fraction refers to the 23-bit number F
  - Mantissa refers to the 24-bit number 1+F
- Numbers are in normalized form. Why? Base 2
Single Precision
Sign     Exponent   Fraction
1 bit    8 bits     23 bits
S        E          F
- A real number can be described as $(-1)^S \times (1+F) \times 2^E$
- Without normalization, one value has several encodings:
  $0.0011_2 \times 2^{0} = 0.011_2 \times 2^{-1} = 0.11_2 \times 2^{-2}$
- All equivalent to the same real number. The encoding is wasteful

Biased Notation
Sign     Exponent   Fraction
1 bit    8 bits     23 bits
S        E          F
- In IEEE 754, actual representation is
  $(-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
- In single-precision, bias = 127
- Represent negative exponents
- Want easy integer style comparison / sorting
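
The biased representation can be inspected directly in Java. The sketch below pulls the three fields out of a `float` with `Float.floatToIntBits` and rebuilds the value from the formula above. The class and method names are made up for the example, and `rebuild` covers normalized numbers only (Exponent between 1 and 254).

```java
class FloatFields {
    // The three fields of a single precision value, read from its raw bits
    static int sign(float f)     { return Float.floatToIntBits(f) >>> 31; }
    static int exponent(float f) { return (Float.floatToIntBits(f) >>> 23) & 0xFF; }
    static int fraction(float f) { return Float.floatToIntBits(f) & 0x7FFFFF; }

    // (-1)^S x (1 + Fraction) x 2^(Exponent - Bias), with Bias = 127
    static double rebuild(int sign, int exponent, int fraction) {
        double mantissa = 1.0 + fraction / (double) (1 << 23);
        double magnitude = Math.scalb(mantissa, exponent - 127); // times 2^(Exponent - 127)
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float x = 6.5f; // 1.101 (binary) x 2^2
        System.out.println(sign(x));     // prints 0
        System.out.println(exponent(x)); // prints 129, which is 2 + 127
        System.out.println(Integer.toBinaryString(fraction(x))); // 101 followed by 20 zeros
        System.out.println(rebuild(sign(x), exponent(x), fraction(x))); // prints 6.5
    }
}
```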

Single Precision Floating Point
Sign     Exponent   Fraction
1 bit    8 bits     23 bits
S        E          F
- Largest number?
- Smallest number?
- How many numbers can we represent?
  $(-1)^S \times (1+F) \times 2^E$

Single Precision Floating Point
- Convert -0.75 from decimal to single precision
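
A hand conversion can be checked against the bits Java actually stores. By hand, $0.75_{10} = 0.11_2 = 1.1_2 \times 2^{-1}$, so the sign is 1, the stored exponent is $-1 + 127 = 126 = 01111110_2$, and the fraction is 1 followed by 22 zeros. The helper below (names are made up for the example) prints the three fields separated by spaces.

```java
class SinglePrecisionBits {
    // Formats the 32 raw bits as "sign exponent fraction"
    static String bits(float f) {
        String raw = Integer.toBinaryString(Float.floatToIntBits(f));
        String padded = String.format("%32s", raw).replace(' ', '0'); // restore leading zeros
        return padded.substring(0, 1) + " "
             + padded.substring(1, 9) + " "
             + padded.substring(9);
    }

    public static void main(String[] args) {
        // Expect sign 1, exponent 01111110, fraction 1 followed by 22 zeros
        System.out.println(bits(-0.75f));
    }
}
```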

Single Precision Floating Point
- Convert a single precision bit pattern to decimal:

Double Precision
Sign     Exponent   Fraction
1 bit    11 bits    52 bits
S        E          F
- More bits!
- More precision
- Double precision uses a bias of 1023
- Can do more before underflow / overflow
  Approximately 1E-308 to 1E308

Double Precision
- Convert 3.25 from decimal to double precision
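
The same check works for double precision using `Double.doubleToLongBits`. By hand, $3.25_{10} = 11.01_2 = 1.101_2 \times 2^{1}$, so the sign is 0, the stored exponent is $1 + 1023 = 1024$, and the fraction starts with 101 followed by zeros. The class and method names below are made up for the example.

```java
class DoubleFields {
    // 1 sign bit, 11 exponent bits, 52 fraction bits
    static long sign(double d)     { return Double.doubleToLongBits(d) >>> 63; }
    static long exponent(double d) { return (Double.doubleToLongBits(d) >>> 52) & 0x7FF; }
    static long fraction(double d) { return Double.doubleToLongBits(d) & 0xFFFFFFFFFFFFFL; }

    public static void main(String[] args) {
        double x = 3.25;
        System.out.println(sign(x));                       // prints 0
        System.out.println(exponent(x));                   // prints 1024
        System.out.println(Long.toHexString(fraction(x))); // prints a000000000000
    }
}
```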

Tricky Questions
- What is the largest number that can be represented in single precision?
- What is the smallest number that can be represented in single precision?
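
Java exposes the answers as constants on `Float`, which makes it easy to compare them with the field layout. The sketch below builds each limit from powers of two with `Math.scalb`; the class and method names are made up for the example.

```java
class FloatLimits {
    // Largest finite value: all fraction bits set, largest normal exponent
    static float largest() {
        float mantissa = 2.0f - Math.scalb(1.0f, -23); // 1.111...1 (binary), 23 fraction ones
        return Math.scalb(mantissa, 127);
    }

    // Smallest positive normalized value: 1.0 x 2^-126
    static float smallestNormalized() {
        return Math.scalb(1.0f, -126);
    }

    // Smallest positive denormalized value: 2^-149
    static float smallestDenormalized() {
        return Math.scalb(1.0f, -149);
    }

    public static void main(String[] args) {
        System.out.println(largest() == Float.MAX_VALUE);              // prints true
        System.out.println(smallestNormalized() == Float.MIN_NORMAL);  // prints true
        System.out.println(smallestDenormalized() == Float.MIN_VALUE); // prints true
    }
}
```

Note that "smallest" is ambiguous: the most negative single precision number is `-Float.MAX_VALUE`, while the values above are the smallest positive magnitudes.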

Floating Point Arithmetic

Floating Point Addition
- Align the radix points
  - Make the smaller number match the larger
- Add the significands
- Normalize the result
  - What if one number is positive and the other negative? May need to shift a lot!
  - Check for overflow or underflow when shifting!
- Round so the number fits in the available digits/bits
  - If bad luck when rounding, renormalize

Floating Point Addition
9.999e1 + 1.610e-1 with 4 digits precision
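
The addition steps can be traced on this example with a small decimal sketch. It is a simplification, not how hardware does it: significands are held as 4-digit integers (so 9.999 is 9999 and the value is significand x 10^(exponent - 3)), only positive operands are handled, and digits shifted out during alignment are simply dropped, whereas real hardware keeps extra guard digits. All names are made up for the example.

```java
class DecimalFloatAdd {
    // Returns {significand, exponent} of the 4-digit sum of two positive numbers.
    static int[] add(int sigA, int expA, int sigB, int expB) {
        // Step 1: align the radix points by shifting the smaller number right
        while (expA < expB) { sigA /= 10; expA++; }
        while (expB < expA) { sigB /= 10; expB++; }

        // Step 2: add the significands
        int sum = sigA + sigB;
        int exp = expA;

        // Steps 3 and 4: normalize and round the dropped digit;
        // the loop repeats if rounding pushes the sum back to 5 digits
        while (sum >= 10000) {
            int dropped = sum % 10;
            sum = sum / 10 + (dropped >= 5 ? 1 : 0);
            exp++;
        }
        return new int[] { sum, exp };
    }

    public static void main(String[] args) {
        // 9.999e1 + 1.610e-1: align 1.610e-1 to 0.016e1, add to get 10.015e1,
        // normalize to 1.0015e2, round to 1.002e2
        int[] r = add(9999, 1, 1610, -1);
        System.out.println(r[0] + " x 10^(" + r[1] + " - 3)"); // prints 1002 x 10^(2 - 3)
    }
}
```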

Floating Point Multiplication
- Add the exponents
- Multiply the significands
- Normalize the result (check for overflow)
- Round to fit in the available digits/bits
  - Normalize again if necessary
- Compute the sign of the result
  - Positive if signs of operands match, negative otherwise

Floating Point Multiplication
1.110e10 times 9.200e-5 with 4 digits precision
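
The multiplication steps can be traced the same way. As before this is a decimal sketch under assumptions: significands are 4-digit integers (1.110 is 1110, value is significand x 10^(exponent - 3)), and the names are made up for the example.

```java
class DecimalFloatMultiply {
    // Returns {sign (1 = negative), significand, exponent} of the 4-digit product.
    static int[] multiply(boolean negA, int sigA, int expA,
                          boolean negB, int sigB, int expB) {
        // Step 1: add the exponents
        int exp = expA + expB;

        // Step 2: multiply the significands; the product carries 6 fraction digits
        int product = sigA * sigB;

        // Step 3: normalize, deciding how many digits must be dropped
        int divisor = 1000;              // back to 3 fraction digits
        if (product / divisor >= 10000) {
            divisor *= 10;               // two digits left of the point: shift once
            exp++;
        }

        // Step 4: round half up, then normalize again if rounding carried
        int sig = (product + divisor / 2) / divisor;
        if (sig >= 10000) {
            sig /= 10;
            exp++;
        }

        // Step 5: the result is negative only if the operand signs differ
        int sign = (negA != negB) ? 1 : 0;
        return new int[] { sign, sig, exp };
    }

    public static void main(String[] args) {
        // 1.110e10 x 9.200e-5: exponents give 5, significands give 10.212,
        // normalize to 1.0212e6, round to 1.021e6
        int[] r = multiply(false, 1110, 10, false, 9200, -5);
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints 0 1021 6
    }
}
```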

Special Cases?

Special symbols
Exponent   Fraction   Object represented
0          0          0
0          Nonzero    denormalized number
1-254      Anything   floating point number
255        0          infinity
255        Nonzero    NaN (Not a Number)
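
The table translates directly into a small classifier over the raw bits of a single precision value. The class and method names are made up for the example; the returned labels follow the table.

```java
class FloatClassifier {
    // Classifies a single precision bit pattern by its exponent and fraction fields.
    static String classify(int bits) {
        int exponent = (bits >>> 23) & 0xFF;
        int fraction = bits & 0x7FFFFF;
        if (exponent == 0) {
            return fraction == 0 ? "0" : "denormalized number";
        }
        if (exponent == 255) {
            return fraction == 0 ? "infinity" : "NaN";
        }
        return "floating point number"; // exponent 1-254, any fraction
    }

    public static void main(String[] args) {
        System.out.println(classify(Float.floatToIntBits(0.0f)));
        System.out.println(classify(Float.floatToIntBits(Float.MIN_VALUE)));
        System.out.println(classify(Float.floatToIntBits(1.5f)));
        System.out.println(classify(Float.floatToIntBits(Float.NEGATIVE_INFINITY)));
        System.out.println(classify(Float.floatToIntBits(Float.NaN)));
    }
}
```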

Denormalized Numbers
- The exponent 00000000 is used to represent a set of numbers in the tiny interval $(-2^{-126}, 2^{-126})$
- This includes the number 0
- Called denormalized numbers
  - Smallest normalized is $1.0 \times 2^{-126} = 2^{-126}$
  - Smallest denormalized is $0.000\ldots01_2 \times 2^{-126} = 2^{-149}$
- Allows us to squeeze more precision out of a floating point operation
- Tricky to implement. We will come back to this topic later

Unusual events
- Nonzero divided by zero
  - Not the end of the world!
  - Results in positive or negative infinity
- 0/0 (invalid), or subtracting infinity from infinity
  - Results in NaN
- Notes on NaN
  - Using NaN in math always results in NaN
  - Allows us to avoid tests or decisions until a later time in our program
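
These events are easy to reproduce in Java with `float` operands. The sketch below is illustrative only and the names are made up. Note that the same divisions on `int` operands would throw an `ArithmeticException` instead.

```java
class UnusualEvents {
    static float divide(float a, float b) {
        return a / b; // no exception for floating point division by zero
    }

    public static void main(String[] args) {
        float inf = Float.POSITIVE_INFINITY;
        System.out.println(divide(1.0f, 0.0f));  // prints Infinity
        System.out.println(divide(-1.0f, 0.0f)); // prints -Infinity
        System.out.println(divide(0.0f, 0.0f));  // prints NaN
        System.out.println(inf - inf);           // prints NaN

        float nan = divide(0.0f, 0.0f);
        System.out.println(nan + 1.0f);          // prints NaN
        System.out.println(nan == nan);          // prints false, so test with Float.isNaN
        System.out.println(Float.isNaN(nan));    // prints true
    }
}
```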

What can go wrong?

Overflow / Underflow
- Largest number that can be represented in single precision:
  Approximately $2.0 \times 2^{127} \approx 3.4 \times 10^{38}$
- Smallest normalized fraction that can be represented in single precision:
  $1.0 \times 2^{-126} \approx 1.2 \times 10^{-38}$
- Overflow: representing a number whose magnitude is larger than the largest one above
- Underflow: representing a nonzero number whose magnitude is smaller than the smallest one above
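
Both effects can be observed in Java by stepping just past the single precision limits. This is a minimal sketch with made-up names. It also shows that underflow is gradual: values below the smallest normalized number first become denormalized before they reach zero.

```java
class OverflowUnderflow {
    static float doubleIt(float x) { return x * 2.0f; }
    static float halveIt(float x)  { return x / 2.0f; }

    public static void main(String[] args) {
        // Overflow: the result is too large for a float and becomes infinity
        System.out.println(doubleIt(Float.MAX_VALUE));      // prints Infinity

        // Gradual underflow: half of the smallest normalized value is a denormalized number
        System.out.println(halveIt(Float.MIN_NORMAL) > 0);  // prints true

        // Underflow to zero: half of the smallest denormalized value rounds to 0
        System.out.println(halveIt(Float.MIN_VALUE));       // prints 0.0
    }
}
```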

Loss of Precision
https://imgur.com/r/totallynotrobots/lsNcv

Compare these for loops
    for ( int i = 0; i <= 10; i += 1 ) {
        System.out.println( i/10f );
    }
    for ( float y = 0; y <= 1; y += 0.1f ) {
        System.out.println( y );
    }
Same or different?

Questions
- Represent $0.1_{10}$ in IEEE 754 single precision floating point
- Represent $1.1_{10}$ in IEEE 754 single precision floating point

Review and more information
- Big and Small Numbers
- Scientific Notation
- IEEE 754 floating point standard
- Floating point addition and multiplication
- Material from Section 3.5 of textbook
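
The two for loops from the Loss of Precision slide can be run side by side to answer "Same or different?". The sketch below wraps each loop in a method that also counts its iterations; the wrapper class and method names are made up for the example, while the loop headers and bodies are the ones from the slide.

```java
class LoopComparison {
    // Divides an exact integer counter by 10 on every iteration
    static int intLoop() {
        int iterations = 0;
        for (int i = 0; i <= 10; i += 1) {
            System.out.println(i / 10f);
            iterations++;
        }
        return iterations;
    }

    // Accumulates 0.1f, so the rounding error of each addition carries forward
    static int floatLoop() {
        int iterations = 0;
        for (float y = 0; y <= 1; y += 0.1f) {
            System.out.println(y);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("int loop ran " + intLoop() + " times");     // 11 times
        System.out.println("float loop ran " + floatLoop() + " times"); // 10 times
    }
}
```

The loops differ: the accumulated sum drifts above the exact decimal values, so after ten additions `y` is slightly greater than 1 and the second loop never prints a final value near 1.0.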
