# Integer and floating point operations

- Author: Jenny Kim

# Integer Representation and Operation

## Size and range of types

`char` ≤ `short` ≤ `int` ≤ `long` (each type is at least as wide as the one before it)
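The actual widths depend on the platform, so a quick check can help. The following is a minimal sketch; the helper names `sizes_are_ordered` and `print_type_ranges` are made up for illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Returns 1 if sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long). */
static int sizes_are_ordered(void) {
    return sizeof(char) <= sizeof(short) &&
           sizeof(short) <= sizeof(int) &&
           sizeof(int) <= sizeof(long);
}

/* Prints the storage width in bits and the range of each type on this platform. */
static void print_type_ranges(void) {
    printf("char : %2zu bits, %d to %d\n", sizeof(char) * CHAR_BIT, CHAR_MIN, CHAR_MAX);
    printf("short: %2zu bits, %d to %d\n", sizeof(short) * CHAR_BIT, SHRT_MIN, SHRT_MAX);
    printf("int  : %2zu bits, %d to %d\n", sizeof(int) * CHAR_BIT, INT_MIN, INT_MAX);
    printf("long : %2zu bits, %ld to %ld\n", sizeof(long) * CHAR_BIT, LONG_MIN, LONG_MAX);
}
```

Calling `print_type_ranges()` from `main` lists the sizes for the current compiler and target.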

## Overflow

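- Unsigned: overflow occurred if `x + y < x` (addition) or `x - y > x` (subtraction).

- Signed: overflow occurred if `p + p = n` or `n + n = p`, that is, two positives give a negative result or two negatives give a positive one.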
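These checks might be written in C roughly as follows. The function names are illustrative. Because signed overflow is undefined behavior in C, the signed version tests the operands before adding instead of inspecting the sign of the result.

```c
#include <limits.h>

/* Unsigned addition wraps modulo 2^N, so it overflowed iff the sum is smaller than an operand. */
static int uadd_overflows(unsigned x, unsigned y) {
    return x + y < x;
}

/* Unsigned subtraction borrowed iff the difference is larger than the minuend. */
static int usub_overflows(unsigned x, unsigned y) {
    return x - y > x;
}

/* Signed addition: check the operands first, since the overflowing add itself is undefined. */
static int sadd_overflows(int x, int y) {
    if (x > 0 && y > 0) return x > INT_MAX - y;  /* p + p would turn negative */
    if (x < 0 && y < 0) return x < INT_MIN - y;  /* n + n would turn positive */
    return 0;                                    /* mixed signs cannot overflow */
}
```

For example, `uadd_overflows(UINT_MAX, 1)` and `sadd_overflows(INT_MAX, 1)` both return 1.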

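- In C, when a signed `int` is compared with an `unsigned int`, the signed operand is converted to unsigned before the comparison, so a negative value compares as a large positive number.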
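A small sketch of the pitfall, with hypothetical helper names. `mixed_less_than` uses the plain comparison; `safe_less_than` is one possible way to compare by mathematical value.

```c
/* Plain mixed comparison: s is converted to unsigned before comparing. */
static int mixed_less_than(int s, unsigned u) {
    return s < u;
}

/* Compares by mathematical value: any negative s is smaller than every unsigned u. */
static int safe_less_than(int s, unsigned u) {
    return s < 0 || (unsigned)s < u;
}
```

With these, `mixed_less_than(-1, 1u)` is 0 because `-1` becomes `UINT_MAX`, while `safe_less_than(-1, 1u)` is 1.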

## Shift

- logical: shift in 0's
- arithmetic: replicate MSB
- Dividing a negative number in the 2's complement system by an arithmetic right shift rounds down to the next smaller integer (toward negative infinity), not toward 0 as desired.
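The two kinds of shift can be illustrated as below. This sketch assumes the compiler implements `>>` on a negative signed value as an arithmetic shift. The C standard leaves that implementation-defined, though it is what mainstream compilers do. The function names are illustrative.

```c
#include <stdint.h>

/* Logical shift: unsigned operands always shift in zeros from the left. */
static uint32_t logical_shift_right(uint32_t x, unsigned k) {
    return x >> k;
}

/* Arithmetic shift (assumed compiler behavior): the sign bit is replicated. */
static int32_t arithmetic_shift_right(int32_t x, unsigned k) {
    return x >> k;
}
```

Under that assumption, `arithmetic_shift_right(-7, 1)` gives -4, whereas `-7 / 2` in C gives -3.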

## Biasing in division by shifting

- If `x < 0`, add the bias `2^k - 1` before shifting.
- Then shift right by `k`.
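A minimal sketch of the biasing trick, assuming `0 <= k < 31` and an arithmetic right shift for negative values (implementation-defined in C). `div_pow2` is a hypothetical name.

```c
#include <stdint.h>

/* Divides x by 2^k, rounding toward zero like C's `/` operator. */
static int32_t div_pow2(int32_t x, unsigned k) {
    /* Bias of 2^k - 1 is applied only to negative inputs. */
    int32_t bias = (x < 0) ? (int32_t)((1u << k) - 1u) : 0;
    return (x + bias) >> k;
}
```

For example, `div_pow2(-7, 1)` computes `(-7 + 1) >> 1`, which is -3, matching `-7 / 2`.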

## Binary Multiplication

- Multiplying two n-bit numbers yields at most a 2n-bit product.
- When the operands are signed, the partial products must be sign-extended (out to 2n bits).
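In C the 2n-bit product can be obtained by widening one operand before multiplying. This is a sketch for n = 32 with illustrative function names; the cast to `int64_t` performs the sign extension for the signed case.

```c
#include <stdint.h>

/* Full 64-bit product of two unsigned 32-bit values. */
static uint64_t mul_u32_full(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Full 64-bit product of two signed 32-bit values; the cast sign-extends a. */
static int64_t mul_s32_full(int32_t a, int32_t b) {
    return (int64_t)a * b;
}
```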

## Binary Division

- Dividing two n-bit numbers may yield an n-bit quotient and an n-bit remainder.
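One common way to divide in binary is restoring (shift-and-subtract) division. The notes do not say which scheme is meant, so the following is only an illustrative sketch for unsigned 32-bit operands. The type `udiv_result` and function `udivide` are made-up names, and the divisor is assumed to be nonzero.

```c
#include <stdint.h>

typedef struct {
    uint32_t quot;
    uint32_t rem;
} udiv_result;

/* Restoring division: bring down one dividend bit at a time, subtract when possible. */
static udiv_result udivide(uint32_t n, uint32_t d) {
    uint32_t q = 0;
    uint64_t r = 0; /* one extra bit of headroom for the shifted remainder */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (1u << i);
        }
    }
    udiv_result out = { q, (uint32_t)r };
    return out;
}
```

For example, `udivide(17, 5)` yields quotient 3 and remainder 2.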

# Floating Point

## Floating Point, Base 10

```
1.2345 * 10^{exp}
```

- Bias = 4
- Stored as 12345[exp], where the stored exponent digit is the actual exponent plus the bias (e.g. 123459 = 1.2345 * 10^5, since 9 - 4 = 5)

**Not associative**
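The stored-digit encoding can be unpacked with a little integer arithmetic. This is a sketch assuming a six-digit stored value whose last digit is the biased exponent; `dec_fp` and `decode_dec_fp` are hypothetical names.

```c
typedef struct {
    int digits;   /* five significand digits, e.g. 12345 meaning 1.2345 */
    int exponent; /* actual power of ten after removing the bias */
} dec_fp;

/* Splits a stored value such as 123459 into significand digits and exponent. */
static dec_fp decode_dec_fp(int stored) {
    dec_fp r;
    r.digits = stored / 10;
    r.exponent = stored % 10 - 4; /* bias = 4 */
    return r;
}
```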

## Fixed Point, Base 2

- Radix point assumed to be in a fixed location for all numbers.
- Floating point allows the radix point to be in a different location for each value.
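As an illustration of fixed point, here is a minimal sketch with 8 fraction bits, so the stored integer is the value times 256. The format and the names `fix8`, `fix_from_int`, `fix_mul`, and `fix_to_double` are assumptions for this example, not something the notes prescribe.

```c
#include <stdint.h>

#define FRAC_BITS 8 /* radix point fixed 8 bits from the right */

typedef int32_t fix8;

/* Converts an integer to fixed point by scaling by 2^FRAC_BITS. */
static fix8 fix_from_int(int x) {
    return x * (1 << FRAC_BITS);
}

/* Multiplies two fixed-point values; the raw product carries 2*FRAC_BITS fraction bits. */
static fix8 fix_mul(fix8 a, fix8 b) {
    return (fix8)(((int64_t)a * b) / (1 << FRAC_BITS));
}

/* Converts back to double for display. */
static double fix_to_double(fix8 a) {
    return a / (double)(1 << FRAC_BITS);
}
```

Here 1.5 is stored as `0x180`, and `fix_mul(0x180, 0x180)` gives `0x240`, which is 2.25.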

## Floating Point, Base 2

```
+- b.bbb * 2^{+- exp}
```

`[sign] b.[frac] * 2^[exp]`

- Normalized FP format:
`+- 1.bbbbbb * 2^{+- exp}`

- Floating-point numbers are normally stored normalized (the denormalized values described later are the exception).
- The leading `1.` is not stored but assumed, since every normalized number starts with it.

## IEEE 754 Floating Point Formats

### Excess-N Exponent Representation

- Used for the exponent instead of 2's complement: stored value = actual exponent + N.
- Stored exponents are then unsigned, so that comparisons `x < y` are simple.

### Single Precision (32-bit)

`float` in C

- 1 sign bit
- 8 exponent bits
  - range of exponent = -126 to +127
  - value = stored - `127`
- 23 fraction bits
- Equivalent decimal range: about 7 significant digits, magnitudes up to about 10^38
- s(1) | exp(8) | fraction(23)
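The three fields can be pulled out of a `float` by copying its bits into an integer. This sketch assumes `float` is the 32-bit IEEE 754 single format, which holds on mainstream platforms. `float_fields`, `split_float`, and `actual_exponent` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;       /* 1 bit */
    uint32_t stored_exp; /* 8 bits, excess-127 */
    uint32_t fraction;   /* 23 bits */
} float_fields;

/* Splits a float into s(1) | exp(8) | fraction(23). */
static float_fields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* well-defined way to reinterpret the bits */
    float_fields r;
    r.sign = bits >> 31;
    r.stored_exp = (bits >> 23) & 0xFFu;
    r.fraction = bits & 0x7FFFFFu;
    return r;
}

/* Actual exponent of a normalized value: stored - 127. */
static int actual_exponent(float_fields f) {
    return (int)f.stored_exp - 127;
}
```

For example, -6.5 is -1.625 * 2^2, so `split_float(-6.5f)` gives sign 1, stored exponent 129, and fraction `0x500000`.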

### Double Precision (64-bit)

`double` in C

- 1 sign bit
- 11 exponent bits
  - value = stored - 1023
- 52 fraction bits
- Equivalent decimal range: about 16 significant digits, magnitudes from about 10^-308 up to about 10^308
- s(1) | exp(11) | fraction(52)
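The limits of both formats are available from `<float.h>`. The helper below is a small sketch with an illustrative name. Note that `FLT_MANT_DIG` and `DBL_MANT_DIG` count the implicit leading 1, so on IEEE 754 platforms they are 24 and 53 rather than 23 and 52.

```c
#include <float.h>
#include <stdio.h>

/* Prints significand width and magnitude limits for float and double. */
static void print_fp_limits(void) {
    printf("float : %d significand bits, max %e, min normal %e\n",
           FLT_MANT_DIG, FLT_MAX, FLT_MIN);
    printf("double: %d significand bits, max %e, min normal %e\n",
           DBL_MANT_DIG, DBL_MAX, DBL_MIN);
}
```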

## Special Values

`float` doesn't wrap around like `int`: a result too large to represent becomes infinity.
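A short sketch of the difference. Since signed `int` overflow is undefined in C, the wrapping side is shown with `unsigned`, which is guaranteed to wrap. The function names are illustrative, and the `volatile` locals are only there to keep the compiler from folding the arithmetic away.

```c
#include <float.h>
#include <limits.h>
#include <math.h>

/* Doubling the largest float saturates to +infinity instead of wrapping. */
static float float_overflow(void) {
    volatile float big = FLT_MAX;
    return big * 2.0f;
}

/* Adding 1 to the largest unsigned value wraps around to 0. */
static unsigned unsigned_overflow(void) {
    volatile unsigned big = UINT_MAX;
    return big + 1u;
}
```

The float result can be checked with `isinf` from `<math.h>`.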

## Denormalized

`0 00000001 0000..0 is (1.0) * 2^-126 == 2^-126 (norm)`

`0 00000000 1000..0 is (0.1) * 2^-126 == 2^-127 (denorm)`

`0 00000000 0100..0 is (0.01) * 2^-126 == 2^-128 (denorm)`

- Q. What exponent value is used by denormalized 32-bit floating-point numbers?
- A: -126
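The three bit patterns above can be checked by building floats directly from their bits. This sketch assumes `float` is IEEE 754 single precision and that denormals are not flushed to zero; `float_from_bits` is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32-bit pattern as a float. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With it, `float_from_bits(0x00800000u)` is 2^-126 (normalized), while `0x00400000u` and `0x00200000u` give the denormalized values 2^-127 and 2^-128. In C these can be written as the hex float literals `0x1p-126f`, `0x1p-127f`, and `0x1p-128f`.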

## 12-bit "IEEE Short" Format

- 1 sign bit, 5 exponent bits (excess 15), 6 fraction bits
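A decoder for this format might look like the following sketch. It assumes the layout s(1) | exp(5) | fraction(6) in the low 12 bits of a `uint16_t`, and that the format follows the usual IEEE conventions: stored exponent 0 means denormalized with exponent -14, and stored exponent 31 means infinity or NaN. Those conventions and the name `short12_to_double` are assumptions, not stated in the notes.

```c
#include <math.h>
#include <stdint.h>

/* Decodes a 12-bit value: 1 sign bit, 5 exponent bits (excess 15), 6 fraction bits. */
static double short12_to_double(uint16_t bits) {
    int sign = (bits >> 11) & 1;
    int exp  = (bits >> 6) & 0x1F;
    int frac = bits & 0x3F;

    if (exp == 31) { /* assumed IEEE-style special values */
        if (frac != 0) return NAN;
        return sign ? -INFINITY : INFINITY;
    }

    double value;
    int e;
    if (exp == 0) {  /* denormalized: 0.frac * 2^-14 */
        value = frac / 64.0;
        e = -14;
    } else {         /* normalized: 1.frac * 2^(exp - 15) */
        value = 1.0 + frac / 64.0;
        e = exp - 15;
    }
    for (; e > 0; e--) value *= 2.0; /* scale by the power of two */
    for (; e < 0; e++) value /= 2.0;
    return sign ? -value : value;
}
```

For example, `0x3C0` decodes to 1.0 and `0xC10` decodes to -2.5.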

## Rounding

### Round to Nearest, Half to Even

Based on the bits being dropped:

`10...0` : exactly halfway, round to even

`1x...x` (at least one `x` is 1) : round up

`0x...x` : round down
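The three cases can be sketched on plain integers by dropping the low `k` bits. `round_half_even` is an illustrative name, and the sketch assumes `1 <= k <= 31`.

```c
#include <stdint.h>

/* Drops the low k bits of x, rounding to nearest with ties going to even. */
static uint32_t round_half_even(uint32_t x, unsigned k) {
    uint32_t kept    = x >> k;
    uint32_t dropped = x & ((1u << k) - 1u);
    uint32_t half    = 1u << (k - 1);

    if (dropped > half) return kept + 1u; /* 1x...x with some x set: round up */
    if (dropped < half) return kept;      /* 0x...x: round down */
    return kept + (kept & 1u);            /* 10...0: tie, round to even */
}
```

With `k = 2`, the value 10 (binary `10|10`) stays at 2, while 14 (binary `11|10`) rounds up to 4, since both are ties and the result must be even.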

### Round towards 0 (chopping)

## Rounding Implementation

- Guard bits: bits immediately after LSB of fraction
- Round bit: bit to the right of the guard bits
- Sticky bit: logical OR of all remaining bits after the guard and round bits.
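A rough sketch of how these bits could drive round-to-nearest-even. It assumes a single guard bit followed by a single round bit, with everything after that folded into the sticky bit. Real hardware may keep a different number of guard bits, and the names `grs_bits`, `extract_grs`, and `should_round_up` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned guard;
    unsigned round;
    unsigned sticky;
} grs_bits;

/* extra holds the n bits (2 <= n <= 32) to the right of the fraction's LSB, leftmost first. */
static grs_bits extract_grs(uint32_t extra, unsigned n) {
    grs_bits b;
    b.guard  = (extra >> (n - 1)) & 1u;
    b.round  = (extra >> (n - 2)) & 1u;
    b.sticky = (extra & ((1u << (n - 2)) - 1u)) != 0; /* OR of all remaining bits */
    return b;
}

/* Round to nearest even: go up if above half, or exactly half with an odd LSB. */
static int should_round_up(unsigned lsb, grs_bits b) {
    return b.guard && (b.round || b.sticky || lsb);
}
```

With four extra bits `1000` the value is exactly halfway, so the decision depends on the LSB. With `1001` the sticky bit is set and the result always rounds up.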

## FP Addition/Subtraction

**Not associative!!!**

- Add numbers of similar magnitude together, smallest first, so that small values are not lost against a large running sum.
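Two small helpers make the effect visible. This sketch assumes IEEE 754 `double` with the default round-to-nearest mode and no fast-math reassociation by the compiler; the function names are illustrative.

```c
/* Adds left to right: (a + b) + c. */
static double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* Adds right to left: a + (b + c). */
static double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

Under those assumptions, `sum_left(1e20, -1e20, 1.0)` is 1.0 but `sum_right(1e20, -1e20, 1.0)` is 0.0, because the 1.0 is absorbed when it is added to -1e20 first. Likewise `sum_left(1.0, 1.0, 1e16)` keeps both small terms, while `sum_left(1e16, 1.0, 1.0)` loses them one at a time.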

## FP Multiplication/Division

**Not associative - order matters!!!**

- Doesn't distribute over addition