The magical world of PowerPC floating points
This is a basic introduction to PPC floats. I'll assume you know about the 32 integer registers that all have 32 bits in them (yeah, they are sexy). And I'll also assume you know about a few of the basic integer and branch instructions.
There are 32 registers together with a large chunk of processing silicon that most programmers of the PowerPC are not currently using. Even people who write compiler programs are probably not using them. In fact it takes an entirely different way of programming. Let me explain.
In the past, most programmers could not assume that the floating point unit was part of the standard architecture. Even accepting this, most of the time when you did want to use it, it was a really strange bit of silicon that you ended up waiting for. So, few programmers who wanted really fast stuff used it.
Now, at last, the PowerPC supplies it as a standard part of the processor environment. This means that to use it you have to treat floating-point numbers as useful entities in your code.
The big bummer about this is portability (both in assembler and high level languages) of 'high speed' code. To optimise your code you must choose your floating point usage very carefully, and hence it won't convert to other processors too well. Take the 68040 - since most '040s are LCs (i.e. no co-processor), you would have to redesign the algorithm. But then again you want the fastest code, don't you? And the extra free cycles the PowerPC floating unit can provide are just a little too tempting...
The floating point unit has thirty-two 64 bit registers. These can contain either single-precision (32 bit) or double-precision (64 bit) floating-point values. BIG NOTE: The actual floating point register ALWAYS contains floating-point values in double-precision format. When it is said to contain single-precision values, this means rounding has been done to the double-precision number in some way. When using single-precision instructions, conversions (rounding/expansions) are automatically performed at the start and end of the operation. However, if you put double-precision data through a single-precision instruction - watch out - garbage may result!
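To make the "single-precision value held in double-precision format" idea concrete, here is a minimal sketch in Python (not PowerPC code). It assumes Python floats are IEEE doubles, and `round_to_single` is a made-up helper name that only mimics the rounding a single-precision instruction applies to its result.

```python
import struct

def round_to_single(x: float) -> float:
    """Round a double to the nearest single-precision value.

    The result is handed back as a double, much like a register that
    "contains a single-precision value" still holds double format.
    """
    return struct.unpack(">f", struct.pack(">f", x))[0]

tenth = round_to_single(0.1)   # 0.1 is not exact in either format
half = round_to_single(0.5)    # 0.5 is exact in both formats

# The rounded 0.1 is a different double from the original 0.1,
# while 0.5 survives the trip unchanged.
print(tenth == 0.1, half == 0.5)
```

Rounding a value that has already been rounded changes nothing, which is why single-precision results can safely be fed back into single-precision instructions.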
The actual registers are just binary registers - like the integer registers. In fact you can store standard integer type data within them. The difference lies in the instructions you may use to access them. Since you cannot perform any integer arithmetic with them, only floating point, storing integer data within them is pretty limited. You can, however, use them to transfer 64 bit data. Don't be fooled however - most of the work for this type of transfer is done by the integer unit, and you only gain slightly in terms of speed.
Why use single-precision? Well, it takes up only four bytes of storage rather than eight, and on most processors it is faster in the floating-point unit. Most of the time the extra precision isn't needed. (When dealing with single-precision data, the 's' suffix must be appended to floating-point instructions.)
In memory, we deal with the single and double-precision formats directly. In this section we are looking at basically predefined data, but the principles are, of course, the same for data stored from floating-point registers. (I'll use the shorthand s.p. for single-precision and d.p. for double-precision.) There are three parts to both s.p. and d.p. formats:
The sign bit, which is 1 for negative, and 0 for positive.
The exponent bits, which give the magnitude of the number. For s.p. this is 8 bits, for d.p. 11 bits.
The magnitude (fraction) bits, which give the actual number (which will be scaled by the exponent). For s.p. this is 23 bits, and for d.p. this is 52 bits.
In the nitty-gritty of the system, the exponent is always stored offset from a centre number. This is known as a biased exponent. For s.p. the bias is 127, for d.p. it is 1023. So, in s.p., a stored exponent of 126 means an exponent of -1. The magnitude of a normalised number (more about denormalised numbers later) is always converted to a 1.fraction format, and the shifting needed to achieve this obviously modifies the exponent.
The equation that defines these relationships (where exponent is the stored, biased value) is:
(-1)^sign . 1.fraction . 2^(exponent - bias)
Hence to encode the number 3.141592654...
You must first convert this into binary. This is best done in two parts: the integer and the fraction.
3 converts into 11
0.141592654 converts into 0.001001000011111101101 (ish) (you convert by multiplying by 2, or writing a basic program).
So... 11.001001000011111101101 is 3.141592...ish
This must be shifted down one place to get the 1.fraction form (hence increasing the exponent by one).
So... the sign bit is 0, the exponent is 128 and the fraction is 0.1001001000011111101101 (missing the first one off).
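This becomes 0 10000000 10010010000111111011010... for s.p. format. (The spare bits have simply been chopped off here; rounding to nearest would set the last bit, giving 0x40490fdb.)
dc.w 0x40490fda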
Phew! Got it? If not, I'll put the basic program I use on the web site...
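In the meantime, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, it handles positive values of 1 or more, and it truncates the spare bits just as the hand calculation does.

```python
import struct

def fraction_bits(frac: float, count: int) -> str:
    """Binary digits of 0 <= frac < 1, found by repeated multiplication by 2."""
    digits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)            # the integer part is the next binary digit
        digits.append(str(bit))
        frac -= bit
    return "".join(digits)

def encode_single_truncated(value: float) -> int:
    """Pack a positive value >= 1 into s.p. fields, truncating spare bits."""
    whole = int(value)
    int_bits = bin(whole)[2:]                  # e.g. 3 -> "11"
    all_bits = int_bits + fraction_bits(value - whole, 23)
    exponent = 127 + len(int_bits) - 1         # each shift down adds one
    fraction = all_bits[1:24]                  # drop the implied leading 1
    sign = 0
    return (sign << 31) | (exponent << 23) | int(fraction, 2)

by_hand = encode_single_truncated(3.141592654)
# For comparison, let the machine round to nearest instead of truncating.
rounded = struct.unpack(">I", struct.pack(">f", 3.141592654))[0]
print(hex(by_hand), hex(rounded))
```

The two results differ only in the last bit, which is the difference between chopping the spare bits off and rounding them.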
All the floating point arithmetic is based upon a multiply-add operation. With this, you can add, multiply, subtract, divide, take the absolute value, negate and compare. You can also store and load, as well as round and convert.
The basic instructions are: fabs (floating-point absolute value), fcmpo (floating-point compare ordered), fcmpu (floating-point compare unordered), fmr (floating-point move register), fnabs (floating-point negative absolute value), fneg (floating-point negate), frsp (floating round to single precision).
The specific double-precision instructions are: fadd (floating-point add), fdiv (floating-point divide), fmadd (floating-point multiply-add), fmsub, fmul, fnmadd (floating-point negative multiply-add), fnmsub, fsub.
The equivalent single precision instructions are: fadds, fdivs, fmadds, fmsubs, fmuls, fnmadds, fnmsubs, fsubs.
There are also the conversion instructions: fctiw (floating-point convert to integer word), fctiwz (as previous, with round toward zero).
I won't cover the store floating (stf??) and load floating (lf??) instructions, as these are very close to their integer counterparts.
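Since everything hangs off the multiply-add, here is a rough Python model of the four multiply-add forms. It assumes finite inputs only, ignores signed zeros, infinities and NaNs, and uses exact fractions so that the result is rounded just once at the end, which is the point of the hardware operation. The function names simply borrow the mnemonics; the argument order (a, c, b) follows frA, frC, frB.

```python
from fractions import Fraction

def fmadd(a: float, c: float, b: float) -> float:
    """a*c + b computed exactly, then rounded once to double."""
    return float(Fraction(a) * Fraction(c) + Fraction(b))

def fmsub(a: float, c: float, b: float) -> float:
    """a*c - b computed exactly, then rounded once to double."""
    return float(Fraction(a) * Fraction(c) - Fraction(b))

def fnmadd(a: float, c: float, b: float) -> float:
    """-(a*c + b)"""
    return -fmadd(a, c, b)

def fnmsub(a: float, c: float, b: float) -> float:
    """-(a*c - b)"""
    return -fmsub(a, c, b)

# Plain add and multiply fall out as special cases.
def fadd(a: float, b: float) -> float:
    return fmadd(a, 1.0, b)

def fmul(a: float, c: float) -> float:
    return fmadd(a, c, 0.0)

# Rounding once can give a different answer from multiply-then-add.
a = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
b = -1.0
print(fmadd(a, c, b), a * c + b)
```

In the last line the separate multiply rounds the product up to exactly 1.0 and the tiny remainder is lost, whereas the single-rounding form keeps it.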
Normalised and Denormalised Numbers
We have covered normalised numbers above in section 2. To recap, these are numbers that have an intrinsic 1 before the fraction, and this is scaled by the exponent to produce the result. There is, however, a smallest number that can be represented by this format - i.e. 1 x 2^-126 in s.p. values.
You can, however, represent smaller numbers by removing the need for an intrinsic '1' in front of the fraction. These are 'Denormalised numbers'. The format is...
(-1)^sign . 0.fraction . 2^(1-bias).
Normal (very small) Denormalised numbers have a biased exponent of 0. Since the format of floating-point numbers (within registers) cannot support larger 'denormalised numbers' (i.e. without an intrinsic 1.fraction format), this is always the case.
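The two formats can be put side by side in a short Python sketch. This is only an illustration for s.p. bit patterns; `decode_single` is a made-up name, and it deliberately refuses the maximum biased exponent (255), which is reserved for the special values.

```python
BIAS = 127  # single-precision bias

def decode_single(bits: int) -> float:
    """Value of a 32-bit s.p. pattern that is normalised, denormalised or zero."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2 ** 23      # the ".fraction" part
    if exponent == 0xFF:
        raise ValueError("maximum exponent: not an ordinary number")
    if exponent == 0:
        # Denormalised: 0.fraction with the exponent fixed at 1 - bias.
        magnitude = fraction * 2.0 ** (1 - BIAS)
    else:
        # Normalised: 1.fraction scaled by the unbiased exponent.
        magnitude = (1 + fraction) * 2.0 ** (exponent - BIAS)
    return -magnitude if sign else magnitude

smallest_normal = decode_single(0x00800000)     # 1.0 x 2^-126
largest_denormal = decode_single(0x007FFFFF)    # just below 2^-126
smallest_denormal = decode_single(0x00000001)   # 2^-149
print(smallest_normal, largest_denormal, smallest_denormal)
```

Note how the denormalised range carries on smoothly from where the normalised numbers stop.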
Infinities
These are values that have the maximum allowable biased exponent (255 in s.p. format, 2047 in d.p. format) and a zero fraction value. They are simply used to represent numbers greater than can be produced with normalised numbers. You can have both positive and negative infinities.
Arithmetic on infinities is always exact, and does not signal any exceptions, except when an invalid operation occurs, such as subtraction of infinities, division of infinity by infinity, or multiplication of infinity by zero. Please consult your textbooks(!)
Zero
A zero has a biased exponent of zero, and a fraction of zero. In this way, you may think of it as an extension of the denormalised numbers in some respects.
Zero can be both positive and negative, but this has no effect on operations such as compare, which regards +0 as equal to -0.
Note, division of zero by zero will cause an invalid operation exception.
Not a Number (NaNs)
These are represented by a maximum biased exponent, and a non-zero fraction value. (Compare this with the definition for 'Infinities').
These are basically ways of representing special cases where values are not valid numbers.
There are two types of NaNs: Signalling NaNs and Quiet NaNs. Quiet NaNs will silently propagate through most floating point operations. Signalling NaNs are never created by floating-point operations, and must be created manually, for instance, to identify uninitialised memory.
The high-order bit of the fraction part of a NaN signals whether it is Quiet (set to 1) or Signalling (set to 0).
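Pulling the special cases together, a small Python sketch can sort any s.p. bit pattern into its class. It assumes the usual IEEE 754 convention, which the PowerPC follows, that a set high-order fraction bit marks a quiet NaN; `classify_single` is a made-up name.

```python
def classify_single(bits: int) -> str:
    """Name the class of a 32-bit single-precision bit pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                       # maximum biased exponent
        if fraction == 0:
            return "infinity"
        # High-order fraction bit set -> quiet, clear -> signalling.
        return "quiet NaN" if fraction & 0x400000 else "signalling NaN"
    if exponent == 0:                          # minimum biased exponent
        return "zero" if fraction == 0 else "denormalised"
    return "normalised"

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x3F800000,
                0x7F800000, 0xFF800000, 0x7FC00000, 0x7F800001):
    print(f"{pattern:08x} {classify_single(pattern)}")
```

The sign bit plays no part in the classification, so +0 and -0 (and both infinities) land in the same class.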
Invalid Operations and Exceptions
I won't say a lot about these, apart from to repeat what I said above (Please consult your textbooks) and to say that not all floating exceptions cause the typical 'call the system' response, and you do have some control over what happens. See the contents of the FPSCR.
Rounding
Just a short section to say you can control how numbers are rounded in four ways: Round to Nearest, Round toward Zero, Round toward +Infinity, Round toward -Infinity. You do this with part of the FPSCR.
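To see how the four modes differ, here is a Python sketch that rounds an exact value to a multiple of some step (think of the step as one unit in the last place). It does not touch the FPSCR or any hardware setting; `round_to_multiple` and the mode names are invented for the illustration.

```python
import math
from fractions import Fraction

def round_to_multiple(x: Fraction, step: Fraction, mode: str) -> Fraction:
    """Round x to a multiple of step under one of the four rounding modes."""
    q = x / step
    if mode == "nearest":
        n = round(q)          # exact halves go to the even neighbour
    elif mode == "zero":
        n = math.trunc(q)     # chop toward zero
    elif mode == "+inf":
        n = math.ceil(q)      # always round upward
    elif mode == "-inf":
        n = math.floor(q)     # always round downward
    else:
        raise ValueError(mode)
    return n * step

x = Fraction(-5, 2)           # -2.5, exactly halfway between -3 and -2
for mode in ("nearest", "zero", "+inf", "-inf"):
    print(mode, round_to_multiple(x, Fraction(1), mode))
```

Only Round toward -Infinity moves this value to -3; the other three modes all give -2, each for a different reason.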
Interweaving
To get the greatest throughput of the processor, obviously you must use all the different units at once - this basically means the branch unit, the integer unit and the floating unit. This is not a problem with the branch unit, but takes some planning with the integer and floating unit. This isn't helped with the following...
Int to Float Conversion and Back Again
There are two instructions to convert from double-precision to integer word: fctiw (floating convert to integer word) and fctiwz (floating convert to integer word, round toward zero (truncate) mode). They store the result in a floating-point register, hence you must go through memory to get it into an integer register...
    transfer:   dc.w 0,0                    ; spare storage
    ; previously
                lwz r8,transfer(rtoc)
    ; the actual conversion
                fctiw f0,f1                 ; convert
                stfd f0,(r8)                ; store
                lwz r3,4(r8)                ; load into integer register
This is the conversion from integer to float...
    conv_store: dc.w 0x43300000,0x0         ; temporary
                dc.w 0x43300000,0x80000000  ; this is zero
    ; previously
                lwz r8,conv_store(rtoc)
    ; the actual conversion
                xoris r3,r3,0x8000          ; invert sign of int
                stw r3,4(r8)                ; low word of the temporary
                lfd f0,(r8)
                lfd f1,8(r8)
                fsub f0,f0,f1
From this (and when you see the timings below), it may look like a lot of work to convert from one to the other and back again - and, yes, it is. The best way is to avoid as many conversions as possible in real-time code.
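Why the integer-to-float trick works is easier to see when the same steps are replayed in Python. This is only a model of the bit manipulation, with a made-up function name: a double whose high word is 0x43300000 has the value 2^52 plus its low word, so the second constant has to be 0x43300000 followed by 0x80000000 for the subtraction to cancel both the 2^52 and the flipped sign bit.

```python
import struct

MAGIC_HI = 0x43300000   # high word: the double's value is 2**52 + low word

def int_to_double(x: int) -> float:
    """Convert a signed 32-bit integer using the same steps as the assembler."""
    low = (x & 0xFFFFFFFF) ^ 0x80000000     # xoris: flip the sign bit
    # stw + lfd: drop the word into the low half of the temporary double.
    biased = struct.unpack(">d", struct.pack(">II", MAGIC_HI, low))[0]
    # The constant that "is zero": 2**52 + 2**31.
    magic = struct.unpack(">d", struct.pack(">II", MAGIC_HI, 0x80000000))[0]
    return biased - magic                   # fsub: what is left is x

for value in (-5, 0, 123456):
    print(value, int_to_double(value))
```

Flipping the sign bit turns the signed range into 0 to 2^32 - 1 (that is, x + 2^31), which fits cleanly in the low word; the subtraction then takes the 2^31 back off again.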
These timings are for the 601, but they give some idea of the savings. Please note these are the longest times each instruction spends in one unit, hence roughly how long it takes to execute. Things aren't really this simple (instructions occupy multiple stages in the pipe for longer than this), but for comparison this is pretty good. Please note that each figure is one more than the stall. So, fmul actually stalls the next instruction by 1 cycle.
| Integer | Longest Time (cycles) | S.P. | Longest Time (cycles) | D.P. | Longest Time (cycles) |
| mulli | 5 | | | | |
| addw | 1 | fadds | 1 | fadd | 1 |
| subw | 1 | fsubs | 1 | fsub | 1 |
| mullw (a/b) | 5/9 | fmuls | 1 | fmul | 2 |
| mulhwu | 5/9/10 | | | | |
| divw | 36 | fdivs | 17 | fdiv | 31 |
| | | fmadds | 1 | fmadd | 2 |
| General | 1 | General | 1 | General | 1 |
| store/load (c) | 1 | store/load (d) | 1 | store/load (d) | 1 |
NOTES: (a=high 16 bits of rB are all sign bits, b=other data)
c=not multiple!, d=uses integer unit!
This has only been a flying summary. Perhaps if there is more interest then I will elaborate with some actual code and algorithms. In my humble opinion, if either (a) you are multiplying, (b) you are dealing with non-precise data, (c) you require more speed, or (d) you could easily insert floating values instead of integer ones for a limited subset of instructions - then you should seriously look at floating point code within your program.
RP10/96