| |
BetterSquareRoot: a fast integer square root routine in 68K
assembler
A square root function has many uses in programming, for example
to
calculate the length of a triangle's hypotenuse. If x and y are
the lengths
of the two other sides of the triangle, the hypotenuse is sqrt(z)
where
z=x*x + y*y. An article in MacTech Vol 14 #1 (1998) discusses
floating-point square root calculation (in PPC assembler). In
graphics and
games programming there is often no need for floating-point calculations,
and in such cases an integer square root function can be substituted,
with
considerable speed advantage. Such a function should return r=int(sqrt(z));
r is therefore the largest integer such that r*r <= z.
Nearly all algorithms for calculating a square root make use
of iterative
improvement. If r is an approximate value then r=(z/r + r)/2
is a better
approximation. When this formula is applied several times in
succession,
the approximation converges towards the true value. Note that
the formula
requires a division, which is notably slow on most processors
(including
680x0 and PPC). Itn the interests of speed, therefore, every
attempt should
be made to provide a good starting value, so that few iterations
are needed.
A good general purpose integer square root routine should
take as its input
argument an unsigned 32-bit integer, representing any value from
0 to
2^32-1 inclusive (0 to 4294967295). The returned root will be
a 16-bit
value in the range 0-65535. By providing a starting value for
the root with
the 4 most significant bits correct, we can attain 16-bit accuracy
with
only two applications of the formula above. Unfortunately, in
a few cases
the result is large by 1 or 2, and so we must test for this outcome
and
correct it.
An outline of the routine BetterSquareRoot:
1. Get a starting value r for the formula.
a. IF z<4096 THEN use a quick method to halve the number of
bits in z.
b. ELSE use a more complicated bit twiddler.
2. Apply the formula twice.
3. If necessary, correct for result-too-large by subtracting
1.
On an iMac, BetterSquareRoot averages out at a very respectable
1.5
microseconds per call, when timed for values of z from 1 to 100
million.
This may be compared with the toolbox routine _FracSqrt, which
admittedly
does a somewhat different calculation but takes an appallingly
long 21
microseconds. Is BetterSquareRoot the last word in speed for
a 68K routine?
Surely not. The code was optimized for execution on a real 680x0
processor,
and the relative timing of some opcodes is very different in
the emulator
on a PowerMac. For example, btst d1,d0 and bset d1,d0 are extremely
slow in
emulation, and it would be faster if these opcodes could be removed
by
rewriting.
Finally, for the ultimate in speed, consider setting up a
table of squares
with 65536 entries (0, 1, 4, 9, 16, 25 ... 4294836225). To find
the square
root of a number z, search the table by binary subdivision to
find the
largest entry <=z. The index of that entry is r. This technique
may not
have wide appeal, because it requires 256KB storage. In contrast,
BetterSquareRoot consists of only 104 bytes of code.
Fantasm 68K listing of BetterSquareRoot:
******************************************************************
* title BetterSquareRoot *
* input "d0 unsigned long integer (0-4294967295)" *
* input "destroys d0,d1,d2" *
* input "does not work on 68000; OK on 60020,030,040, emulator"
*
* output "d0 largest integer r such that r*r <= input
argument" *
* output "d2 is a copy of the original input argument in
d0 " *
******************************************************************
BetterSquareRoot:
move.l d0,d2 ; copy of input z; d0 will become root
beq.s SqDoneN ; skip if 0
cmpi.l #4095,d0
bhi.s SqAccApprox ; bigger values need better approx
SqRoughLoop: ; fast halve number of bits in z
move.l d0,d1
lsr.l #1,d0
lsr.l #2,d1
bne.s SqRoughLoop ; loop until double-shifted d1 is 0
addq.l #1,d0 ; force non-zero
bra.s SqIterate
SqAccApprox:
BFFFO D0{$0:$0},D1 ; BitFieldFindFirstOne (in d0, result in d1)
neg.w d1
addi.w #32,d1 ; 32 - BFFFO result
lsr.w #1,d1 ; numBits/2
bcs.s SqOddNumBits ; branch if numBits is odd
SqEvenNumBits:
lsr.l d1,d0 ; shift out half the bits
subq.w #2,d1
btst d1,d0
bne.s SqIterate ; branch if 2nd msb is 1
subq.w #1,d1
bset d1,d0 ; set the 3rd msb
subq.w #1,d1
bset d1,d0 ; set the 4rd msb
bra.s SqIterate
SqOddNumBits:
lsr.l d1,d0 ; shift out half the bits
subq.w #1,d1
bclr d1,d0 ; test and clear 2nd msb
beq.s SqIterate ; branch if 2nd msb was 0
subq.w #1,d1
bset d1,d0 ; set the 3rd msb
SqIterate: ; iterate r=(z/r + r)/2 twice
move.l d2,d1 ; dividend = z
divu.l d0,d1 ; z/r
add.l d1,d0 ; z/r + r
lsr.l #1,d0 ; shift for divide by 2; new r
move.l d2,d1
divu.l d0,d1
add.l d1,d0
lsr.l #1,d0
SqTestBig:
move.l d0,d1
mulu.w d1,d1 ; r*r
cmp.l d2,d1 ; compare with input
bls.s SqDoneN ; unsigned comparison
subq.l #1,d0 ; adjust to prevent "hi-by-one" error
bra.s SqTestBig ; repeat until OK
SqDoneN:
rts
|