BetterSquareRoot: a fast integer square root routine in 68K assembler

by Robert Purves

A square root function has many uses in programming, for example to
calculate the length of a triangle's hypotenuse. If x and y are the lengths
of the two other sides of the triangle, the hypotenuse is sqrt(z) where
z=x*x + y*y. An article in MacTech Vol 14 #1 (1998) discusses
floating-point square root calculation (in PPC assembler). In graphics and
games programming there is often no need for floating-point calculations,
and in such cases an integer square root function can be substituted, with
considerable speed advantage. Such a function should return r=int(sqrt(z));
r is therefore the largest integer such that r*r <= z.

Nearly all algorithms for calculating a square root make use of iterative
improvement. If r is an approximate value then r=(z/r + r)/2 is a better
approximation. When this formula is applied several times in succession,
the approximation converges towards the true value. Note that the formula
requires a division, which is notably slow on most processors (including
680x0 and PPC). Itn the interests of speed, therefore, every attempt should
be made to provide a good starting value, so that few iterations are needed.

A good general purpose integer square root routine should take as its input
argument an unsigned 32-bit integer, representing any value from 0 to
2^32-1 inclusive (0 to 4294967295). The returned root will be a 16-bit
value in the range 0-65535. By providing a starting value for the root with
the 4 most significant bits correct, we can attain 16-bit accuracy with
only two applications of the formula above. Unfortunately, in a few cases
the result is large by 1 or 2, and so we must test for this outcome and
correct it.

An outline of the routine BetterSquareRoot:
1. Get a starting value r for the formula.
a. IF z<4096 THEN use a quick method to halve the number of bits in z.
b. ELSE use a more complicated bit twiddler.
2. Apply the formula twice.
3. If necessary, correct for result-too-large by subtracting 1.

On an iMac, BetterSquareRoot averages out at a very respectable 1.5
microseconds per call, when timed for values of z from 1 to 100 million.
This may be compared with the toolbox routine _FracSqrt, which admittedly
does a somewhat different calculation but takes an appallingly long 21
microseconds. Is BetterSquareRoot the last word in speed for a 68K routine?
Surely not. The code was optimized for execution on a real 680x0 processor,
and the relative timing of some opcodes is very different in the emulator
on a PowerMac. For example, btst d1,d0 and bset d1,d0 are extremely slow in
emulation, and it would be faster if these opcodes could be removed by
rewriting.

Finally, for the ultimate in speed, consider setting up a table of squares
with 65536 entries (0, 1, 4, 9, 16, 25 ... 4294836225). To find the square
root of a number z, search the table by binary subdivision to find the
largest entry <=z. The index of that entry is r. This technique may not
have wide appeal, because it requires 256KB storage. In contrast,
BetterSquareRoot consists of only 104 bytes of code.

Fantasm 68K listing of BetterSquareRoot:

******************************************************************
* title BetterSquareRoot *
* input "d0 unsigned long integer (0-4294967295)" *
* input "destroys d0,d1,d2" *
* input "does not work on 68000; OK on 60020,030,040, emulator" *
* output "d0 largest integer r such that r*r <= input argument" *
* output "d2 is a copy of the original input argument in d0 " *
******************************************************************
BetterSquareRoot:
move.l d0,d2 ; copy of input z; d0 will become root
beq.s SqDoneN ; skip if 0
cmpi.l #4095,d0
bhi.s SqAccApprox ; bigger values need better approx

SqRoughLoop: ; fast halve number of bits in z
move.l d0,d1
lsr.l #1,d0
lsr.l #2,d1
bne.s SqRoughLoop ; loop until double-shifted d1 is 0
addq.l #1,d0 ; force non-zero
bra.s SqIterate

SqAccApprox:
BFFFO D0{$0:$0},D1 ; BitFieldFindFirstOne (in d0, result in d1)
neg.w d1
addi.w #32,d1 ; 32 - BFFFO result
lsr.w #1,d1 ; numBits/2
bcs.s SqOddNumBits ; branch if numBits is odd

SqEvenNumBits:
lsr.l d1,d0 ; shift out half the bits
subq.w #2,d1
btst d1,d0
bne.s SqIterate ; branch if 2nd msb is 1
subq.w #1,d1
bset d1,d0 ; set the 3rd msb
subq.w #1,d1
bset d1,d0 ; set the 4rd msb
bra.s SqIterate

SqOddNumBits:
lsr.l d1,d0 ; shift out half the bits
subq.w #1,d1
bclr d1,d0 ; test and clear 2nd msb
beq.s SqIterate ; branch if 2nd msb was 0
subq.w #1,d1
bset d1,d0 ; set the 3rd msb

SqIterate: ; iterate r=(z/r + r)/2 twice
move.l d2,d1 ; dividend = z
divu.l d0,d1 ; z/r
add.l d1,d0 ; z/r + r
lsr.l #1,d0 ; shift for divide by 2; new r

move.l d2,d1
divu.l d0,d1
add.l d1,d0
lsr.l #1,d0

SqTestBig:
move.l d0,d1
mulu.w d1,d1 ; r*r
cmp.l d2,d1 ; compare with input
bls.s SqDoneN ; unsigned comparison
subq.l #1,d0 ; adjust to prevent "hi-by-one" error
bra.s SqTestBig ; repeat until OK
SqDoneN:
rts

[68K/PPC Articles | FTP | Main]