On 10/08/2013 10:26 AM, bill rowe wrote:
> The Rhinestone Edition of LCC1802 has now soared past the 23
> Dhrystones/second mark with a score of 23.75. The most recent big
> improvement came from replacing the 16 bit division routine. I've been
> using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I
> never *really* understood the algorithms. I had a hard hard look at the
> division routine and re-did it to make better use of the 1802. The
> routine repeatedly subtracts the divisor from a shifted dividend so I
> was able to leave the divisor in storage rather than repeatedly sourcing
> it from a register and storing it each time - the inspiration came from
> looking at a 6502 implementation. Division is still a painful 399
> instructions excluding call/return overhead but that's down from 588 the
> old way. There's still some register shuffling at the end that can
> maybe be cleaned up but this is a major win.
This tickled a memory so I dug into the archive and found the code for
32 bit non-restoring division that I sent you.
After hacking it down to 16 bits I compared it to your code.
The shift is effectively identical:
dloop: inc TEMPP ; Adjust index to point at LSB
glo Quo ; shift quotient left
ldn TEMPP ; shift remainder
inc TEMPP ; back to LSB
as is the subtraction:
glo Div ; do subtraction
The difference is that this is a non-restoring algorithm. Instead of
performing the subtraction and then backing out of it, the NR algorithm
performs either an addition or a subtraction at each step. The code for
glo Quo ; check lsb
Plus an additional branch at the end of the subtraction code.
Add three more instructions for the loop counter and that is 24
instructions/loop if a subtraction is performed (because of the extra
branch) or 23 if addition is used. Call it 23.5. 16 loops gives 376
instructions. Plus some more at the end to shift in the last bit of the
quotient (6) and to correct the remainder as required. 3 to check plus 6
more to correct if needed. For a total of 388 instructions on average.
Your code uses 23 instructions/loop if the restore can be skipped and 27
if it can't. Call it 25/loop * 16 is 400.
So the non-restoring version is slightly faster.
After looking at the two versions for a bit I think I can shave another
instruction per loop off that by reversing the subtract operation. Thus
saving one instruction to adjust the pointer. Make that 372 instructions
David W. Schultz
Returned for Regrooving