I am looking at another division algorithm entirely that doesn't always do 16 iterations but I'm not sure how much of an improvement that would be.
; retVal = regArg1/regArg2 (remainder in regArg1)
;adapted from ted rossin's work and a 6502 example oct 7, 2013
;now put the divisor in a work area on the stack and point regarg2 at it as X
glo regarg2 ;divisor
glo sp ;now transfer the work area address to regarg2
sex regarg2 ;and make regarg2 the X register
glo regArg1 ;shift dividend and result left one
glo retVal ;shift the carry bit into the remainder
;now see if the divisor fits in the remainder
inc regarg2 ;point back at low byte
phi rwork ;save partial subtraction result in case we need it
dec regarg2 ;point to high byte of divisor
smb ;subtract it from the remainder
bnf $$divskip ;didn't fit so carry on shifting unless we're done
phi retval ;first, save the top byte of the new remainder
ghi rwork ;now get the low byte back
plo retval ;put it in the remainder
inc regarg1 ;increment the result which is growing in regarg1
dec rwork ;now see if we've done the 16 bits
bnz $$DivLoop16_162b ;back for more if necessary
inc sp ;all done so restore the stack and shuffle some registers
sex sp ;make sure X is sp
;here I have the quotient in regArg1 and remainder in retVal
;return with quotient in retVal and remainder in regArg2
Date: Sat, 5 Oct 2013 17:04:47 -0400
Subject: [cosmacelf] The Rhinestone Compiler Passes 22 Dhrystones/Sec
By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.
The biggest improvements relate to support routines rather than straightline code.
I had already tweaked the multiply routine to exit more quickly for small multipliers. Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds. For example:
;Int_3_Loc = 5 * Int_1_Loc
ld2 R13,'O',sp,(84) [loads the local variable into Reg 13]
cpy2 R15,R13 [so R15=variable*1]
shl2I R13,1 [R13=variable*2]
shl2I R13,1 [R13=variable*4]
alu2 R15,R15,R13,add,adc [now R15=variable*5]
There's a penalty of code size, of course, this is a good 20 bytes bigger than the original. Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger. Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones. In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.
I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.
A good case for the peephole is something like the following
shrU2I %1,8 [compiler wants to shift an int right 8 bits]
<is replaced with>
ghi %1 [move top byte to low and zero the top]
There are a bunch of things like that but they just don't show up that often.