Loading ...
Sorry, an error occurred while loading the content.

14491RE: [cosmacelf] 23 and counting with much improved division

Expand Messages
  • bill rowe
    Oct 11, 2013
      Thanks very much David.  I had not thought to look at that.

      I am currently experimenting with yet another division algorithm that is quicker for smaller operands but worse for larger numbers of bits(and negative numbers).  For the division in dhrystone the operands are generally both less than ten and this cuts the calculation down to under 100 instructions(!).  It's a bit cheat-y but the worst cases are under 450 instructions so I may leave this as the default algorithm.  It's a bit messy to read my code but the paper ref'd in the credits is clear. Interestingly, I did not try to leave any of the operands on the stack since they all change during the calculation but I note that your non-restoring algorithm does leave one on the stack even though it gets shifted.  I may see if that would help here.

      The multiplication algorithm follows and again it's at its best with small positive numbers which is pretty real-world friendly.  It's much simpler and shorter than the division but the source code link is given for reference:

      ;this is a divisor shifting algorithm which is faster for smaller operands
      ;credit to http://research.microsoft.com/pubs/70645/tr-2008-141.pdf
      _divu2: ;retval=regarg1(x or dividend)/regarg2(y or divisor)
      ;uses R14(rwork) as temp, R10 to hold remainder, R11 to save original divisor y0
      testsub: macro reg1,reg2 ;test subtraction of reg2 from reg1
      ;result in D, rwork.hi, DF
      glo reg2
      str sp
      glo reg1
      phi rwork
      ghi reg2
      str sp
      ghi reg1
      cpy2 R10,regarg1
      ld2z retval
      cpy2 R11,regarg2
      dec sp
      testsub regarg1,regarg2
      bnf $$computequot ;DF=0 means it didn't fit
      phi regarg1 ;regarg1=regarg1-regarg2
      ghi rwork
      plo regarg1
      $$again: ;this is the divisor doubling phase
        testsub regarg1,regarg2
        bnf $$computequot ;df=0 means it didn't fit
      phi regarg1 ;regarg1=regarg1-regarg2
      ghi rwork
      plo regarg1 
        shl2 regarg2 ;y=y+y
        br $$again
       $$computequot: ;here we're computing the quotient
        testsub R10,regarg2
        bnf $$testexit
        phi R10 ;complete the subtraction
        ghi rwork
        plo R10
        inc retval
        ghi R11
        sm ;top of regarg2 is still on stack
        bnz $$ney0y
        glo regarg2
        str sp
        glo R11
        sm ;test low order bytes
        bz $$out ;if = we're done
        shl2 retval ;double quotient
        shrU2 regarg2 ;halve divisor
        br $$computequot ;continue
        inc sp ;release work area
       ;here the quotient is in retval, remainder in R10
        cpy2 regarg2,R10
      cretn ;and we're done

      ;16 bit right shifting multiply which is faster for smaller operands
      ; credit to http://map.grauw.nl/articles/mult_div_shifts.php#lrmultr
      _mulU2: ;retval(product)=regarg1(multiplicand)*regarg2(multiplier)
      ld2z retval
      shru2 regarg2
      bnf $$mulrnoadd
      ; bottom bit of multiplier is one so add multiplicand to product
      alu2 retval,retval,regarg1,add,adc
      ghi regarg2 ;check for all bits of multiplier shifted out
      bnz $$mulrshft ;nope, continue
      glo regarg2 ;check bottom byte
      bz $$mulrdone
      shl2 regarg1 ;shift multiplicand left 1
      br $$mulrlp
      $$mulrdone: ;here the product is in retval

      > To: cosmacelf@yahoogroups.com
      > From: david.schultz@...
      > Date: Thu, 10 Oct 2013 19:01:51 -0500
      > Subject: Re: [cosmacelf] 23 and counting with much improved division
      > On 10/08/2013 10:26 AM, bill rowe wrote:
      > > The Rhinestone Edition of LCC1802 has now soared past the 23
      > > Dhrystones/second mark with a score of 23.75. The most recent big
      > > improvement came from replacing the 16 bit division routine. I've been
      > > using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I
      > > never *really* understood the algorithms. I had a hard hard look at the
      > > division routine and re-did it to make better use of the 1802. The
      > > routine repeatedly subtracts the divisor from a shifted dividend so I
      > > was able to leave the divisor in storage rather than repeatedly sourcing
      > > it from a register and storing it each time - the inspiration came from
      > > looking at a 6502 implementation. Division is still a painful 399
      > > instructions excluding call/return overhead but that's down from 588 the
      > > old way. There's still some register shuffling at the end that can
      > > maybe be cleaned up but this is a major win.
      > This tickled a memory so I dug into the archive and found the code for
      > 32 bit non-restoring division that I sent you.
      > After hacking it down to 16 bits I compared it to your code.
      > The shift is effectively identical:
      > dloop: inc TEMPP ; Adjust index to point at LSB
      > glo Quo ; shift quotient left
      > shlc
      > plo Quo
      > ghi Quo
      > shlc
      > phi Quo
      > ldn TEMPP ; shift remainder
      > shlc
      > stxd
      > ldn TEMPP
      > shlc
      > str TEMPP
      > inc TEMPP ; back to LSB
      > as is the subtraction:
      > glo Div ; do subtraction
      > sd
      > stxd
      > ghi Div
      > sdb
      > str TEMPP
      > The difference is that this is a non-restoring algorithm. Instead of
      > performing the subtraction and then backing out of it, the NR algorithm
      > performs either an addition or a subtraction at each step. The code for
      > this is:
      > glo Quo ; check lsb
      > ani 1
      > bdf doadd
      > Plus an additional branch at the end of the subtraction code.
      > Add three more instructions for the loop counter and that is 24
      > instructions/loop if a subtraction is performed (because of the extra
      > branch) or 23 if addition is used. Call it 23.5. 16 loops gives 376
      > instructions. Plus some more at the end to shift in the last bit of the
      > quotient (6) and to correct the remainder as required. 3 to check plus 6
      > more to correct if needed. For a total of 388 instructions on average.
      > Your code uses 23 instructions/loop if the restore can be skipped and 27
      > if it can't. Call it 25/loop * 16 is 400.
      > So the non-restoring version is slightly faster.
      > After looking at the two versions for a bit I think I can shave another
      > instruction per loop off that by reversing the subtract operation. Thus
      > saving one instruction to adjust the pointer. Make that 372 instructions
      > on average.
      > --
      > David W. Schultz
      > http://home.earthlink.net/~david.schultz
      > Returned for Regrooving
      > ------------------------------------
      > ========================================================
      > Questions? Check the FAQ at http://www.cosmacelf.com/forumfaq.html
      > Visit the COSMAC ELF website at http://www.cosmacelf.comYahoo! Groups Links
      > <*> To visit your group on the web, go to:
      > http://groups.yahoo.com/group/cosmacelf/
      > <*> Your email settings:
      > Individual Email | Traditional
      > <*> To change settings online go to:
      > http://groups.yahoo.com/group/cosmacelf/join
      > (Yahoo! ID required)
      > <*> To change settings via email:
      > cosmacelf-digest@yahoogroups.com
      > cosmacelf-fullfeatured@yahoogroups.com
      > <*> To unsubscribe from this group, send an email to:
      > cosmacelf-unsubscribe@yahoogroups.com
      > <*> Your use of Yahoo! Groups is subject to:
      > http://info.yahoo.com/legal/us/yahoo/utos/terms/
    • Show all 7 messages in this topic