Loading ...
Sorry, an error occurred while loading the content.

1448623 and counting with much improved division

Expand Messages
  • bill rowe
    Oct 8, 2013
    • 0 Attachment

      LCC1802 is an ANSI C Compiler for the 1802 that runs on windows.  It is feature complete and supports 16 and 32 bit ints and floating point.  The current version is called St. Judy's Compiler and can be found at https://sites.google.com/site/lcc1802/   Source code is available on Google Code https://code.google.com/p/lcc1802/

      The Rhinestone Edition of LCC1802 has now soared past the 23 Dhrystones/second mark with a score of 23.75.  The most recent big improvement came from replacing the 16 bit division routine.  I've been using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I never *really* understood the algorithms.  I had a hard hard look at the division routine and re-did it to make better use of the 1802.  The routine repeatedly subtracts the divisor from a shifted dividend so I was able to leave the divisor in storage rather than repeatedly sourcing it from a register and storing it each time - the inspiration came from looking at a 6502 implementation.  Division is still a painful 399 instructions excluding call/return overhead but that's down from 588 the old way.  There's still some register shuffling at the end that can maybe be cleaned up but this is a major win.

      The Dhrystone benchmark is an old integer C test of processor speed and compiler cleverness.  Overall each pass of the benchmark compiled with LCC1802 is now about 4200 instructions down from something over 7150 for the first try.  My goal is to halve the 7150 to 3575 which would be around 28 Dhrystones/second.  The closest comparison I can find is an Apple IIe using Aztec C which scores 36 Dhrystones/second.

      The divide routine is below for your edification and amusement.  The divisor comes in regarg2 and the dividend in regarg1.  The quotient ends up in retval.
      I am looking at another division algorithm entirely that doesn't always do 16 iterations but I'm not sure how much of an improvement that would be.
      _divU2:
      ; retVal = regArg1/regArg2  (remainder in regArg1)
      ;adapted from ted rossin's work and a 6502 example oct 7, 2013
          ldi 16
          plo rwork
          ldi 0
          phi retVal
          plo retVal
      ;now put the divisor in a work area on the stack and point regarg2 at it as X
          dec sp
          glo regarg2 ;divisor
          stxd
          ghi regarg2
          str sp
          glo sp ;now transfer the work area address to regarg2
          plo regarg2
          ghi sp
          phi regarg2
          sex regarg2 ;and make regarg2 the X register
      $$DivLoop16_162b:
          glo regArg1 ;shift dividend and result left one
          shl
          plo regArg1
          ghi regArg1
          shlc
          phi regArg1
          glo retVal ;shift the carry bit into the remainder
          shlc
          plo retVal
          ghi retVal
          shlc
          phi retVal
      ;now see if the divisor fits in the remainder
          inc regarg2 ;point back at low byte
          glo retval
          sm
          phi rwork ;save partial subtraction result in case we need it
          dec regarg2 ;point to high byte of divisor
          ghi retval
          smb ;subtract it from the remainder
          bnf $$divskip ;didn't fit so carry on shifting unless we're done
          phi retval ;first, save the top byte of the new remainder
          ghi rwork ;now get the low byte back
          plo retval ;put it in the remainder
          inc regarg1 ;increment the result which is growing in regarg1
      $$divskip:
          dec rwork ;now see if we've done the 16 bits
          glo rwork
          bnz $$DivLoop16_162b ;back for more if necessary

          inc sp ;all done so restore the stack and shuffle some registers
          inc sp
          sex sp ;make sure X is sp
      ;here I have the quotient in regArg1 and remainder in retVal
          glo retVal
          plo regArg2
          ghi retVal
          phi regArg2
          glo regArg1
          plo retVal
          ghi regArg1
          phi retVal
          ;return with quotient in retVal and remainder in regArg2
          sep 5


      To: cosmacelf@yahoogroups.com
      From: bill_rowe_ottawa@...
      Date: Sat, 5 Oct 2013 17:04:47 -0400
      Subject: [cosmacelf] The Rhinestone Compiler Passes 22 Dhrystones/Sec

       

      By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.

      The biggest improvements relate to support routines rather than straightline code.  

      I had already tweaked the multiply routine to exit more quickly for small multipliers.  Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds.  For example:
      ;Int_3_Loc = 5 * Int_1_Loc 
      ld2 R13,'O',sp,(84) [loads the local variable into Reg 13]
      cpy2 R15,R13 [so R15=variable*1]
      shl2I R13,1 [R13=variable*2]
      shl2I R13,1 [R13=variable*4]
      alu2 R15,R15,R13,add,adc [now R15=variable*5]

      There's a penalty of code size, of course, this is a good 20 bytes bigger than the original.  Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger.  Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones.  In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.

      I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.

      A good case for the peephole is something like the following

      shrU2I %1,8 [compiler wants to shift an int right 8 bits]
      <is replaced with>
      ghi %1 [move top byte to low and zero the top]
      plo %1
      ldi 0
      phi %1

      There are a bunch of things like that but they just don't show up that often.  


    • Show all 7 messages in this topic