1448623 and counting with much improved division
- Oct 8, 2013LCC1802 is an ANSI C Compiler for the 1802 that runs on windows. It is feature complete and supports 16 and 32 bit ints and floating point. The current version is called St. Judy's Compiler and can be found at https://sites.google.com/site/lcc1802/ Source code is available on Google Code https://code.google.com/p/lcc1802/The Rhinestone Edition of LCC1802 has now soared past the 23 Dhrystones/second mark with a score of 23.75. The most recent big improvement came from replacing the 16 bit division routine. I've been using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I never *really* understood the algorithms. I had a hard hard look at the division routine and re-did it to make better use of the 1802. The routine repeatedly subtracts the divisor from a shifted dividend so I was able to leave the divisor in storage rather than repeatedly sourcing it from a register and storing it each time - the inspiration came from looking at a 6502 implementation. Division is still a painful 399 instructions excluding call/return overhead but that's down from 588 the old way. There's still some register shuffling at the end that can maybe be cleaned up but this is a major win.The Dhrystone benchmark is an old integer C test of processor speed and compiler cleverness. Overall each pass of the benchmark compiled with LCC1802 is now about 4200 instructions down from something over 7150 for the first try. My goal is to halve the 7150 to 3575 which would be around 28 Dhrystones/second. The closest comparison I can find is an Apple IIe using Aztec C which scores 36 Dhrystones/second.The divide routine is below for your edification and amusement. The divisor comes in regarg2 and the dividend in regarg1. The quotient ends up in retval.I am looking at another division algorithm entirely that doesn't always do 16 iterations but I'm not sure how much of an improvement that would be._divU2:; retVal = regArg1/regArg2 (remainder in regArg1);adapted from ted rossin's work and a 6502 example oct 7, 2013ldi 16plo rworkldi 0phi retValplo retVal;now put the divisor in a work area on the stack and point regarg2 at it as Xdec spglo regarg2 ;divisorstxdghi regarg2str spglo sp ;now transfer the work area address to regarg2plo regarg2ghi spphi regarg2sex regarg2 ;and make regarg2 the X register$$DivLoop16_162b:glo regArg1 ;shift dividend and result left oneshlplo regArg1ghi regArg1shlcphi regArg1glo retVal ;shift the carry bit into the remaindershlcplo retValghi retValshlcphi retVal;now see if the divisor fits in the remainderinc regarg2 ;point back at low byteglo retvalsmphi rwork ;save partial subtraction result in case we need itdec regarg2 ;point to high byte of divisorghi retvalsmb ;subtract it from the remainderbnf $$divskip ;didn't fit so carry on shifting unless we're donephi retval ;first, save the top byte of the new remainderghi rwork ;now get the low byte backplo retval ;put it in the remainderinc regarg1 ;increment the result which is growing in regarg1$$divskip:dec rwork ;now see if we've done the 16 bitsglo rworkbnz $$DivLoop16_162b ;back for more if necessaryinc sp ;all done so restore the stack and shuffle some registersinc spsex sp ;make sure X is sp;here I have the quotient in regArg1 and remainder in retValglo retValplo regArg2ghi retValphi regArg2glo regArg1plo retValghi regArg1phi retVal;return with quotient in retVal and remainder in regArg2sep 5
Date: Sat, 5 Oct 2013 17:04:47 -0400
Subject: [cosmacelf] The Rhinestone Compiler Passes 22 Dhrystones/Sec
By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.The biggest improvements relate to support routines rather than straightline code.I had already tweaked the multiply routine to exit more quickly for small multipliers. Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds. For example:;Int_3_Loc = 5 * Int_1_Locld2 R13,'O',sp,(84) [loads the local variable into Reg 13]cpy2 R15,R13 [so R15=variable*1]shl2I R13,1 [R13=variable*2]shl2I R13,1 [R13=variable*4]alu2 R15,R15,R13,add,adc [now R15=variable*5]There's a penalty of code size, of course, this is a good 20 bytes bigger than the original. Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger. Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones. In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.A good case for the peephole is something like the followingshrU2I %1,8 [compiler wants to shift an int right 8 bits]<is replaced with>ghi %1 [move top byte to low and zero the top]plo %1ldi 0phi %1There are a bunch of things like that but they just don't show up that often.
- << Previous post in topic Next post in topic >>