Loading ...
Sorry, an error occurred while loading the content.

The Rhinestone Compiler Passes 22 Dhrystones/Sec

Expand Messages
  • bill rowe
    By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just
    Message 1 of 7 , Oct 5, 2013
    • 0 Attachment
      By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.

      The biggest improvements relate to support routines rather than straightline code.  

      I had already tweaked the multiply routine to exit more quickly for small multipliers.  Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds.  For example:
      ;Int_3_Loc = 5 * Int_1_Loc 
      ld2 R13,'O',sp,(84) [loads the local variable into Reg 13]
      cpy2 R15,R13 [so R15=variable*1]
      shl2I R13,1 [R13=variable*2]
      shl2I R13,1 [R13=variable*4]
      alu2 R15,R15,R13,add,adc [now R15=variable*5]

      There's a penalty of code size, of course, this is a good 20 bytes bigger than the original.  Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger.  Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones.  In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.

      I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.

      A good case for the peephole is something like the following

      shrU2I %1,8 [compiler wants to shift an int right 8 bits]
      <is replaced with>
      ghi %1 [move top byte to low and zero the top]
      plo %1
      ldi 0
      phi %1

      There are a bunch of things like that but they just don't show up that often.  

    • dougmemphis
      Its been fun watching all this progress Bill. Keep up the great work! I haven t downloaded your compiler for a while but you might want to remind the list
      Message 2 of 7 , Oct 6, 2013
      • 0 Attachment

        Its been fun watching all this progress Bill.  Keep up the great work!
        I haven't downloaded your compiler for a while but you might want to

        remind the list about what that's about.



        ---In cosmacelf@yahoogroups.com, <cosmacelf@yahoogroups.com> wrote:

        By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.

        The biggest improvements relate to support routines rather than straightline code.  

        I had already tweaked the multiply routine to exit more quickly for small multipliers.  Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds.  For example:
        ;Int_3_Loc = 5 * Int_1_Loc 
        ld2 R13,'O',sp,(84) [loads the local variable into Reg 13]
        cpy2 R15,R13 [so R15=variable*1]
        shl2I R13,1 [R13=variable*2]
        shl2I R13,1 [R13=variable*4]
        alu2 R15,R15,R13,add,adc [now R15=variable*5]

        There's a penalty of code size, of course, this is a good 20 bytes bigger than the original.  Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger.  Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones.  In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.

        I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.

        A good case for the peephole is something like the following

        shrU2I %1,8 [compiler wants to shift an int right 8 bits]
        <is replaced with>
        ghi %1 [move top byte to low and zero the top]
        plo %1
        ldi 0
        phi %1

        There are a bunch of things like that but they just don't show up that often.  

      • bill rowe
        LCC1802 is an ANSI C Compiler for the 1802 that runs on windows. It is feature complete and supports 16 and 32 bit ints and floating point. The current
        Message 3 of 7 , Oct 8, 2013
        • 0 Attachment

          LCC1802 is an ANSI C Compiler for the 1802 that runs on windows.  It is feature complete and supports 16 and 32 bit ints and floating point.  The current version is called St. Judy's Compiler and can be found at https://sites.google.com/site/lcc1802/   Source code is available on Google Code https://code.google.com/p/lcc1802/

          The Rhinestone Edition of LCC1802 has now soared past the 23 Dhrystones/second mark with a score of 23.75.  The most recent big improvement came from replacing the 16 bit division routine.  I've been using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I never *really* understood the algorithms.  I had a hard hard look at the division routine and re-did it to make better use of the 1802.  The routine repeatedly subtracts the divisor from a shifted dividend so I was able to leave the divisor in storage rather than repeatedly sourcing it from a register and storing it each time - the inspiration came from looking at a 6502 implementation.  Division is still a painful 399 instructions excluding call/return overhead but that's down from 588 the old way.  There's still some register shuffling at the end that can maybe be cleaned up but this is a major win.

          The Dhrystone benchmark is an old integer C test of processor speed and compiler cleverness.  Overall each pass of the benchmark compiled with LCC1802 is now about 4200 instructions down from something over 7150 for the first try.  My goal is to halve the 7150 to 3575 which would be around 28 Dhrystones/second.  The closest comparison I can find is an Apple IIe using Aztec C which scores 36 Dhrystones/second.

          The divide routine is below for your edification and amusement.  The divisor comes in regarg2 and the dividend in regarg1.  The quotient ends up in retval.
          I am looking at another division algorithm entirely that doesn't always do 16 iterations but I'm not sure how much of an improvement that would be.
          _divU2:
          ; retVal = regArg1/regArg2  (remainder in regArg1)
          ;adapted from ted rossin's work and a 6502 example oct 7, 2013
              ldi 16
              plo rwork
              ldi 0
              phi retVal
              plo retVal
          ;now put the divisor in a work area on the stack and point regarg2 at it as X
              dec sp
              glo regarg2 ;divisor
              stxd
              ghi regarg2
              str sp
              glo sp ;now transfer the work area address to regarg2
              plo regarg2
              ghi sp
              phi regarg2
              sex regarg2 ;and make regarg2 the X register
          $$DivLoop16_162b:
              glo regArg1 ;shift dividend and result left one
              shl
              plo regArg1
              ghi regArg1
              shlc
              phi regArg1
              glo retVal ;shift the carry bit into the remainder
              shlc
              plo retVal
              ghi retVal
              shlc
              phi retVal
          ;now see if the divisor fits in the remainder
              inc regarg2 ;point back at low byte
              glo retval
              sm
              phi rwork ;save partial subtraction result in case we need it
              dec regarg2 ;point to high byte of divisor
              ghi retval
              smb ;subtract it from the remainder
              bnf $$divskip ;didn't fit so carry on shifting unless we're done
              phi retval ;first, save the top byte of the new remainder
              ghi rwork ;now get the low byte back
              plo retval ;put it in the remainder
              inc regarg1 ;increment the result which is growing in regarg1
          $$divskip:
              dec rwork ;now see if we've done the 16 bits
              glo rwork
              bnz $$DivLoop16_162b ;back for more if necessary

              inc sp ;all done so restore the stack and shuffle some registers
              inc sp
              sex sp ;make sure X is sp
          ;here I have the quotient in regArg1 and remainder in retVal
              glo retVal
              plo regArg2
              ghi retVal
              phi regArg2
              glo regArg1
              plo retVal
              ghi regArg1
              phi retVal
              ;return with quotient in retVal and remainder in regArg2
              sep 5


          To: cosmacelf@yahoogroups.com
          From: bill_rowe_ottawa@...
          Date: Sat, 5 Oct 2013 17:04:47 -0400
          Subject: [cosmacelf] The Rhinestone Compiler Passes 22 Dhrystones/Sec

           

          By dint of much hard work and low cunning I have gotten an overall 60% improvement in the dhrystone benchmark score up to 22.5 from the original score of just under 14.

          The biggest improvements relate to support routines rather than straightline code.  

          I had already tweaked the multiply routine to exit more quickly for small multipliers.  Today I changed the part of the compiler that emits instructions to recognize when it was working with a small constant operand and code the operation as a series of shifts and adds.  For example:
          ;Int_3_Loc = 5 * Int_1_Loc 
          ld2 R13,'O',sp,(84) [loads the local variable into Reg 13]
          cpy2 R15,R13 [so R15=variable*1]
          shl2I R13,1 [R13=variable*2]
          shl2I R13,1 [R13=variable*4]
          alu2 R15,R15,R13,add,adc [now R15=variable*5]

          There's a penalty of code size, of course, this is a good 20 bytes bigger than the original.  Realistically, you wouldn't want it to do this for constants bigger than maybe 7 - with a constant of 100 it's almost 60 bytes bigger.  Execution is much much quicker though because you avoid six 16 bit shifts, a bunch of loop control stuff, and the call/return overhead. probably 200 instructions saved for very small numbers and 50 for larger ones.  In net, this took out almost 700 instructions from the benchmark execution time because there are a bunch of constant multiplies for subscript calculation.

          I keep plugging away at the tiny gains from peephole optimization but that kind of thing is much more gratifying.

          A good case for the peephole is something like the following

          shrU2I %1,8 [compiler wants to shift an int right 8 bits]
          <is replaced with>
          ghi %1 [move top byte to low and zero the top]
          plo %1
          ldi 0
          phi %1

          There are a bunch of things like that but they just don't show up that often.  


        • David W. Schultz
          ... This tickled a memory so I dug into the archive and found the code for 32 bit non-restoring division that I sent you. After hacking it down to 16 bits I
          Message 4 of 7 , Oct 10, 2013
          • 0 Attachment
            On 10/08/2013 10:26 AM, bill rowe wrote:
            > The Rhinestone Edition of LCC1802 has now soared past the 23
            > Dhrystones/second mark with a score of 23.75. The most recent big
            > improvement came from replacing the 16 bit division routine. I've been
            > using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I
            > never *really* understood the algorithms. I had a hard hard look at the
            > division routine and re-did it to make better use of the 1802. The
            > routine repeatedly subtracts the divisor from a shifted dividend so I
            > was able to leave the divisor in storage rather than repeatedly sourcing
            > it from a register and storing it each time - the inspiration came from
            > looking at a 6502 implementation. Division is still a painful 399
            > instructions excluding call/return overhead but that's down from 588 the
            > old way. There's still some register shuffling at the end that can
            > maybe be cleaned up but this is a major win.

            This tickled a memory so I dug into the archive and found the code for
            32 bit non-restoring division that I sent you.

            After hacking it down to 16 bits I compared it to your code.

            The shift is effectively identical:

            dloop: inc TEMPP ; Adjust index to point at LSB

            glo Quo ; shift quotient left
            shlc
            plo Quo
            ghi Quo
            shlc
            phi Quo

            ldn TEMPP ; shift remainder
            shlc
            stxd
            ldn TEMPP
            shlc
            str TEMPP

            inc TEMPP ; back to LSB

            as is the subtraction:

            glo Div ; do subtraction
            sd
            stxd
            ghi Div
            sdb
            str TEMPP

            The difference is that this is a non-restoring algorithm. Instead of
            performing the subtraction and then backing out of it, the NR algorithm
            performs either an addition or a subtraction at each step. The code for
            this is:

            glo Quo ; check lsb
            ani 1
            bdf doadd

            Plus an additional branch at the end of the subtraction code.

            Add three more instructions for the loop counter and that is 24
            instructions/loop if a subtraction is performed (because of the extra
            branch) or 23 if addition is used. Call it 23.5. 16 loops gives 376
            instructions. Plus some more at the end to shift in the last bit of the
            quotient (6) and to correct the remainder as required. 3 to check plus 6
            more to correct if needed. For a total of 388 instructions on average.

            Your code uses 23 instructions/loop if the restore can be skipped and 27
            if it can't. Call it 25/loop * 16 is 400.

            So the non-restoring version is slightly faster.

            After looking at the two versions for a bit I think I can shave another
            instruction per loop off that by reversing the subtract operation. Thus
            saving one instruction to adjust the pointer. Make that 372 instructions
            on average.


            --
            David W. Schultz
            http://home.earthlink.net/~david.schultz
            Returned for Regrooving
          • bill rowe
            Thanks very much David. I had not thought to look at that. I am currently experimenting with yet another division algorithm that is quicker for smaller
            Message 5 of 7 , Oct 11, 2013
            • 0 Attachment
              Thanks very much David.  I had not thought to look at that.

              I am currently experimenting with yet another division algorithm that is quicker for smaller operands but worse for larger numbers of bits(and negative numbers).  For the division in dhrystone the operands are generally both less than ten and this cuts the calculation down to under 100 instructions(!).  It's a bit cheat-y but the worst cases are under 450 instructions so I may leave this as the default algorithm.  It's a bit messy to read my code but the paper ref'd in the credits is clear. Interestingly, I did not try to leave any of the operands on the stack since they all change during the calculation but I note that your non-restoring algorithm does leave one on the stack even though it gets shifted.  I may see if that would help here.

              The multiplication algorithm follows and again it's at its best with small positive numbers which is pretty real-world friendly.  It's much simpler and shorter than the division but the source code link is given for reference:

              ;this is a divisor shifting algorithm which is faster for smaller operands
              ;credit to http://research.microsoft.com/pubs/70645/tr-2008-141.pdf
              _divu2: ;retval=regarg1(x or dividend)/regarg2(y or divisor)
              ;uses R14(rwork) as temp, R10 to hold remainder, R11 to save original divisor y0
              testsub: macro reg1,reg2 ;test subtraction of reg2 from reg1
              ;result in D, rwork.hi, DF
              glo reg2
              str sp
              glo reg1
              sm
              phi rwork
              ghi reg2
              str sp
              ghi reg1
              smb
              endm
              cpy2 R10,regarg1
              ld2z retval
              cpy2 R11,regarg2
              dec sp
              testsub regarg1,regarg2
              bnf $$computequot ;DF=0 means it didn't fit
              phi regarg1 ;regarg1=regarg1-regarg2
              ghi rwork
              plo regarg1
              $$again: ;this is the divisor doubling phase
                testsub regarg1,regarg2
                bnf $$computequot ;df=0 means it didn't fit
              phi regarg1 ;regarg1=regarg1-regarg2
              ghi rwork
              plo regarg1 
                shl2 regarg2 ;y=y+y
                br $$again
               
               $$computequot: ;here we're computing the quotient
                testsub R10,regarg2
                bnf $$testexit
                phi R10 ;complete the subtraction
                ghi rwork
                plo R10
                inc retval
               $$testexit:
                ghi R11
                sm ;top of regarg2 is still on stack
                bnz $$ney0y
                glo regarg2
                str sp
                glo R11
                sm ;test low order bytes
                bz $$out ;if = we're done
               $$ney0y:
                shl2 retval ;double quotient
                shrU2 regarg2 ;halve divisor
                br $$computequot ;continue
               $$out:
                inc sp ;release work area
               ;here the quotient is in retval, remainder in R10
                cpy2 regarg2,R10
              cretn ;and we're done


              ;16 bit right shifting multiply which is faster for smaller operands
              ; credit to http://map.grauw.nl/articles/mult_div_shifts.php#lrmultr
              _mulU2: ;retval(product)=regarg1(multiplicand)*regarg2(multiplier)
              ld2z retval
              $$mulrlp:
              shru2 regarg2
              bnf $$mulrnoadd
              ; bottom bit of multiplier is one so add multiplicand to product
              alu2 retval,retval,regarg1,add,adc
              $$mulrnoadd:
              ghi regarg2 ;check for all bits of multiplier shifted out
              bnz $$mulrshft ;nope, continue
              glo regarg2 ;check bottom byte
              bz $$mulrdone
              $$mulrshft:
              shl2 regarg1 ;shift multiplicand left 1
              br $$mulrlp
              $$mulrdone: ;here the product is in retval
              cretn




              > To: cosmacelf@yahoogroups.com
              > From: david.schultz@...
              > Date: Thu, 10 Oct 2013 19:01:51 -0500
              > Subject: Re: [cosmacelf] 23 and counting with much improved division
              >
              > On 10/08/2013 10:26 AM, bill rowe wrote:
              > > The Rhinestone Edition of LCC1802 has now soared past the 23
              > > Dhrystones/second mark with a score of 23.75. The most recent big
              > > improvement came from replacing the 16 bit division routine. I've been
              > > using routines I cribbed from Ted Rossin's C compiler(thanks Ted!) but I
              > > never *really* understood the algorithms. I had a hard hard look at the
              > > division routine and re-did it to make better use of the 1802. The
              > > routine repeatedly subtracts the divisor from a shifted dividend so I
              > > was able to leave the divisor in storage rather than repeatedly sourcing
              > > it from a register and storing it each time - the inspiration came from
              > > looking at a 6502 implementation. Division is still a painful 399
              > > instructions excluding call/return overhead but that's down from 588 the
              > > old way. There's still some register shuffling at the end that can
              > > maybe be cleaned up but this is a major win.
              >
              > This tickled a memory so I dug into the archive and found the code for
              > 32 bit non-restoring division that I sent you.
              >
              > After hacking it down to 16 bits I compared it to your code.
              >
              > The shift is effectively identical:
              >
              > dloop: inc TEMPP ; Adjust index to point at LSB
              >
              > glo Quo ; shift quotient left
              > shlc
              > plo Quo
              > ghi Quo
              > shlc
              > phi Quo
              >
              > ldn TEMPP ; shift remainder
              > shlc
              > stxd
              > ldn TEMPP
              > shlc
              > str TEMPP
              >
              > inc TEMPP ; back to LSB
              >
              > as is the subtraction:
              >
              > glo Div ; do subtraction
              > sd
              > stxd
              > ghi Div
              > sdb
              > str TEMPP
              >
              > The difference is that this is a non-restoring algorithm. Instead of
              > performing the subtraction and then backing out of it, the NR algorithm
              > performs either an addition or a subtraction at each step. The code for
              > this is:
              >
              > glo Quo ; check lsb
              > ani 1
              > bdf doadd
              >
              > Plus an additional branch at the end of the subtraction code.
              >
              > Add three more instructions for the loop counter and that is 24
              > instructions/loop if a subtraction is performed (because of the extra
              > branch) or 23 if addition is used. Call it 23.5. 16 loops gives 376
              > instructions. Plus some more at the end to shift in the last bit of the
              > quotient (6) and to correct the remainder as required. 3 to check plus 6
              > more to correct if needed. For a total of 388 instructions on average.
              >
              > Your code uses 23 instructions/loop if the restore can be skipped and 27
              > if it can't. Call it 25/loop * 16 is 400.
              >
              > So the non-restoring version is slightly faster.
              >
              > After looking at the two versions for a bit I think I can shave another
              > instruction per loop off that by reversing the subtract operation. Thus
              > saving one instruction to adjust the pointer. Make that 372 instructions
              > on average.
              >
              >
              > --
              > David W. Schultz
              > http://home.earthlink.net/~david.schultz
              > Returned for Regrooving
              >
              >
              >
              >
              > ------------------------------------
              >
              > ========================================================
              > Questions? Check the FAQ at http://www.cosmacelf.com/forumfaq.html
              > Visit the COSMAC ELF website at http://www.cosmacelf.comYahoo! Groups Links
              >
              > <*> To visit your group on the web, go to:
              > http://groups.yahoo.com/group/cosmacelf/
              >
              > <*> Your email settings:
              > Individual Email | Traditional
              >
              > <*> To change settings online go to:
              > http://groups.yahoo.com/group/cosmacelf/join
              > (Yahoo! ID required)
              >
              > <*> To change settings via email:
              > cosmacelf-digest@yahoogroups.com
              > cosmacelf-fullfeatured@yahoogroups.com
              >
              > <*> To unsubscribe from this group, send an email to:
              > cosmacelf-unsubscribe@yahoogroups.com
              >
              > <*> Your use of Yahoo! Groups is subject to:
              > http://info.yahoo.com/legal/us/yahoo/utos/terms/
              >
            • David W. Schultz
              Some musings on multiplication. My preferred algorithm does a 16X16 multiply and returns a 32 bit result. But that really isn t required for a C compiler which
              Message 6 of 7 , Oct 13, 2013
              • 0 Attachment
                Some musings on multiplication.

                My preferred algorithm does a 16X16 multiply and returns a 32 bit
                result. But that really isn't required for a C compiler which always
                throws away the upper half of the result. That has some consequences.

                By trading a loop counter which costs 3 instructions for a test to see
                if the multiplier is zero that costs 4 instructions you get faster
                execution in the general case. The reason is simple. A multiply of
                256X256 overflows the 16 bit result. The results of an overflow are not
                specified so it doesn't matter what happens.

                By adding an extra test you can make sure that you are as fast as
                possible. The key is that you want the smaller number to be the
                multiplier since this controls the number of loop iterations. Check it
                at the beginning and swap the operands as required. Once you do the swap
                the multiplier is guaranteed to be less than 256 unless the programmer
                has passed arguments that result in an overflow. In which case who cares?

                The result is that the number of loop iterations is never more than 8.

                If you don't like the cost of a full blown compare, just test the upper
                byte of the multiplier. If it is zero, proceed as usual. If not, swap
                the multiplier and multiplicand. This will be slower in some cases where
                both are less than 256 and differ by factor of 2 or more.


                --
                David W. Schultz
                http://home.earthlink.net/~david.schultz
                Returned for Regrooving
              • bill rowe
                David: Thanks for the thoughts. My current mult routine is below. It doesn t have an explicit loop counter but it does perform better if the multiplier is
                Message 7 of 7 , Oct 13, 2013
                • 0 Attachment
                  David: Thanks for the thoughts.  My current mult routine is below.  It doesn't have an explicit loop counter but it does perform better if the multiplier is smaller.  I could implement the swap in the preamble if, say, the top of the multiplier is not zero.  That seems less likely to make things worse.  

                  I do note, looking at the code again, that I could shed a couple of instructions by pulling the 1st multiplier shift up out of the loop so that i either initialize the product with zero or the multiplicand.  I'll think about it.


                  ;16 bit right shifting multiply which is faster for smaller operands
                  ; credit to http://map.grauw.nl/articles/mult_div_shifts.php#lrmultr

                  _mulU2: ;retval(product)=regarg1(multiplicand)*regarg2(multiplier)
                        ld2z retval
                  $$mulrlp:
                        shru2 regarg2
                        bnf $$mulrnoadd
                  ; bottom bit of multiplier is one so add multiplicand to product
                        alu2 retval,retval,regarg1,add,adc
                  $$mulrnoadd:
                        ghi regarg2 ;check for all bits of multiplier shifted out
                        bnz $$mulrshft ;nope, continue
                        glo regarg2 ;check bottom byte
                        bz $$mulrdone
                  $$mulrshft:
                        shl2 regarg1 ;shift multiplicand left 1
                        br $$mulrlp
                  $$mulrdone: ;here the product is in retval
                        cretn



                  To: cosmacelf@yahoogroups.com
                  From: david.schultz@...
                  Date: Sun, 13 Oct 2013 11:25:27 -0500
                  Subject: Re: [cosmacelf] 23 and counting with much improved division

                   
                  Some musings on multiplication.

                  My preferred algorithm does a 16X16 multiply and returns a 32 bit
                  result. But that really isn't required for a C compiler which always
                  throws away the upper half of the result. That has some consequences.

                  By trading a loop counter which costs 3 instructions for a test to see
                  if the multiplier is zero that costs 4 instructions you get faster
                  execution in the general case. The reason is simple. A multiply of
                  256X256 overflows the 16 bit result. The results of an overflow are not
                  specified so it doesn't matter what happens.

                  By adding an extra test you can make sure that you are as fast as
                  possible. The key is that you want the smaller number to be the
                  multiplier since this controls the number of loop iterations. Check it
                  at the beginning and swap the operands as required. Once you do the swap
                  the multiplier is guaranteed to be less than 256 unless the programmer
                  has passed arguments that result in an overflow. In which case who cares?

                  The result is that the number of loop iterations is never more than 8.

                  If you don't like the cost of a full blown compare, just test the upper
                  byte of the multiplier. If it is zero, proceed as usual. If not, swap
                  the multiplier and multiplicand. This will be slower in some cases where
                  both are less than 256 and differ by factor of 2 or more.

                  --
                  David W. Schultz
                  http://home.earthlink.net/~david.schultz
                  Returned for Regrooving


                Your message has been successfully submitted and would be delivered to recipients shortly.