Loading ...
Sorry, an error occurred while loading the content.
 

RE: Lehman factorization algorithm

Expand Messages
  • Kermit Rose
    ... Here is trivial illustration with N = 25. ... I take T = 1, N = 25. ... B = 2 ... k = 0 ... k = 1 m = 4 r = 2 a = 7 -- 7%4 == 3 -- do not use a = 8 --
    Message 1 of 12 , Jan 2, 2012
      On 1/2/2012 6:26 AM, primenumbers@yahoogroups.com wrote:
      > 1a. Lehman factorization algorithm, theoretical comparison with SQUFOF
      > Posted by: "WarrenS"warren.wds@... warren_d_smith31
      > Date: Sun Jan 1, 2012 4:35 pm ((PST))

      Here is trivial illustration with N = 25.

      > In this post I explain R.S.Lehman's factorization algorithm, provide a proof and runtime analysis, and compare with SQUFOF.
      >
      > 1. The algorithm. Let T>0 be a tunable constant. [I will prove validity if T>=1.]

      I take T = 1, N = 25.
      >
      > Given odd integer N>2, finds nontrivial factor (or proves N prime).
      > B = floor(N^(1/3) * T);
      B = 2
      >
      > Trial divide N up to bound B.
      > if(factor found) return(Factor);
      > k=0;
      k = 0

      >
      > LoopForever{
      > k++; if(k*T^3 > B){ return(N is prime); }
      > if(k even){ m=2; r=1; }else{ m=4; r=(k+N)%4; }
      > for(all integers a with 0<=a*a-4*k*N<=B*B and with a%m==r){
      > c = a*a-4*k*N;
      > Test whether c is square;
      > if(issquare){
      > b=sqrt(c);
      > return( gcd(a+b, N) );
      > }
      > }
      > }

      k = 1
      m = 4
      r = 2
      a = 7 --> 7%4 == 3 --> do not use
      a = 8 --> 8%4 == 0 --> do not use
      a = 9 --> 9%4 == 1 --> do not use
      a = 10 --> 10%4 == 2 --> use

      c = 10*10 - 4 * 1 * 25 = 0
      c is square

      b = 0
      return ( GCD(10+0,25))
    • WarrenS
      Good luck trying to decrypt this code... it is the fastest GCD code I have at present. //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz. //Warren D.
      Message 2 of 12 , Jan 3, 2012
        Good luck trying to decrypt this code... it is the fastest
        GCD code I have at present.

        //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz.
        //Warren D. Smith December 2011.
        //For comparison on same machine x += a%b takes 19.5 nanosec.
        uint64 gcd64(uint64 a, uint64 b){
        int s;
        int64 d;
        if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x
        for(s=0; ((a|b)&1)==0; s++){ a >>= 1; b >>= 1; }
        while((a&1)==0){ a >>= 1; }
        for(;;){
        if( (b|a) < 32 ){ return( ((uint64)SmallGCDarray[a][b]) << s ); } //32x32 precomputed array
        while((b&1)==0){ b>>=1; }
        d = b; d -= a;
        d &= d>>63; //where 63+1=wordsize of uint64s
        b -= d+d+a;
        a += d; //the obvious "optimization" of this & previous line... makes it slower!
        b >>= 1;
        if(b==0){ return( a << s ); }
        }
        }
      • Maximilian Hasler
        ... I can t imagine that d=b-a; (...) a += d ; b -= d+a would be slower. Anyway, the compiler should know (for these 2-3 instances) what is the fastest code
        Message 3 of 12 , Jan 3, 2012
          > (...)
          > d = b; d -= a;
          > (...)
          > b -= d+d+a;
          > a += d; //the obvious "optimization" of this & previous line... makes it slower!
          > (...)


          I can't imagine that

          d=b-a;
          (...)
          a += d ; b -= d+a

          would be slower.
          Anyway, the compiler should know (for these 2-3 instances) what is the
          fastest code for a given processor. It may not be the same if you
          compile it for another iThing.

          Best wishes for the New Year to all of the primenumbers group,

          Maximilian
        • WarrenS
          ... --it is slower! On my computer, anyhow.
          Message 4 of 12 , Jan 3, 2012
            --- In primenumbers@yahoogroups.com, Maximilian Hasler <maximilian.hasler@...> wrote:
            >
            > > (...)
            > > d = b; d -= a;
            > > (...)
            > > b -= d+d+a;
            > > a += d; //the obvious "optimization" of this & previous line... makes it slower!
            > > (...)
            >
            >
            > I can't imagine that
            >
            > d=b-a;
            > (...)
            > a += d ; b -= d+a
            >
            > would be slower.

            --it is slower! On my computer, anyhow.
          • Jack Brennen
            Is it faster if you replace this: for(s=0; ((a|b)&1)==0; s++){ a = 1; b = 1; } while((a&1)==0){ a = 1; } with this: if ((a&1)==0) { s = __builtin_ctz(a|b);
            Message 5 of 12 , Jan 3, 2012
              Is it faster if you replace this:

              for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
              while((a&1)==0){ a>>= 1; }

              with this:

              if ((a&1)==0) {
              s = __builtin_ctz(a|b); b>>=s;
              a >>= __builtin_ctz(a);
              }

              (or the equivalent intrinsic for your compiler)?

              That would seem to take care of fast-pathing the cases where no
              adjustment is made because a is odd, while taking advantage of
              the CPU's ability to count trailing zeroes in a single instruction.


              On 1/3/2012 11:30 AM, WarrenS wrote:
              > Good luck trying to decrypt this code... it is the fastest
              > GCD code I have at present.
              >
              > //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz.
              > //Warren D. Smith December 2011.
              > //For comparison on same machine x += a%b takes 19.5 nanosec.
              > uint64 gcd64(uint64 a, uint64 b){
              > int s;
              > int64 d;
              > if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x
              > for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
              > while((a&1)==0){ a>>= 1; }
              > for(;;){
              > if( (b|a)< 32 ){ return( ((uint64)SmallGCDarray[a][b])<< s ); } //32x32 precomputed array
              > while((b&1)==0){ b>>=1; }
              > d = b; d -= a;
              > d&= d>>63; //where 63+1=wordsize of uint64s
              > b -= d+d+a;
              > a += d; //the obvious "optimization" of this& previous line... makes it slower!
              > b>>= 1;
              > if(b==0){ return( a<< s ); }
              > }
              > }
              >
              >
              >
              >
              > ------------------------------------
              >
              > Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com
              > The Prime Pages : http://www.primepages.org/
              >
              > Yahoo! Groups Links
              >
              >
              >
              >
              >
            • Jack Brennen
              Make that: s=0; if ((a&1)==0) { s = __builtin_ctz(a|b); b =s; a = __builtin_ctz(a); }
              Message 6 of 12 , Jan 3, 2012
                Make that:

                s=0; if ((a&1)==0) {
                s = __builtin_ctz(a|b); b>>=s;
                a>>= __builtin_ctz(a);
                }

                On 1/3/2012 2:40 PM, Jack Brennen wrote:
                > Is it faster if you replace this:
                >
                > for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
                > while((a&1)==0){ a>>= 1; }
                >
                > with this:
                >
                > if ((a&1)==0) {
                > s = __builtin_ctz(a|b); b>>=s;
                > a>>= __builtin_ctz(a);
                > }
                >
                > (or the equivalent intrinsic for your compiler)?
                >
                > That would seem to take care of fast-pathing the cases where no
                > adjustment is made because a is odd, while taking advantage of
                > the CPU's ability to count trailing zeroes in a single instruction.
                >
                >
                > On 1/3/2012 11:30 AM, WarrenS wrote:
                >> Good luck trying to decrypt this code... it is the fastest
                >> GCD code I have at present.
                >>
                >> //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz.
                >> //Warren D. Smith December 2011.
                >> //For comparison on same machine x += a%b takes 19.5 nanosec.
                >> uint64 gcd64(uint64 a, uint64 b){
                >> int s;
                >> int64 d;
                >> if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x
                >> for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
                >> while((a&1)==0){ a>>= 1; }
                >> for(;;){
                >> if( (b|a)< 32 ){ return( ((uint64)SmallGCDarray[a][b])<< s ); } //32x32 precomputed array
                >> while((b&1)==0){ b>>=1; }
                >> d = b; d -= a;
                >> d&= d>>63; //where 63+1=wordsize of uint64s
                >> b -= d+d+a;
                >> a += d; //the obvious "optimization" of this& previous line... makes it slower!
                >> b>>= 1;
                >> if(b==0){ return( a<< s ); }
                >> }
                >> }
                >>
                >>
                >>
                >>
                >> ------------------------------------
                >>
                >> Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com
                >> The Prime Pages : http://www.primepages.org/
                >>
                >> Yahoo! Groups Links
                >>
                >>
                >>
                >>
                >>
                >
                >
                >
                > ------------------------------------
                >
                > Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com
                > The Prime Pages : http://www.primepages.org/
                >
                > Yahoo! Groups Links
                >
                >
                >
                >
                >
              • Jack Brennen
                Apologies once again, but I don t want to leave broken code out there... It should be: s=0; if ((a&1)==0) { s = __builtin_ctzll(a|b); b =s; a =
                Message 7 of 12 , Jan 3, 2012
                  Apologies once again, but I don't want to leave broken code out there...
                  It should be:

                  s=0; if ((a&1)==0) {
                  s = __builtin_ctzll(a|b); b>>=s;
                  a>>= __builtin_ctzll(a);
                  }

                  On 1/3/2012 2:40 PM, Jack Brennen wrote:
                  > Is it faster if you replace this:
                  >
                  > for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
                  > while((a&1)==0){ a>>= 1; }
                  >
                  > with this:
                  >
                  > if ((a&1)==0) {
                  > s = __builtin_ctz(a|b); b>>=s;
                  > a>>= __builtin_ctz(a);
                  > }
                  >
                  > (or the equivalent intrinsic for your compiler)?
                  >
                  > That would seem to take care of fast-pathing the cases where no
                  > adjustment is made because a is odd, while taking advantage of
                  > the CPU's ability to count trailing zeroes in a single instruction.
                  >
                  >
                  > On 1/3/2012 11:30 AM, WarrenS wrote:
                  >> Good luck trying to decrypt this code... it is the fastest
                  >> GCD code I have at present.
                  >>
                  >> //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz.
                  >> //Warren D. Smith December 2011.
                  >> //For comparison on same machine x += a%b takes 19.5 nanosec.
                  >> uint64 gcd64(uint64 a, uint64 b){
                  >> int s;
                  >> int64 d;
                  >> if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x
                  >> for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
                  >> while((a&1)==0){ a>>= 1; }
                  >> for(;;){
                  >> if( (b|a)< 32 ){ return( ((uint64)SmallGCDarray[a][b])<< s ); } //32x32 precomputed array
                  >> while((b&1)==0){ b>>=1; }
                  >> d = b; d -= a;
                  >> d&= d>>63; //where 63+1=wordsize of uint64s
                  >> b -= d+d+a;
                  >> a += d; //the obvious "optimization" of this& previous line... makes it slower!
                  >> b>>= 1;
                  >> if(b==0){ return( a<< s ); }
                  >> }
                  >> }
                  >>
                  >>
                  >>
                  >>
                  >> ------------------------------------
                  >>
                  >> Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com
                  >> The Prime Pages : http://www.primepages.org/
                  >>
                  >> Yahoo! Groups Links
                  >>
                  >>
                  >>
                  >>
                  >>
                  >
                  >
                  >
                  > ------------------------------------
                  >
                  > Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com
                  > The Prime Pages : http://www.primepages.org/
                  >
                  > Yahoo! Groups Links
                  >
                  >
                  >
                  >
                  >
                • WarrenS
                  ... --thanks! That hack sped it up from 260 to 258 nanosec. You also need to add an s=0 to your code fragment otherwise s could be used without having been
                  Message 8 of 12 , Jan 3, 2012
                    --- In primenumbers@yahoogroups.com, Jack Brennen <jfb@...> wrote:
                    >
                    > Is it faster if you replace this:
                    >
                    > for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
                    > while((a&1)==0){ a>>= 1; }
                    >
                    > with this:
                    >
                    > if ((a&1)==0) {
                    > s = __builtin_ctz(a|b); b>>=s;
                    > a >>= __builtin_ctz(a);
                    > }
                    >
                    > (or the equivalent intrinsic for your compiler)?
                    >
                    > That would seem to take care of fast-pathing the cases where no
                    > adjustment is made because a is odd, while taking advantage of
                    > the CPU's ability to count trailing zeroes in a single instruction.

                    --thanks! That hack sped it up from 260 to 258 nanosec.
                    You also need to add an "s=0" to your code fragment otherwise s could be used
                    without having been zeroed.

                    Bu then I applied your same idea in the second place and that sped it up to 142
                    nanosec!!! New code:


                    //Binary bithacked algorithm, 142 nanosec on iMac 2.0 GHz.
                    //Warren D. Smith December 2011.
                    //For comparison on same machine x += a%b takes 19.5 nanosec.
                    uint64 gcd64(uint64 a, uint64 b){
                    int s;
                    int64 d;
                    if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x

                    s=0;
                    if ((a&1)==0) { //this code improved here by Jack Brennen
                    s = __builtin_ctz(a|b); b>>=s;
                    a >>= __builtin_ctz(a);
                    }

                    while( (b|a) > 31 ){
                    b >>= __builtin_ctz(b);
                    d = b; d -= a;
                    d &= d>>63; //where 63+1=wordsize of uint64s
                    b -= d+d+a;
                    a += d;
                    b >>= 1;
                    if(b==0){ return( a << s ); }
                    }
                    return( ((uint64)SmallGCDarray[a][b]) << s ); //32x32 precomputed array
                    }
                  • WarrenS
                    Yes, as per Brennan s bug fix, the ctz s should be ctzll s. That also speeds it up to 139 nanosec.
                    Message 9 of 12 , Jan 3, 2012
                      Yes, as per Brennan's bug fix, the ctz's should be ctzll's.

                      That also speeds it up to 139 nanosec.
                    • Phil Carmody
                      From: WarrenS ... I think Bob S posted his highly tuned code to sci.math, or just possibly the Mersenne Forum, about 5 years ago. It had special cases for
                      Message 10 of 12 , Jan 4, 2012
                        From: WarrenS
                        > Yes, as per Brennan's bug fix, the
                        > ctz's should be ctzll's.
                        >
                        > That also speeds it up to 139 nanosec.

                        I think Bob S posted his highly tuned code to sci.math, or just possibly the Mersenne Forum, about 5 years ago. It had special cases for removing small multiples of the smaller number, which looked like it would annoy branch prediction units, which in theory he should have been taking into account. (I was on an Alpha in those days, and branch prediction was everything, so I was more hypersensitive to such bubbles. However, perhaps the code was good because the ifs were actually quite predictable; you'd need to try it out on a modern machine.)

                        Phil
                      • Phil Carmody
                        From: Jack Brennen ... Which is classic UB. Nacked-by: Phil
                        Message 11 of 12 , Jan 4, 2012
                          From: Jack Brennen <jfb@...>
                          > Apologies once again, but I don't
                          > want to leave broken code out there...

                          A noble aim. So why did you overlook this:

                          > >>     int64 d;
                          > >>       d&= d>>63;     //where 63+1=wordsize of uint64s

                          Which is classic UB.

                          Nacked-by: Phil
                        • Phil Carmody
                          From: WarrenS ... It s a sparse enough inner loop that I can easily imagine the increased dependency makes it slower. What s the latency
                          Message 12 of 12 , Jan 4, 2012
                            From: WarrenS <warren.wds@...>
                            > > > b -= d+d+a;
                            > > > a += d; //the obvious "optimization" of this
                            > & previous line... makes it slower!
                            > > > (...)
                            > >
                            > >
                            > > I can't imagine that
                            > > a += d ; b -= d+a
                            > > would be slower.
                            >
                            > --it is slower! On my computer, anyhow.

                            It's a sparse enough inner loop that I can easily imagine the increased dependency makes it slower. What's the latency of an add nowadays? I know it's crept up to about 6 in the past decade (at least on the SIMD units). Something like that's a huge bubble, and should definitely be avoided.

                            Phil
                          Your message has been successfully submitted and would be delivered to recipients shortly.