- Apologies once again, but I don't want to leave broken code out there...

It should be:

s=0; if ((a&1)==0) {

s = __builtin_ctzll(a|b); b>>=s;

a>>= __builtin_ctzll(a);

}

On 1/3/2012 2:40 PM, Jack Brennen wrote:

> Is it faster if you replace this:

>

> for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }

> while((a&1)==0){ a>>= 1; }

>

> with this:

>

> if ((a&1)==0) {

> s = __builtin_ctz(a|b); b>>=s;

> a>>= __builtin_ctz(a);

> }

>

> (or the equivalent intrinsic for your compiler)?

>

> That would seem to take care of fast-pathing the cases where no

> adjustment is made because a is odd, while taking advantage of

> the CPU's ability to count trailing zeroes in a single instruction.

>

>

> On 1/3/2012 11:30 AM, WarrenS wrote:

>> Good luck trying to decrypt this code... it is the fastest

>> GCD code I have at present.

>>

>> //Binary bithacked algorithm, 260 nanosec on iMac 2.0 GHz.

>> //Warren D. Smith December 2011.

>> //For comparison on same machine x += a%b takes 19.5 nanosec.

>> uint64 gcd64(uint64 a, uint64 b){

>> int s;

>> int64 d;

>> if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x

>> for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }

>> while((a&1)==0){ a>>= 1; }

>> for(;;){

>> if( (b|a)< 32 ){ return( ((uint64)SmallGCDarray[a][b])<< s ); } //32x32 precomputed array

>> while((b&1)==0){ b>>=1; }

>> d = b; d -= a;

>> d&= d>>63; //where 63+1=wordsize of uint64s

>> b -= d+d+a;

>> a += d; //the obvious "optimization" of this& previous line... makes it slower!

>> b>>= 1;

>> if(b==0){ return( a<< s ); }

>> }

>> }

>>

>>

>>

>>

>> ------------------------------------

>>

>> Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com

>> The Prime Pages : http://www.primepages.org/

>>

>> Yahoo! Groups Links

>>

>>

>>

>>

>>

>

>

>

> ------------------------------------

>

> Unsubscribe by an email to: primenumbers-unsubscribe@yahoogroups.com

> The Prime Pages : http://www.primepages.org/

>

> Yahoo! Groups Links

>

>

>

>

> - From: WarrenS <warren.wds@...>
> > > b -= d+d+a;

It's a sparse enough inner loop that I can easily imagine the increased dependency makes it slower. What's the latency of an add nowadays? I know it's crept up to about 6 in the past decade (at least on the SIMD units). Something like that's a huge bubble, and should definitely be avoided.

> > > a += d; //the obvious "optimization" of this

> & previous line... makes it slower!

> > > (...)

> >

> >

> > I can't imagine that

> > a += d ; b -= d+a

> > would be slower.

>

> --it is slower! On my computer, anyhow.

Phil