## Re: How fast is your GCD code? Here's mine...

Message 1 of 12 , Jan 3, 2012
--- In primenumbers@yahoogroups.com, Jack Brennen <jfb@...> wrote:
>
> Is it faster if you replace this:
>
> for(s=0; ((a|b)&1)==0; s++){ a>>= 1; b>>= 1; }
> while((a&1)==0){ a>>= 1; }
>
> with this:
>
> if ((a&1)==0) {
> s = __builtin_ctz(a|b); b>>=s;
> a >>= __builtin_ctz(a);
> }
>
> (or the equivalent intrinsic for your compiler)?
>
> That would seem to take care of fast-pathing the cases where no
> shifting is needed, exploiting the CPU's ability to count trailing
> zeroes in a single instruction.

--thanks! That hack sped it up from 260 to 258 nanosec.
You also need to add an "s=0" to your code fragment otherwise s could be used
without having been zeroed.

But then I applied your same idea in the second place and that sped it up to 142
nanosec!!! New code:

//Binary bithacked algorithm, 142 nanosec on iMac 2.0 GHz.
//Warren D. Smith December 2011.
//For comparison on same machine x += a%b takes 19.5 nanosec.
uint64 gcd64(uint64 a, uint64 b){
  int s;
  int64 d;
  if( (a==0) || (b==0) ){ return( a|b ); } //gcd(0,x)=x

  s=0;
  if ((a&1)==0) { //this code improved here by Jack Brennen
    s = __builtin_ctz(a|b); b>>=s;
    a >>= __builtin_ctz(a);
  }

  while( (b|a) > 31 ){
    b >>= __builtin_ctz(b);
    d = b; d -= a;
    d &= d>>63; //where 63+1=wordsize of uint64s
    b -= d+d+a;
    a += d;
    b >>= 1;
    if(b==0){ return( a << s ); }
  }
  return( ((uint64)SmallGCDarray[a][b]) << s ); //32x32 precomputed array
}
Message 2 of 12 , Jan 3, 2012
Yes, as per Brennen's bug fix, the ctz's should be ctzll's.

That also speeds it up to 139 nanosec.
Message 3 of 12 , Jan 4, 2012
From: WarrenS
> Yes, as per Brennen's bug fix, the
> ctz's should be ctzll's.
>
> That also speeds it up to 139 nanosec.

I think Bob S posted his highly tuned code to sci.math, or just possibly the Mersenne Forum, about 5 years ago. It had special cases for removing small multiples of the smaller number, which looked as though they would annoy branch-prediction units; in theory, that's something he should have been taking into account. (I was on an Alpha in those days, and branch prediction was everything, so I was hypersensitive to such bubbles. However, perhaps the code was good because the ifs were actually quite predictable; you'd need to try it out on a modern machine.)

Phil
Message 4 of 12 , Jan 4, 2012
From: Jack Brennen <jfb@...>
> Apologies once again, but I don't
> want to leave broken code out there...

A noble aim. So why did you overlook this:

> >>     int64 d;
> >>       d&= d>>63;     //where 63+1=wordsize of uint64s

Which is classic UB.

Nacked-by: Phil
Message 5 of 12 , Jan 4, 2012
From: WarrenS <warren.wds@...>
> > > b -= d+d+a;
> > > a += d; //the obvious "optimization" of this
> & previous line... makes it slower!
> > > (...)
> >
> >
> > I can't imagine that
> > a += d ; b -= d+a
> > would be slower.
>
> --it is slower! On my computer, anyhow.

It's a sparse enough inner loop that I can easily imagine the increased dependency makes it slower. What's the latency of an add nowadays? I know it's crept up to about 6 in the past decade (at least on the SIMD units). Something like that's a huge bubble, and should definitely be avoided.

Phil