## Infallible prime test for n<2^64

Message 1 of 7, Mar 15, 2013
I succeeded in programming such a test. It is a combination of a
base-2 strong pseudoprime test and a certain Lehmer test.
No multiprecision math libraries needed. I'll try to
give the details later. It has a lot of similarity to the BPSW test.
If run on the 31894014 odd strong pseudoprimes
below 2^64, it correctly identifies them
all as both composite and spsp(2). The runtime is
258 or 259 sec if only the Lehmer test is run (two runs timed)
164 sec if only the spsp(2) test is run
395 sec if both tests are run (which you need for infallibility)
from which I deduce that processing the input & other extraneous stuff
took 27 or 28 seconds, so that the speed was
231 sec for Lehmer only (138K tests/sec)
137 sec for spsp(2) only (233K tests/sec)
368 sec for combined test (87K tests/sec)
on a 2GHz Apple iMac Core Duo, while other jobs were also running; this was not the sole job.
Note my Lehmer test actually ran in "1.69 Selfridges" experimentally, i.e.
231/137=1.69. This somewhat surprised me, and it indicates my (unfinished)
attempt to build a 3-Selfridge infallible test for n<2^64 using spsp tests only
was a waste of time, although I currently believe that test is possible and will be found
once my computer grinds enough.

This also tells us that if my "faster than Eratosthenes" prime generator (explained in an earlier post) were run, it would generate primes at a rate of 87000*c primes output per second, for some constant c with 0<c<1. This is slower than available primesieve software by a factor of 500 to 1000 for primes up to 10^13. So I think my new prime generator, even though asymptotically superior to Eratosthenes, will in practice be a lot slower unless there is special-purpose (e.g. pipelined) hardware. With said hardware, who knows, but I suspect my prime generator would win against all rival methods.
Probably nobody will have the money to design & build a good hardware implementation, though, in which case that victory would be irrelevant to the real world.

--
Warren D. Smith
Message 2 of 7, Mar 15, 2013
--- In primenumbers@yahoogroups.com, "WarrenS" <warren.wds@...> wrote:
>
> I succeeded in programming such a test. It is a combination of a
> base-2 strong pseudoprime test and a certain Lehmer test.
> No multiprecision math libraries needed. I'll try to
> give the details later. It has a lot of similarity to the BPSW test.
> If run on the 31894014 odd strong pseudoprimes
> below 2^64, it correctly identifies them
> all as both composite and spsp(2). The runtime is
> 258 or 259 sec if only the Lehmer test is run (two runs timed)
> 164 sec if only the spsp(2) test is run
> 395 sec if both tests are run (which you need for infallibility)
> from which I deduce that processing the input & other extraneous stuff
> took 27 or 28 seconds, so that the speed was
> 231 sec for Lehmer only (138K tests/sec)
> 137 sec for spsp(2) only (233K tests/sec)
> 368 sec for combined test (87K tests/sec)

> on 2GHz apple imac core duo, while also running other stuff, this was not the sole job.

Not very scientific, if your machine started indexing the disk for example.

> Note my Lehmer test actually ran in "1.69 Selfridges" experimentally, i.e.
> 231/137=1.69, which somewhat surprised me and indicates my (unfinished)

1.69 sounds good but can you be more precise?

> attempt to build a 3-selfridge infallible test for n<2^64 using spsp tests only, was a waste of time, although I currently believe that test will be possible and will be found
> once my computer grinds enough.
>

You might want to use GPGPU:
http://www.gpgpgpu.com/gecco2009/6.pdf

> This also tells us that if my "faster than Eratosthenes" prime generator (explained in an earlier post) were run, it would generate primes at a rate of 87000*c primes output per second, for some constant c with 0<c<1. This is slower than available primesieve software by a factor of 500 to 1000 for primes up to 10^13, so I think my new prime generator, even though asymptotically superior to Eratosthenes, in practice will be a lot slower unless there is special purpose (e.g. pipelined) hardware. With said hardware, who knows, but I suspect my prime generator would be winning versus all rival methods.
> Probably nobody will have the money to design & build a good hardware implementation, though, in which case that victory would be irrelevant to the real world.
>
>
> --
> Warren D. Smith
> http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
>

Paul
Message 3 of 7, Mar 17, 2013
OK, I will now attempt to describe my tests and the (revised) timing info.

32-BIT TESTS SPEED (gcc4.2.1 on 2GHz iMac):
Run on the 5*10^7 odd numbers 1..99999999,
WorleyIsPrime32() runs in 63 seconds.
SmithIsPrime32() runs in 73 seconds.
CombinedIsPrime64() runs in 76 seconds.
Although I'd expected the speeds to be ordered the way they are, I had not expected
these numbers -- I'd expected Worley to be almost twice as fast as Smith,
but experimentally it is only about 16% faster. All correctly find 5761455 primes
(where I'm now also including 2=prime).

64-BIT TESTS:
If CombinedIsPrime64() is run on the 31894014 odd strong pseudoprimes
below 2^64, it correctly identifies them
all as both composite and spsp(2). The runtime is
259 sec if only the Lehmer subtest is run
164 sec if only the spsp(2) subtest is run
414 sec if both tests are run (which you need for infallibility)
from which I deduce that processing the input & other extraneous stuff
took 6 seconds, so that the speed was
253 sec for Lehmer only
158 sec for spsp(2) only
408 sec for combined test
on a 2GHz Apple iMac Core Duo, while other jobs were also running; this was not the
sole job (which has been correctly criticized as "not very scientific," but the environments were all similar).
Note my Lehmer test actually ran in "1.60 Selfridges" experimentally, i.e.
253/158=1.60. This somewhat surprised me, and it indicates my (unfinished)
attempt to build a 3-Selfridge infallible test for n<2^64 using spsp tests only
was a waste of time, although I currently believe that test is possible and
will be found once my computer grinds enough. The machine timings
keep surprising me, so maybe it won't be a waste of time after all.

Here are C code excerpts.
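The excerpts rely on some typedefs and macros the post leaves implicit (uint, uint64, TRUE, the iso646 "and", etc.). A plausible preamble, my reconstruction rather than Smith's actual header, would be:

```c
#include <assert.h>
#include <stdint.h>
#include <iso646.h>   /* provides "and" as a synonym for && */

typedef uint8_t  uint8;
typedef int8_t   int8;
typedef uint32_t uint32;
typedef uint64_t uint64;
typedef unsigned int uint;

#define TRUE  1
#define FALSE 0
```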

//This can be sped up (no loop needed) on many processors using
//"count trailing zeros" instruction and barrel-shifter.
#define evenize(e,s) for(s=0; (e&1)==0; e/=2){ s++; }
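As the comment says, the loop can be replaced by a count-trailing-zeros instruction. A sketch using the GCC/Clang builtin (function name mine; assumes e != 0, since ctz of 0 is undefined):

```c
#include <stdint.h>

/* Strip all factors of 2 from e, reporting how many were stripped in *s.
   Equivalent to the evenize() macro above, but loop-free. Assumes e != 0. */
static inline uint64_t evenize64(uint64_t e, int *s){
  int z = __builtin_ctzll(e); /* number of trailing zero bits of e */
  *s = z;
  return e >> z;              /* the odd part of e */
}
```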

#define JACTABSZ 512
static uint JACTABCONSTRUCTED = FALSE;
int8 jactab[JACTABSZ*(JACTABSZ+3)/2];
static int tabkron2[] = {0,1,0,-1, 0,-1,0,1};

//Assumes b is odd. Computes Legendre-Jacobi symbol (a|b). jacobi(a,1)=1.
//Otherwise: if a==0 then jacobi(a,b)=0, else jacobi(a,b)=jacobi(a, b%(4*abs(a))),
//which allows preliminary reduction for b much larger than can fit in one uint.
//And if b>0 then jacobi(a,b)=jacobi(a%b, b) which allows preliminary reduction
//for a much larger than can fit in one int.
int jacobi(int a, uint b){
  int v,r,j;
  if((b&1)==0){ return(0); } //b must be odd.
  if(JACTABCONSTRUCTED and b<JACTABSZ and abs(a)<=b){ //Table lookup
    r = b*(b-1)/2;
    return( jactab[r+a+b] );
  }
  if(b==1){ return(1); }

  j=1;
  for(;;){
    assert(b&1);
    if(a==0){ if(b>1){ return(0); } return(j); }
    if(a<0){ a = -a; if((b&3)==3){ j = -j; }}
    evenize(a,v);
    if(v&1){ j *= tabkron2[b&7]; }
    if(a&b&2){ j = -j; }
    r=a; a = b%r; b=r;
    if(a>r/2){ a -= r; } //this line optional
  }
}

void buildjactab(){ //builds initial jacobi lookup table so future jacobi(a,b)'s can be fast
  int a,b,k;
  JACTABCONSTRUCTED = FALSE;
  for(b=1; b<JACTABSZ; b+=2){
    for(a= -b; a<=b; a++){
      k = b*(b-1)/2;
      jactab[k+a+b] = jacobi(a,b);
    }
  }
  JACTABCONSTRUCTED = TRUE;
}

inline uint64 mulmod64(uint64 a, uint64 b, uint64 c){ //returns d = a*b mod c
  uint64 d; //Intel 64-bit assembly code.
  //Note: div faults unless the quotient (a*b)/c fits in 64 bits; callers here keep a,b<c.
  asm ("mov %1, %%rax;" /* put a into rax */
       "mul %2;"        /* mul a*b -> rdx:rax */
       "div %3;"        /* (a*b)/c -> quotient in rax, remainder in rdx */
       "mov %%rdx, %0;" /* store remainder in d */
       :"=r"(d)                /* output to d */
       :"r"(a), "r"(b), "r"(c) /* inputs */
       :"%rax", "%rdx"         /* clobbered registers */
      );
  return( d );
}
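For compilers with GCC/Clang's unsigned __int128, a portable equivalent of the assembly routine is possible (a sketch; unlike the div-based version, it cannot fault when the quotient exceeds 64 bits):

```c
#include <stdint.h>

/* Portable a*b mod c via a 128-bit intermediate product. */
static inline uint64_t mulmod64_portable(uint64_t a, uint64_t b, uint64_t c){
  return (uint64_t)(((unsigned __int128)a * b) % c);
}
```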

inline uint64 submod64(uint64 a, uint64 b, uint64 c){ //returns d = a-b mod c
  if(a>=b){ return(a-b); }
  return((c-b)+a);
}

inline uint32 mulmod32(uint32 a, uint32 b, uint32 c){ //returns d = a*b mod c
  uint32 d;
  uint64 q;
  q = a; q *= b; d = q%c; return(d);
}

//computes g^e mod m.
uint64 modularexp64(uint64 g, uint64 e, uint64 m){
  uint64 A, gpow[8];
  int j;
  uint es;
  if(e<=1){ if(e==0){ return(1); } return(g); }
  gpow[0] = 1;
  gpow[1] = g;
  //with 4-bit table: 14 widemuls to compute table then 16 to compute modular exp, total=30.
  //with 3-bit table: 6 widemuls to compute table then 22 to compute modular exp, total=28.
  //(In both cases beyond the 63 squarings that need to happen regardless.)
  //Hence using 3-bit table, presumed faster.
  for(j=2; j<8; j++){
    gpow[j] = mulmod64(g, gpow[j-1], m);
  }
  for(j=63; j>=0; j-=3){ //could speed up with assembly "count leading zeros" instruction.
    es = e>>j;
    if(es!=0) break;
  }
  A = gpow[es&7];
  while(j>0){
    A = mulmod64(A, A, m);
    A = mulmod64(A, A, m);
    A = mulmod64(A, A, m);
    j -= 3;
    es = (e>>j)&7;
    //prefacing this line with "if(es)" is optional; the "if" gives 2-3% speedup on my machine:
    if(es){ A = mulmod64(A, gpow[es], m); }
  }
  return(A);
}

//computes g^e mod m.
uint32 modularexp32(uint32 g, uint32 e, uint32 m){
  uint32 A, gpow[8];
  int j;
  uint es;
  if(e<=1){ if(e==0){ return(1); } return(g); }
  gpow[0] = 1;
  gpow[1] = g;
  //with 3-bit table: 6 widemuls to compute table then 11 to compute modular exp, total=17.
  //with 2-bit table: 2 widemuls to compute table then 16 to compute modular exp, total=18.
  //(In both cases beyond the 31 squarings that need to happen regardless.)
  //Hence using 3-bit table, presumed faster.
  for(j=2; j<8; j++){
    gpow[j] = mulmod32(g, gpow[j-1], m);
  }
  for(j=30; j>=0; j-=3){ //could speed up with assembly "count leading zeros" instruction.
    es = e>>j;
    if(es!=0) break;
  }
  A = gpow[es&7];
  while(j>0){
    A = mulmod32(A, A, m);
    A = mulmod32(A, A, m);
    A = mulmod32(A, A, m);
    j -= 3;
    es = (e>>j)&7;
    //prefacing this line with "if(es)" is optional; the "if" gives 3% slowdown on my machine:
    A = mulmod32(A, gpow[es], m);
  }
  return(A);
}

//Phil Carmody & I concluded after conversing that strongpsp(p) should return
//not merely TRUE/FALSE, but instead the sqrt(-1) mod p that it finds.
//If it fails to find such a sqrt (but p is still a probable prime),
//it returns 1. If it proves p composite, it returns 0.
//Hence TRUE (i.e. nonzero) iff p is a (Miller-Rabin) strong pseudoprime to base B.
//The advantage of this is that if we find two square roots of -1 (mod p), namely r1 and r2,
//with neither r1==r2 nor r1+r2==0 (mod p), then p must be composite; indeed, since p divides
//(r1-r2)*(r1+r2), at least one of gcd(p, r1-r2) and gcd(p, r1+r2) is a nontrivial factor.
//If p==1 (mod 4) this allows the user
//to "wring more juice" out of spsp tests than a mere TRUE/FALSE output
//would have allowed. Assumes p>B>=2 and p odd.
//(p<B can also be ok, but you had better know what you are doing.)
uint64 StrongPsp64(uint64 p, uint B){
  int s;
  uint64 A, pm1, e;
  assert(B>1 and (p&1)); //assumes p is odd and p>B>=2...
  pm1 = p-1; e = pm1;
  evenize(e,s);
  A = modularexp64((uint64)B, e, p);
  if(A==1){ return(1); }
  e=1;
  for(;;){
    if(A == pm1){ return(e); }
    s--; if(s==0){ return(0); }
    e=A; A=mulmod64(e,e,p);
  }
}
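To illustrate the factoring remark in the comments above StrongPsp64: since p divides (r1-r2)*(r1+r2), at least one of gcd(p, r1-r2), gcd(p, r1+r2) is a nontrivial factor of p when r1 != r2 and r1+r2 != 0 (mod p). For example, mod 65 both 8 and 18 square to -1, and gcd(65, 18-8) = 5. A minimal sketch (helper names mine, not from the post):

```c
#include <stdint.h>

static uint64_t gcd64(uint64_t a, uint64_t b){
  while(b){ uint64_t t = a % b; a = b; b = t; }
  return a;
}

/* Given two essentially different square roots of -1 mod p, recover a factor. */
uint64_t factor_from_sqrts(uint64_t p, uint64_t r1, uint64_t r2){
  uint64_t d = gcd64(p, (r1 > r2) ? r1 - r2 : r2 - r1);
  if(d == 1 || d == p){ d = gcd64(p, (r1 + r2) % p); } /* try the other gcd */
  return d;
}
```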

uint32 StrongPsp32(uint32 p, uint B){
  int s;
  uint32 A, pm1, e;
  assert(B>1 and (p&1)); //assumes p is odd and p>B>=2...
  //p<B can also be ok but you had better know what you are doing
  pm1 = p-1; e = pm1;
  evenize(e,s);
  A = modularexp32((uint32)B, e, p);
  if(A==1){ return(1); }
  e=1;
  for(;;){
    if(A == pm1){ return(e); }
    s--; if(s==0){ return(0); }
    e=A; A=mulmod32(e,e,p);
  }
}

uint hash63778tab[128] = {//these 50 numbers are the only nonprime odd spsp{2,63778} below 2^32:
979363153, 0, 0, 643552909, 0, 8725753, 0, 0, 0, 0, 329153653, 48191653, 0, 8036033, 0,
3116107, 0, 0, 839280691, 0, 450866021, 0, 0, 0, 0, 0, 0, 2269093, 0, 0, 0, 0, 0, 2693739751,
0, 2717428033, 1464568381, 0, 0, 0, 0, 1714322377, 1140573601, 0, 0, 63318169, 0, 0,
4157008813, 0, 0, 220729, 0, 0, 0, 0, 0, 3207297773, 0, 0, 0, 385319089, 1930534453, 0, 0,
0, 1054999441, 787085857, 0, 0, 2017509601, 0, 0, 967287751, 1272866167, 0, 3548378341,
3274264033, 0, 0, 0, 2206095589, 2766006253, 453366029, 176030977, 126132553, 0, 0, 0, 0,
3242533897, 0, 1121176981, 0, 2181961, 0, 0, 17208601, 0, 0, 0, 0, 48316969, 741751,
724969087, 0, 0, 0, 43661257, 0, 0, 0, 0, 0, 0, 1153164097, 1104194521, 0, 3323829169, 0,
765378241, 22591301, 1894909141, 3900327241, 60547831, 0, 1328256247, 0 };

//Speed, on 2GHz intel, gcc 4.2.1, is about 25 minutes to test 2^30 consecutive integers...
//which I must say I consider disappointing.
uint SmithIsPrime32(uint32 p){ //TRUE iff p is a prime and <=32 bits long.
  uint hash;
  if((p&1)==0){ return(p==2); }
  if(p==1){ return(FALSE); } //guard: StrongPsp32 assumes p>B>=2
  if( !StrongPsp32(p, 2) ){ return(FALSE); }
  if(p<2047){ return(TRUE); }
  if( !StrongPsp32(p, 63778) ){ return(FALSE); }
  hash = ((p*2869299540U)>>25)&127;
  if(hash63778tab[hash]==p){ return(FALSE); }
  return(TRUE);
}

uint8 WorleySPRPlookup[1024]={
5,6,11,15,76,15,11,26,10,35,29,13,
22,6,14,3,21,24,30,13,6,11,12,6,11,11,5,11,19,3,13,
14,3,6,17,26,6,13,13,13,2,6,5,2,11,15,7,3,10,37,43,
18,33,5,37,13,5,3,37,14,26,5,6,15,19,45,11,24,2,6,
23,2,11,10,2,15,5,5,15,15,15,55,5,92,10,5,18,10,2,
66,22,21,2,42,13,5,17,3,2,6,2,3,2,13,3,14,14,42,6,7,
39,6,15,3,24,5,3,11,14,22,11,5,26,22,14,2,28,6,19,7,
7,10,2,45,40,11,31,2,11,17,33,12,23,2,14,10,13,15,
13,6,3,3,3,38,7,23,18,7,6,10,10,5,11,5,6,2,2,11,6,
12,21,5,43,6,11,15,21,55,38,13,2,2,17,10,13,13,13,7,
10,5,13,3,29,14,11,23,18,5,2,7,2,14,10,2,2,20,10,7,
18,7,6,5,3,10,15,57,18,10,26,12,13,6,17,6,3,3,10,13,
15,44,35,30,26,13,2,13,2,31,37,5,10,31,2,15,5,7,17,
33,17,2,21,23,7,5,17,6,40,17,11,56,30,30,10,3,15,11,
2,15,3,5,2,2,17,22,2,10,2,7,12,13,5,22,5,20,17,7,6,
18,18,6,19,6,5,21,24,2,10,2,2,11,19,42,2,5,17,7,2,42,
7,41,5,5,23,15,11,2,34,10,24,11,12,14,11,2,15,30,28,
19,6,10,12,5,6,45,7,13,11,12,15,13,11,13,2,23,2,33,
15,15,11,10,24,26,13,31,13,7,11,23,5,12,17,14,7,29,
29,15,22,19,14,11,15,19,2,15,2,28,13,20,14,15,60,5,
2,12,69,24,20,7,18,2,13,5,6,10,14,6,24,3,11,23,5,13,
88,13,14,19,3,5,3,21,2,11,6,31,42,18,73,2,7,3,2,3,23,
11,6,6,5,6,14,11,2,11,7,10,7,38,39,5,15,2,19,5,7,22,
13,7,11,24,10,20,14,3,2,10,15,3,13,15,29,11,10,3,6,5,
30,13,26,6,6,17,29,10,19,15,19,21,46,6,17,3,14,3,2,6,
17,26,34,2,11,5,29,19,3,2,19,7,11,14,29,15,7,33,14,21,
12,7,17,15,5,5,7,14,35,21,15,11,15,3,18,3,11,17,7,21,
20,3,15,11,10,3,24,47,10,5,10,5,3,42,6,2,10,21,3,2,2,
5,14,7,3,2,14,15,11,5,12,13,15,15,5,11,10,6,5,7,6,19,
5,2,6,15,5,6,3,2,24,10,10,11,14,11,7,13,3,7,3,18,20,
30,33,17,30,17,3,35,5,5,17,46,6,26,6,20,40,5,10,12,7,
17,5,12,2,6,5,10,5,7,7,7,3,17,11,10,34,10,15,2,6,3,11,
23,13,6,2,18,14,10,7,6,10,7,10,77,6,17,10,5,5,18,17,5,
15,2,11,3,10,13,28,21,20,23,6,7,14,11,3,6,18,5,5,10,6,
23,3,3,2,6,2,15,6,10,5,15,21,7,6,3,7,12,5,3,13,17,20,
21,15,7,2,14,40,20,5,11,7,7,10,10,31,11,13,21,43,3,14,
22,10,22,23,2,10,10,3,7,3,20,7,5,15,7,10,2,30,52,10,
56,14,14,2,2,3,7,5,5,10,3,10,7,15,13,3,89,21,11,5,14,
13,13,11,40,23,21,6,3,20,14,24,6,2,28,6,5,5,6,5,20,7,
13,29,26,31,5,11,3,24,5,7,5,29,38,3,19,13,6,2,22,10,7,
22,5,29,7,11,6,3,2,37,7,7,6,19,12,2,91,7,2,6,56,52,7,
11,10,31,2,6,11,78,34,15,2,53,14,22,31,22,41,2,23,6,
39,6,31,15,15,10,7,30,2,19,6,43,11,38,6,6,3,26,7,10,
11,5,2,11,5,22,22,3,15,2,5,6,24,10,17,7,10,11,2,23,13,
5,11,11,6,21,18,3,21,10,21,2,11,5,7,44,43,31,10,13,3,
2,10,3,7,2,47,13,12,2,13,5,11,2,5,7,11,19,5,10,17,22,
26,2,2,7,2,2,6,6,3,6,22,15,13,57,13,12,2,20,3,5,13,3,
13,15,6,15,2,21,11,10,11,15,2,22,23,10,6,44,7,2,10,5,
31,6,2,5,22,15,5,17,11,2,6,6,13,23,6,5,11,7,28,29,22,
2,7,11,6,35,24,2,3,2,17,13,3,6,2,14,5,11,11,7,23,13,11,
2,13,13,10,12,23,10,18,5,60,10,5,26,5,29,3,20,7,21,47
};

//The following is based on Steve P. Worley 2009 -- i.e., he computed the above hash table --
//except I've replaced his StrongPsp32 routine with my superior code and repaired a bug
//whereby Worley's code thought 3 and 43 were composite.
//(One could probably shrink Worley's hash table by doing a lot more computing.)
uint WorleyIsPrime32(uint32 p){ //TRUE iff p is a prime and <=32 bits long.
  if((p&1)==0){ return(p==2); }
  if(p==1){ return(FALSE); } //guard: StrongPsp32 assumes p>B>=2
  if(p<2047){ return( StrongPsp32(p,2) ); }
  return( StrongPsp32(p, (uint)WorleySPRPlookup[(((uint32)986094133)*p)>>22]) );
}

//TRUE iff n is a Lehmer-Smith pseudoprime (includes all primes,
//and some, but hopefully relatively few, composites).
//D.H. Lehmer: An extended theory of Lucas' functions, Ann. Math. 31,3 (1930) 419-448
//revived and extended E. Lucas' theory of pseudoprimality testing, in particular devising the famous
//necessary+sufficient "Lucas-Lehmer criterion" for the primality of a Mersenne number,
//which runs in "1 Selfridge" worth of time. The present code by Warren D. Smith (March 2013)
//runs in "2 Selfridges" worth of time asymptotically in theory, but in 1.60 Selfridges
//experimentally on my machine. It is applicable to all odd numbers n
//(not merely those of Mersenne form), and is only necessary, not sufficient, for n's primality.
//It ultimately is also based on Lehmer's same theory, although explaining exactly
//how it arises is a bit intricate. My design goal was to achieve the best known speed
//for this family of tests while needing no explicit modular-inverse computations and
//demanding only one Jacobi symbol, not two. If combined with a StrongPsp(2) test, this test
//is FLAWLESS for odd n<2^64.
uint CleanLehmerSmithPsp64(uint64 n){
  uint64 a4,C,e,es,A0,A1;
  int j,D;
  assert(n&1);
  //Step 1: find odd D so that jacobi(D*(D+8), n)=-1:
  for( D = -3; ; D = ((D>0)?-D-2:2-D) ){ //try D = -3, 5, -7, 9, -11, ...
    j = D*(D+8);
    a4 = 4*abs(j);
    j = jacobi(j, (int)(n % a4));
    if(j<=0){
      if(j==0){ return(FALSE); } //gcd(D*(D+8), n)>1; fine since only called with n>=2047
      break;
    }
  }
  assert(abs(D)&1);
  //Step 2: Compute e = (n+1)/2:
  e = n/2; e++;
  //Step 3: Compute C = (D+8)*(n+1)/2 - 2 (mod n):
  if(D < -8){ C = -8-D; C = n-C; }else{ C = 8+D; }
  C = mulmod64(e, C, n);
  C = submod64(C, 2ULL, n);
  //We assume/assert that n=odd, D=odd, and that
  //|D| is small enough so that we need not worry about integer overflow.
  for(j=63; j>=0; j--){ //Can speed up with assembly "count leading zeros" instruction.
    es = e>>j; if(es!=0) break;
  }
  //Step 4: Lehmer test main computation:
  A0 = 2; A1 = C;
  while(j>=0){
    es = e>>j;
    if(es&1){ //unlike for modular exponentiation, both the 0 and 1 cases take the same time,
      A0 = mulmod64(A0, A1, n); A0 = submod64(A0, C, n);
      A1 = mulmod64(A1, A1, n); A1 = submod64(A1, 2ULL, n);
    }else{ //hence any attempt (like for mod.exp.) to go to, e.g., radix-8 is pointless.
      A1 = mulmod64(A0, A1, n); A1 = submod64(A1, C, n);
      A0 = mulmod64(A0, A0, n); A0 = submod64(A0, 2ULL, n);
    }
    j--;
  }
  if( mulmod64(C, A0, n) != mulmod64(2, A1, n) ){ return(FALSE); }
  return(TRUE); //In place of this line, it is also
  //possible to add more code here which takes almost no extra time but
  //would improve the truth rate. But for 64-bit purposes we already get zero failures,
  //so to hell with it.
}

uint CombinedIsPrime64(uint64 n){
  if((n&1)==0){ return(n==2); }
  if(n==1){ return(FALSE); } //guard: StrongPsp64 assumes n>B>=2
  if( !StrongPsp64(n,2) ){ return(FALSE); }
  if( n<2047 ){ return(TRUE); }
  return( CleanLehmerSmithPsp64(n) );
}
Message 4 of 7, Mar 17, 2013
--- In primenumbers@yahoogroups.com, "WarrenS" <warren.wds@...> wrote:

> } //Step 4: Lehmer test main computation:
> A0 = 2; A1 = C;
> while(j>=0){
> es = e>>j;
> if(es&1){ //unlike for modular exponentiation, both the 0 and 1 cases take same time
> A0 = mulmod64(A0, A1, n); A0 = submod64(A0, C, n);
> A1 = mulmod64(A1, A1, n); A1 = submod64(A1, 2ULL, n);
> }else{ //hence any attempt (like for mod.exp.) to go to, e.g, radix-8, is pointless.
> A1 = mulmod64(A0, A1, n); A1 = submod64(A1, C, n);
> A0 = mulmod64(A0, A0, n); A0 = submod64(A0, 2ULL, n);
> }
> j--;
> }
> if( mulmod64(C, A0, n) != mulmod64(2, A1, n) ){ return(FALSE); }
> return(TRUE); //In place of this line, it also is
> //possible to add more code here which takes almost no extra time but
> //would improve the truth rate. But for 64-bit purposes we already get zero failures
> //so to hell with it.
> }
>

This is almost test 3.6.7 from C&P, but I can't figure out how you are choosing D=P^2-4*Q. It could be similar to Selfridge's choice, or Pomerance's choice, or Pari-gp's ispseudoprime() test.

Paul
Message 5 of 7, Mar 17, 2013
> This is almost test 3.6.7 from C&P, but I can't figure out how you are choosing D=P^2-4*Q. It could be similar to Selfridge's choice, or Pomerance's choice, or Pari-gp's ispseudoprime() test.
> Paul Underwood

--it is this.
There is a sequence of odd numbers; I used -3, 5, -7, 9, -11, ...
but other sequences also work, and which is best is not clear.
I have not tried many experiments, basically because my first experiment worked.
You might want to try other sequences to see which yields the fewest Lehmer
pseudoprimes.

I pick the first eligible D in that sequence and fix Q=2.
"Eligible" means jacobi( D*(D+8), n ) = -1.

What you are calling P^2 is then determined by D and Q, namely P^2 = D+8.
Note P^2 is not necessarily a squared integer for me -- in fact it usually will not be --
indeed it need not even be a square mod n.

What advantages accrue from this choice of D, P^2, and Q?
1. Q is neither +1 nor -1; both are good to avoid, since empirically they yield a lot of pseudoprimes.
2. Q's multiplicative inverse is obviously (n+1)/2, hence need not be computed.
3. I ask for only one Jacobi condition to be satisfied for eligibility, not two.
Facts 1-3 keep the code simple and fast, not that it matters hugely.
And it turns out this is good enough to yield zero failures for n<2^64.
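A quick sanity check of fact 2: for odd n the inverse of Q=2 mod n is (n+1)/2, since 2*((n+1)/2) = n+1 = 1 (mod n). A tiny helper (name mine) makes this testable:

```c
#include <stdint.h>

/* For odd n, returns (n+1)/2, the multiplicative inverse of 2 mod n. */
static inline uint64_t inv2_mod(uint64_t n){ return n/2 + 1; }
```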

Were we to use basically the same algorithm for larger n>2^64, then
it becomes of interest to add a few extra bells and whistles to my Lehmer test
to decrease its number of pseudoprimes, in the hope of making
the failure-free region for n as large as possible. This would be partly
an experimental effort. I have already done some experiments of that nature, but
I consider them too incomplete to talk about yet. Identifying the first
failure-n for the combined test should be extremely difficult.
Even deciding whether any failure-n exists is likely to be difficult, and I suspect the
first is at least 100 digits long.

The top reason all this code is so slow seems to be that my (and many) computers
have a slow-as-molasses "divide" instruction.

Message 6 of 7, Mar 18, 2013
--- On Mon, 3/18/13, WarrenS <warren.wds@...> wrote:
> Note my Lehmer test actually ran in "1.60 Selfridges"
> experimentally
...
>   while(j>0){
>     A = mulmod64(A, A, m);
>     A = mulmod64(A, A, m);
>     A = mulmod64(A, A, m);
>     j -= 3;
>     es = (e>>j)&7;
>     //prefacing this line with "if(es)" is
> optional; the "if" gives 2-3% speedup on my machine:
>
>     if(es){ A = mulmod64(A, gpow[es], m); }
>   }

For the fixed g==2 case, this is suboptimal, running at about 1.3x (31/24 = (3+7/8)/3) the cost it could have. And 1.3 * 1.6 = 2, as expected.

Use balanced residues (-p/2 .. +p/2), and your multiply/squaring can be accompanied by a doubling (multiplication by g=2) almost for free. Then at each single bit step you either do a mulmod() or a mulmoddouble().

Balanced residues work best when using the mixed floating-point/fixed-point mulmod algorithm. (Find a*b*(1/p) on the FPU, round to nearest, multiply by p in integers, subtract its bottom bits from the bottom bits of integer (a*b).)

Note, however, that if you're interested in speed, you'll want a sqmod() operation as well as a mulmod() operation, there's no point in creating unnecessary duplicates. A good C compiler might be able to perform this optimisation for you if all the functions are inline and written in a compiler-friendly way, but sometimes it's just best to keep things explicitly simple.
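A sketch of the mixed floating-point/fixed-point mulmod described above (my reconstruction, not Phil's code; it truncates the FPU quotient and fixes up afterward rather than rounding to nearest, and assumes p is small enough, say below 2^52, that the approximate quotient is off by at most a few):

```c
#include <stdint.h>

/* a*b mod p: approximate the quotient on the FPU, then recover the exact
   remainder from the low 64 bits of the integer products. */
uint64_t mulmod_fp(uint64_t a, uint64_t b, uint64_t p){
  uint64_t q = (uint64_t)((double)a * (double)b / (double)p); /* approx quotient */
  uint64_t r = a*b - q*p;        /* exact mod 2^64; true remainder is small */
  while((int64_t)r < 0) r += p;  /* correct for q being slightly too big */
  while(r >= p) r -= p;          /* ...or slightly too small */
  return r;
}
```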

Phil
--
() ASCII ribbon campaign () Hopeless ribbon campaign
/\ against HTML mail /\ against gratuitous bloodshed

[stolen with permission from Daniel B. Cristofani]
Message 7 of 7, Mar 18, 2013
> For the fixed g==2 case, this is suboptimal,

--I also thought of the idea of making a special modular exponentiation routine
(B^x mod m) for the special case B=2, but using different ideas than Carmody suggested.
His ideas were based on a special doubling routine; mine were based on bitshifting.

It turned out my approach is 3% faster than
Carmody's for a 32-bit uint 2^x mod m based primetest; unsurprisingly, both
were speedups versus the general-base routine.
Also, my approach is 6% faster than Carmody's for 64 bits.
On my machine anyhow... one might be able to play with the compiler and
loop unrolling and such to make his approach faster (they are close), or it might be
better to hybridize both ideas.

Carmody also suggested making a special squaring routine, not just a multiply.
This seems to yield a small (perhaps nonexistent?) speedup.

Hence 2^x mod m now runs in only about "0.68 Selfridges," while B^x mod m
would be 1 Selfridge if B>=3.
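The bitshifting idea for base 2 can be sketched as follows (my own reconstruction, not Smith's actual routine): in square-and-multiply, the "multiply by 2" step becomes a left shift plus a conditional subtract, so only the squarings need a widening multiply and a division.

```c
#include <stdint.h>

/* 2^e mod m for 32-bit m > 1; squarings use a 64-bit product, while the
   multiply-by-2 is a shift and conditional subtract (no second widemul). */
uint32_t pow2mod32(uint32_t e, uint32_t m){
  uint32_t A = 1;
  int j;
  if(m == 1) return 0;
  for(j = 31; j >= 0 && ((e >> j) & 1) == 0; j--) ; /* find top set bit of e */
  for(; j >= 0; j--){
    uint64_t t = (uint64_t)A * A % m;  /* square */
    if((e >> j) & 1){
      t <<= 1;                         /* multiply by base 2 via shift */
      if(t >= m) t -= m;
    }
    A = (uint32_t)t;
  }
  return A;
}
```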

As a result, my 32-bit and 64-bit prime tests are now both faster than Worley's 32-bit
test on average!
Indeed my 32-bit test is currently 35% faster on average (random odd numbers), although
presumably 2X slower in the worst case (primes only) than Worley's.