> This is almost the test 3.6.7 from C&P, but I can't figure out how you are choosing D=P^2-4*Q. It could be similar to Selfridge's choice or Pomerance's choice or Pari-gp's ispseudoprime() test,

--it is this.

> Paul Underwood

There is a sequence of odd numbers; I used -3, 5, -7, 9, -11, ...,

but other sequences also work, and which is best is not clear;

I have not tried many experiments, basically because my first experiment worked.

You might want to try other sequences to see which yields the fewest Lehmer

pseudoprimes.

I pick the first eligible D in that sequence and fix Q=2.

"Eligible" means jacobi( D*(D+8), n ) = -1.

What you are calling P^2 is then determined by D and Q, namely P^2=D+8.

Note P^2 is not necessarily a squared integer for me -- in fact it usually will not be --

indeed it need not even be a square mod n.

What advantages accrue from this choice of D, P^2, and Q?

1. Q is neither +1 nor -1; both are good to avoid, since empirically they yield a lot of pseudoprimes.

2. Q's multiplicative inverse mod n is obviously (n+1)/2, hence need not be computed.

3. I only ask for one jacobi condition to be satisfied for eligibility, not two.

Facts 1-3 keep the code simple and fast, not that it matters hugely.
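For concreteness, the selection step might be sketched in C like this (a sketch, not the actual test code; the function names are illustrative), with a textbook Jacobi-symbol routine:

```c
#include <stdint.h>

/* Jacobi symbol (a|n) for odd n > 0, with a allowed to be negative. */
static int jacobi(int64_t a, uint64_t n) {
    a %= (int64_t)n;
    if (a < 0) a += (int64_t)n;
    uint64_t x = (uint64_t)a;
    int t = 1;
    while (x != 0) {
        while ((x & 1) == 0) {
            x >>= 1;
            uint64_t r = n & 7;
            if (r == 3 || r == 5) t = -t;  /* (2|n) = -1 when n = 3,5 mod 8 */
        }
        uint64_t tmp = x; x = n; n = tmp;  /* quadratic reciprocity swap */
        if ((x & 3) == 3 && (n & 3) == 3) t = -t;
        x %= n;
    }
    return (n == 1) ? t : 0;
}

/* First eligible D in -3, 5, -7, 9, -11, ...: jacobi(D*(D+8), n) == -1.
   Q is fixed at 2, and "P^2" is then D+8 (a residue mod n, not
   necessarily a perfect square). */
static int64_t pick_D(uint64_t n) {
    int s = -1;
    for (int64_t k = 3; ; k += 2, s = -s) {
        int64_t D = s * k;
        if (jacobi(D * (D + 8), n) == -1) return D;
    }
}
```

For example, for n=101 the very first candidate D=-3 is already eligible, while for n=109 the search moves on to D=5.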

And it turns out this is good enough to yield zero failures for n<2^64.

Were we to use basically the same algorithm for larger n>2^64, then

it becomes of interest to add a few extra bells and whistles to my Lehmer test

to decrease its number of pseudoprimes in the hope of making

the failure-free region for n as large as possible. This would be partly

an experimental effort. I have already done some experiments of that nature, but

I consider them too incomplete to talk about as yet. Identifying the first

failure-n for the combined test should be extremely difficult.

Even deciding whether any failure-n exist is likely to be difficult, and I suspect the

first will be at least 100 digits long.

The top reason all this code is so slow seems to be that my computer (and many others)

has a slow-as-molasses "divide" instruction.


- --- On Mon, 3/18/13, WarrenS <warren.wds@...> wrote:
> Note my Lehmer test actually ran in "1.60 Selfridges"

...

> experimentally

> while(j>0){

For the fixed g==2 case, this is suboptimal, running at about 1.3x (31/24 = (3+7/8)/3) the cost it could have. And 1.3 * 1.6 = 2, as expected.

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> j -= 3;

> es = (e>>j)&7;

> // prefacing this line with "if(es)" is optional;

> // the "if" gives 2-3% speedup on my machine:

>

> if(es){ A = mulmod64(A, gpow[es], m); }

> }

Use balanced residues (-p/2 .. +p/2), and your multiply/squaring can be accompanied by a doubling (multiplication by g=2) almost for free. Then at each single bit step you either do a mulmod() or a mulmoddouble().

Balanced residues work best when using the mixed floating-point/fixed-point mulmod algorithm. (Find a*b*(1/p) on the FPU, round to nearest, multiply by p in integers, subtract its bottom bits from the bottom bits of integer (a*b).)

Note, however, that if you're interested in speed, you'll want a sqmod() operation as well as a mulmod() operation; there's no point in performing a general multiply where a dedicated squaring will do. A good C compiler might be able to perform this optimisation for you if all the functions are inline and written in a compiler-friendly way, but sometimes it's just best to keep things explicitly simple.

Phil

--

() ASCII ribbon campaign () Hopeless ribbon campaign

/\ against HTML mail /\ against gratuitous bloodshed

[stolen with permission from Daniel B. Cristofani]

> For the fixed g==2 case, this is suboptimal,

--I also thought of the idea of making a special modular exponentiation routine

(B^x mod m) for the special case B=2, but using different ideas than Carmody suggested.

His ideas were based on a special doubling routine. Mine were based on bitshifting.

(He also muttered about both-sign integers and use of floating point, but I avoided those

ideas entirely.)
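The bitshifting idea can be sketched bit-at-a-time (a minimal sketch of the idea, not the actual windowed routine), assuming odd m < 2^31 so intermediates fit in 64 bits:

```c
#include <stdint.h>

/* 2^x mod m using shifts: for base 2 the "multiply by the base" step
   is just a left shift followed by a reduction, so no table of base
   powers is needed.  Assumes odd m < 2^31 so A*A fits in 64 bits. */
static uint64_t pow2mod(uint64_t x, uint64_t m) {
    if (x == 0) return 1 % m;
    int j = 63;
    while (((x >> j) & 1) == 0) j--;   /* find the top set bit */
    uint64_t A = 2 % m;                /* that bit contributes 2^1 */
    for (j--; j >= 0; j--) {
        A = (A * A) % m;               /* square */
        if ((x >> j) & 1)
            A = (A << 1) % m;          /* multiply by 2 = shift */
    }
    return A;
}
```

A quick sanity check: pow2mod(340, 341) returns 1, since 341 = 11*31 is the smallest base-2 Fermat pseudoprime.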

Turned out my approach is 3% faster than

Carmody's for a 32-bit uint 2^x mod m based primetest; unsurprisingly, both

were speedups versus the general-base routine.

Also my approach is 6% faster than Carmody's for 64 bits.

On my machine anyhow... one might be able to play with the compiler and

loop unrolling and such to make his approach faster, since they are close; or it might be

better to hybridize both ideas.

Carmody also suggested making a special squaring routine, not just multiply.

This seems to yield a small (perhaps nonexistent?) speedup.

Hence 2^x mod m now runs in only about "0.68 selfridges" while B^x mod m

would be 1 selfridge if B>=3.

As a result, my 32-bit and 64-bit prime tests are now both faster than Worley's 32-bit

test on average!

Indeed my 32-bit test is currently 35% faster on average (random odd numbers), although

presumably 2X slower in the worst case (primes only) than Worley's.