- --- On Mon, 3/18/13, WarrenS <warren.wds@...> wrote:
> Note my Lehmer test actually ran in "1.60 Selfridges"

...

> experimentally

> while(j>0){

For the fixed g==2 case, this is suboptimal, running at about 1.3x the cost it could have: 31/24 = (3+7/8)/3, i.e. three squarings plus, on average, 7/8 of a table multiply per 3-bit window, versus the three squarings alone that would suffice if multiplication by a small power of 2 were nearly free. And 1.3 * 1.6 = 2, as expected.

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> j -= 3;

> es = (e>>j)&7;

> //prefacing this line with "if(es)" is

> optional; the "if" gives 2-3% speedup on my machine:

>

> if(es){ A = mulmod64(A, gpow[es], m); }

> }
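For reference, the quoted lines appear to be the inner loop of a 3-bit fixed-window exponentiation. A self-contained sketch of the whole routine, with the table setup and top-window handling filled in as my own reconstruction (mulmod64 is implemented here portably via the GCC/Clang unsigned __int128 extension), might look like:

```c
#include <stdint.h>

/* Assumed primitive: (a*b) mod m for 64-bit operands, done here
   via a 128-bit intermediate (GCC/Clang extension). */
static uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

/* 3-bit fixed-window left-to-right exponentiation: g^e mod m.
   A sketch of the quoted loop with the setup filled in. */
uint64_t powmod_window3(uint64_t g, uint64_t e, uint64_t m) {
    uint64_t gpow[8];                   /* gpow[i] = g^i mod m */
    gpow[0] = 1 % m;
    for (int i = 1; i < 8; i++)
        gpow[i] = mulmod64(gpow[i-1], g, m);

    if (e == 0) return 1 % m;
    int j = 63;
    while (!((e >> j) & 1)) j--;        /* index of the top set bit */
    j = (j / 3) * 3;                    /* align remaining bits into 3-bit windows */
    uint64_t A = gpow[(e >> j) & 7];    /* top (possibly short) window */
    while (j > 0) {
        A = mulmod64(A, A, m);
        A = mulmod64(A, A, m);
        A = mulmod64(A, A, m);
        j -= 3;
        uint64_t es = (e >> j) & 7;
        if (es) A = mulmod64(A, gpow[es], m);
    }
    return A;
}
```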

Use balanced residues (-p/2 .. +p/2), and your multiply/squaring can be accompanied by a doubling (multiplication by g=2) almost for free. Then at each single bit step you either do a mulmod() or a mulmoddouble().
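One way to realise that per-bit dispatch, sketched here with ordinary unsigned residues and the doubling done by a shift-and-reduce rather than folded into a balanced-residue multiply (sqmoddouble() is an illustrative name of mine, not from the original):

```c
#include <stdint.h>

/* (a*a) mod m, via a 128-bit intermediate (GCC/Clang extension). */
static uint64_t sqmod(uint64_t a, uint64_t m) {
    return (uint64_t)(((unsigned __int128)a * a) % m);
}

/* (2*a*a) mod m: the squaring with a doubling folded in.
   Requires m < 2^63 so s + s cannot overflow. */
static uint64_t sqmoddouble(uint64_t a, uint64_t m) {
    uint64_t s = sqmod(a, m);
    s += s;                 /* multiply by g = 2 */
    if (s >= m) s -= m;     /* one conditional subtract suffices: s < 2m */
    return s;
}

/* 2^e mod m by single-bit left-to-right exponentiation:
   each exponent bit costs one sqmod() or one sqmoddouble(). */
uint64_t pow2mod(uint64_t e, uint64_t m) {
    if (e == 0) return 1 % m;
    int j = 63;
    while (!((e >> j) & 1)) j--;   /* top set bit */
    uint64_t A = 2 % m;            /* handles the top bit */
    while (j-- > 0)
        A = ((e >> j) & 1) ? sqmoddouble(A, m) : sqmod(A, m);
    return A;
}
```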

Balanced residues work best when using the mixed floating-point/fixed-point mulmod algorithm. (Find a*b*(1/p) on the FPU, round to nearest, multiply by p in integers, subtract its bottom bits from the bottom bits of integer (a*b).)

Note, however, that if you're interested in speed, you'll want a dedicated sqmod() operation as well as a mulmod() operation: a squaring can skip the duplicated cross products that a general multiply computes. A good C compiler might be able to perform this optimisation for you if all the functions are inline and written in a compiler-friendly way, but sometimes it's best to keep things explicitly simple.

Phil

--

() ASCII ribbon campaign () Hopeless ribbon campaign

/\ against HTML mail /\ against gratuitous bloodshed

[stolen with permission from Daniel B. Cristofani]

--- WarrenS <warren.wds@...> wrote:

> For the fixed g==2 case, this is suboptimal,

--I also thought of making a special modular exponentiation routine

(B^x mod m) for the special case B=2, but using different ideas from those Carmody suggested.

His ideas were based on a special doubling routine. Mine were based on bitshifting.

(He also muttered about both-sign integers and use of floating point, but I avoided those

ideas entirely.)
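Warren doesn't show his routine in the thread; one plausible reading of the "bitshifting" idea is a windowed 2^x mod m in which the table multiply A * 2^es becomes es shift-and-reduce doublings. A sketch under that assumption (requires m < 2^63 so A + A cannot overflow; squaring still uses a 128-bit intermediate):

```c
#include <stdint.h>

/* (a*a) mod m via a 128-bit intermediate (GCC/Clang extension). */
static uint64_t sqmod64(uint64_t a, uint64_t m) {
    return (uint64_t)(((unsigned __int128)a * a) % m);
}

/* 2^e mod m with a 3-bit window whose table multiply is replaced
   by shifts: A * 2^es is just es doublings.  A sketch only;
   Warren's actual routine is not shown in the thread. */
uint64_t pow2mod_shift(uint64_t e, uint64_t m) {
    if (e == 0) return 1 % m;
    int j = 63;
    while (!((e >> j) & 1)) j--;        /* top set bit */
    j = (j / 3) * 3;                    /* align into 3-bit windows */
    uint64_t A = (uint64_t)1 << ((e >> j) & 7);  /* top window: a pure shift */
    A %= m;
    while (j > 0) {
        A = sqmod64(A, m);
        A = sqmod64(A, m);
        A = sqmod64(A, m);
        j -= 3;
        uint64_t es = (e >> j) & 7;
        while (es--) {                  /* multiply by 2^es via doublings */
            A += A;
            if (A >= m) A -= m;
        }
    }
    return A;
}
```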

It turned out my approach is 3% faster than

Carmody's for a 32-bit uint 2^x mod m based primetest; unsurprisingly, both

were speedups versus the general-base routine.

Also my approach is 6% faster than Carmody's for 64 bits.

On my machine anyhow... one might be able to play with the compiler,

loop unrolling, and such to make his approach faster (they are close), or it might be

better to hybridize both ideas.

Carmody also suggested making a special squaring routine, not just multiply.

This seems to yield a small (perhaps nonexistent?) speedup.

Hence 2^x mod m now runs in only about "0.68 selfridges" while B^x mod m

would be 1 selfridge if B>=3.

As a result, my 32-bit and 64-bit prime tests are now both faster than Worley's 32-bit

test on average!

Indeed, my 32-bit test is currently 35% faster on average (random odd numbers), although

presumably 2X slower in the worst case (primes only) than Worley's.