--- On Mon, 3/18/13, WarrenS <

warren.wds@...> wrote:

> Note my Lehmer test actually ran in "1.60 Selfridges"

> experimentally

...

> while(j>0){

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> A = mulmod64(A, A, m);

> j -= 3;

> es = (e>>j)&7;

> //prefacing this line with "if(es)" is

> optional; the "if" gives 2-3% speedup on my machine:

>

> if(es){ A = mulmod64(A, gpow[es], m); }

> }

For the fixed g==2 case, this is suboptimal, running at about 1.3x (31/24 = (3+7/8)/3) the cost it could have. And 1.3 * 1.6 = 2, as expected.

Use balanced residues (-p/2 .. +p/2), and your multiply/squaring can be accompanied by a doubling (multiplication by g=2) almost for free. Then at each single bit step you either do a mulmod() or a mulmoddouble().

Balanced residues work best when using the mixed floating-point/fixed-point mulmod algorithm. (Find a*b*(1/p) on the FPU, round to nearest, multiply by p in integers, subtract its bottom bits from the bottom bits of integer (a*b).)

Note, however, that if you're interested in speed, you'll want a sqmod() operation as well as a mulmod() operation, there's no point in creating unnecessary duplicates. A good C compiler might be able to perform this optimisation for you if all the functions are inline and written in a compiler-friendly way, but sometimes it's just best to keep things explicitly simple.

Phil

--

() ASCII ribbon campaign () Hopeless ribbon campaign

/\ against HTML mail /\ against gratuitous bloodshed

[stolen with permission from Daniel B. Cristofani]