
Re: Speedups

  • KAMADA Makoto
Message 1 of 26, Aug 1, 2006
      Hello,

      The following routine is the fastest mult64x64 on my PC so far.

      void mult64x64(u64 *c, u64 *a, u64 *b) {
          multnx64(c, a, b, 64); /* the 64x64 product is just the n = 64 case */
      }

      When I improved multT, MultB64, multnx64 and MultB_T64 at Greg's
      recommendation one year ago, I found that multnx64(c,a,b,64) was
      faster than mult64x64(c,a,b) on my PC. But I did not examine
      mult64x64 at that time because it was not the most time-consuming
      routine in matsolve.

      Of course, depending on matrix weight, processor, clock speed, and
      compiler, multnx64 is not always faster than every mult64x64 routine.

      The results of test1, including multnx64, on a Pentium 4 3.06 GHz,
      Windows XP, Cygwin, and gcc-3.4.4 are:

      #define _ASMx86_32
      #define _MMX_REGS
      #define _HW_BSFL
      >gcc -march=pentium4 -O3 -ffast-math -funroll-loops -finline-functions -fomit-frame-pointer -o test1 test1.c
      >test1
      Routine mult64x64_new2 takes 7 seconds.
      Routine mult64x64_new takes 9 seconds.
      Routine mult64x64_new1 takes 8 seconds.
      Routine mult64x64_relf2_k2 takes 7 seconds.
      Routine mult64x64_mmx takes 23 seconds.
      Routine mult64x64 takes 27 seconds.
      Routine multnx64 takes 5 seconds. <- best

      If the matrix A is light, the table generation stage of multnx64
      wastes time.

      matrix A: zeros=4096 ones=0 crc=0x0000
      matrix B: zeros=4096 ones=0 crc=0x0000
      generic: zeros=4096 ones=0 crc=0x0000 time= 1.56(microsec)
      mmx: zeros=4096 ones=0 crc=0x0000 time= 1.56(microsec)
      k1: zeros=4096 ones=0 crc=0x0000 time= 0(microsec) <- best
      k2: zeros=4096 ones=0 crc=0x0000 time= 0(microsec) <- best
      k3: zeros=4096 ones=0 crc=0x0000 time= 0.15(microsec)
      new: zeros=4096 ones=0 crc=0x0000 time= 0(microsec) <- best
      new1: zeros=4096 ones=0 crc=0x0000 time= 0.31(microsec)
      new2: zeros=4096 ones=0 crc=0x0000 time= 0(microsec) <- best
      multnx64: zeros=4096 ones=0 crc=0x0000 time= 2.65(microsec)

      If the matrix A is heavy, the bit-scan process wastes time (a rough
      sketch of the bit-scan approach follows these results).

      matrix A: zeros=0 ones=4096 crc=0xb441
      matrix B: zeros=0 ones=4096 crc=0xb441
      generic: zeros=4096 ones=0 crc=0x0000 time= 5.16(microsec)
      mmx: zeros=4096 ones=0 crc=0x0000 time= 2.35(microsec) <- best
      k1: zeros=4096 ones=0 crc=0x0000 time= 6.72(microsec)
      k2: zeros=4096 ones=0 crc=0x0000 time= 5(microsec)
      k3: zeros=4096 ones=0 crc=0x0000 time= 2.5(microsec)
      new: zeros=4096 ones=0 crc=0x0000 time= 5.62(microsec)
      new1: zeros=4096 ones=0 crc=0x0000 time= 6.09(microsec)
      new2: zeros=4096 ones=0 crc=0x0000 time= 5(microsec)
      multnx64: zeros=4096 ones=0 crc=0x0000 time= 2.66(microsec)
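
      For reference, a bit-scan routine walks the set bits of each row of
      a and XORs in the corresponding rows of b, so its cost grows with
      the weight of A. A rough sketch only, not the actual mult64x64 in
      test1.c, with GCC's __builtin_ctzll standing in for the _HW_BSFL
      assembly:

      ----------------8<----------------8<----------------8<----------------
      /* Sketch of a bit-scan multiply; cost is proportional to the
         weight of a. Not the test1.c routine. */
      void mult64x64_bitscan(u64 *c, const u64 *a, const u64 *b) {
          int i;
          for (i = 0; i < 64; i++) {
              u64 t = 0, m = a[i];
              while (m) {
                  t ^= b[__builtin_ctzll(m)];  /* row of the lowest set bit */
                  m &= m - 1;                  /* clear that bit */
              }
              c[i] = t;
          }
      }
      ----------------8<----------------8<----------------8<----------------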

      If the matrix A has medium weight, branch mispredictions (with
      roughly half the bits set, the per-bit branches are essentially
      unpredictable) and the sheer number of MMX instructions slow the
      routine down, especially on the Pentium 4.

      matrix A: zeros=2051 ones=2045 crc=0xe512
      matrix B: zeros=2059 ones=2037 crc=0x95e6
      generic: zeros=2047 ones=2049 crc=0x62ff time= 11.88(microsec)
      mmx: zeros=2047 ones=2049 crc=0x62ff time= 10.15(microsec)
      k1: zeros=2047 ones=2049 crc=0x62ff time= 4.06(microsec)
      k2: zeros=2047 ones=2049 crc=0x62ff time= 3.28(microsec)
      k3: zeros=2047 ones=2049 crc=0x62ff time= 5.47(microsec)
      new: zeros=2047 ones=2049 crc=0x62ff time= 3.59(microsec)
      new1: zeros=2047 ones=2049 crc=0x62ff time= 3.75(microsec)
      new2: zeros=2047 ones=2049 crc=0x62ff time= 3.29(microsec)
      multnx64: zeros=2047 ones=2049 crc=0x62ff time= 2.5(microsec) <- best

      Regards,
      Makoto Kamada
    • KAMADA Makoto
Message 2 of 26, Aug 1, 2006
        Hello,

        Smaller version.

        ----------------8<----------------8<----------------8<----------------
        void mult64x64_k5(u64 *c, const u64 *a, const u64 *b) {
        ALIGNED16(static u64 w[256]);  /* sixteen 16-entry XOR tables */
        int i;
        /* Table build: the table for rows b[i..i+3] occupies
           w[4*i .. 4*i+15] (note 4*i == 16*(i/4)). Entries are written
           in Gray-code order, so each costs a single XOR. */
        for (i = 0; i < 64; i += 4) {
            u64 b0 = b[i];
            u64 b1 = b[i + 1];
            u64 b2 = b[i + 2];
            u64 t;
            w[4 * i     ] = 0;
            w[4 * i +  1] = t = b0;
            w[4 * i +  3] = t ^= b1;
            w[4 * i +  2] = t ^= b0;
            w[4 * i +  6] = t ^= b2;
            w[4 * i +  7] = t ^= b0;
            w[4 * i +  5] = t ^= b1;
            w[4 * i +  4] = t ^= b0;
            w[4 * i + 12] = t ^= b[i + 3];
            w[4 * i + 13] = t ^= b0;
            w[4 * i + 15] = t ^= b1;
            w[4 * i + 14] = t ^= b0;
            w[4 * i + 10] = t ^= b2;
            w[4 * i + 11] = t ^= b0;
            w[4 * i +  9] = t ^= b1;
            w[4 * i +  8] = t ^ b0;
        }
        /* Combine: each of the 16 nibbles of a[i] selects one entry
           from its table; 15 XORs per output row. */
        for (i = 0; i < 64; i++) {
            u32 t = (u32)a[i];
            u32 u = (u32)(a[i] >> 32);
            c[i] = w[         t        & 15 ] ^
                   w[ 16 + ((t >>  4) & 15)] ^
                   w[ 32 + ((t >>  8) & 15)] ^
                   w[ 48 + ((t >> 12) & 15)] ^
                   w[ 64 + ((t >> 16) & 15)] ^
                   w[ 80 + ((t >> 20) & 15)] ^
                   w[ 96 + ((t >> 24) & 15)] ^
                   w[112 + ( t >> 28      )] ^
                   w[128 + ( u        & 15)] ^
                   w[144 + ((u >>  4) & 15)] ^
                   w[160 + ((u >>  8) & 15)] ^
                   w[176 + ((u >> 12) & 15)] ^
                   w[192 + ((u >> 16) & 15)] ^
                   w[208 + ((u >> 20) & 15)] ^
                   w[224 + ((u >> 24) & 15)] ^
                   w[240 + ( u >> 28      )];
        }
        }
        ----------------8<----------------8<----------------8<----------------
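
        Note that the table build costs 14 XORs per 16-entry table and the
        combine loop a fixed 15 XORs per output row, so the running time is
        independent of the weight of A. For a quick correctness check
        against the defining formula (c[i] is the XOR of the rows b[j] over
        the set bits j of a[i]), something like the following works; this
        harness is illustrative and not part of test1.c, assuming only the
        u64 typedef from it:

        ----------------8<----------------8<----------------8<----------------
        #include <stdlib.h>

        /* Illustrative smoke test, not from test1.c: compare mult64x64_k5
           against the definition of the GF(2) matrix product. Returns 1
           on success, 0 on the first mismatching row. */
        int check_k5(void) {
            u64 a[64], b[64], c[64];
            int i, j;
            for (i = 0; i < 64; i++) {
                /* rand() is low-entropy, but adequate for a smoke test */
                a[i] = ((u64)rand() << 48) ^ ((u64)rand() << 24) ^ (u64)rand();
                b[i] = ((u64)rand() << 48) ^ ((u64)rand() << 24) ^ (u64)rand();
            }
            mult64x64_k5(c, a, b);
            for (i = 0; i < 64; i++) {
                u64 ref = 0;
                for (j = 0; j < 64; j++)
                    if ((a[i] >> j) & 1)
                        ref ^= b[j];   /* bit j of a[i] selects row b[j] */
                if (ref != c[i])
                    return 0;
            }
            return 1;
        }
        ----------------8<----------------8<----------------8<----------------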

        Results:

        >test1
        Routine mult64x64_new2 takes 7 seconds.
        Routine mult64x64_mmx takes 23 seconds.
        Routine mult64x64_k5 takes 3 seconds.

        >test_k5 2 100000 257
        matrix A: zeros=2051 ones=2045 crc=0xe512
        matrix B: zeros=2059 ones=2037 crc=0x95e6
        mmx: zeros=2047 ones=2049 crc=0x62ff time= 10.31(microsec)
        new2: zeros=2047 ones=2049 crc=0x62ff time= 3.28(microsec)
        k5: zeros=2047 ones=2049 crc=0x62ff time= 1.41(microsec)

        Regards,
        Makoto Kamada
      • ivan_seculab
Message 3 of 26, Aug 1, 2006
          > Hello,
          >
          > Smaller version.
          > ...

          Dear KAMADA Makoto!

          Yes, your algorithm is the best;
          I simply made an implementation for MMX registers ;-)
          File test1.c updated, ITERATOR = x*8
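
          (mult64x64_k5h's source is in the updated test1.c rather than
          quoted here; roughly, the idea is to keep the XOR accumulator of
          k5's combine loop in an MMX register. The sketch below is
          illustrative guesswork: combine_mmx is a hypothetical name and
          the real k5h may differ.)

          ----------------8<----------------8<----------------8<----------------
          #include <mmintrin.h>  /* __m64, _mm_xor_si64, _mm_empty */

          /* Illustrative sketch, not the real k5h: accumulate table
             entries in an MMX register. w is the 256-entry nibble table
             built as in mult64x64_k5. */
          static void combine_mmx(u64 *c, const u64 *a, const __m64 w[256]) {
              int i, k;
              for (i = 0; i < 64; i++) {
                  u32 lo = (u32)a[i], hi = (u32)(a[i] >> 32);
                  __m64 t = w[lo & 15];
                  for (k = 1; k < 8; k++)
                      t = _mm_xor_si64(t, w[ 16 * k + ((lo >> (4 * k)) & 15)]);
                  for (k = 0; k < 8; k++)
                      t = _mm_xor_si64(t, w[128 + 16 * k + ((hi >> (4 * k)) & 15)]);
                  *(__m64 *)&c[i] = t;
              }
              _mm_empty();  /* restore FPU state after MMX use */
          }
          ----------------8<----------------8<----------------8<----------------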

          + CPU: Opteron 1.6, Linux Slackware 10.x, 64bit mode.
          + GCC: 3.4.3

          # _ASMx86_32 - Disabled
          # _MMX_REGS - Disabled
          # _HW_BSFL - Disabled
          $> gcc -O9 -s -m64 -march=opteron -otest test1.c
          $> ./test
          Routine mult64x64_k5 takes 13 seconds.
          Routine mult64x64_k5h takes 13 seconds.
          Routine mult64x64_new2 takes 71 seconds.
          Routine mult64x64_mmx failed sanity check
          Routine mult64x64 takes 67 seconds.

          # _ASMx86_32 - Enabled
          # _MMX_REGS - Enabled
          # _HW_BSFL - Disabled
          $> gcc -O9 -s -m32 -march=opteron -otest test1.c
          $> ./test
          Routine mult64x64_k5 takes 23 seconds.
          Routine mult64x64_k5h takes 11 seconds.
          Routine mult64x64_new2 takes 64 seconds.
          Routine mult64x64_mmx takes 69 seconds.
          Routine mult64x64 takes 103 seconds.

          + CPU: P4-Centrino 1.6, WinXP HE
          + MSC: Version 14.00.50727.42 for 80x86

          # _ASMx86_32 - Enabled
          # _MMX_REGS - Enabled
          # _HW_BSFL - Enabled
          Routine mult64x64_k5 takes 23 seconds.
          Routine mult64x64_k5h takes 14 seconds.
          Routine mult64x64_new2 takes 41 seconds.
          Routine mult64x64_mmx takes 125 seconds.
          Routine mult64x64 takes 143 seconds.

          --
          Regards, Ivan
        • Anton Korobeynikov
Message 4 of 26, Aug 2, 2006
            Hello Everyone

            On Tue, 01 Aug 2006 15:26:23 -0000, you wrote:

            > Yes, your algorithm is the best;
            > I simply made an implementation for MMX registers ;-)
            > File test1.c updated, ITERATOR = x*8

            Well, OK. Let's do a final comparison of mult64x64 routine speed on
            different CPUs.

            So, I'm asking everyone to submit results obtained from running test1.c
            source file.

            Thanks!

            --
            With best regards, Anton Korobeynikov.

            Faculty of Mathematics & Mechanics, Saint Petersburg State University.
          • Sten
Message 5 of 26, Aug 2, 2006

              mult64x64_k5h() is definitely the best.

              1) Machine-1. Pentium-M 735, Dothan, 1.7 GHz, L1 32 KB, L2 2048 KB

              gcc -O3 -march=pentium-m test1.c

              Routine mult64x64_k5 takes 26 seconds.
              Routine mult64x64_k5h takes 13 seconds.
              Routine mult64x64_new2 takes 42 seconds.
              Routine mult64x64_mmx takes 126 seconds.
              Routine mult64x64 takes 194 seconds.

              gcc -Os -march=pentium-m test1.c

              Routine mult64x64_k5 takes 25 seconds.
              Routine mult64x64_k5h takes 12 seconds.
              Routine mult64x64_new2 takes 42 seconds.
              Routine mult64x64_mmx takes 130 seconds.
              Routine mult64x64 takes 190 seconds.

              2) Machine-2. Pentium-4, Northwood, 2.4 GHz, L1 8 KB, L2 512 KB

              gcc -O3 -march=pentium4 test1.c

              Routine mult64x64_k5 takes 19 seconds.
              Routine mult64x64_k5h takes 11 seconds.
              Routine mult64x64_new2 takes 39 seconds.
              Routine mult64x64_mmx takes 114 seconds.
              Routine mult64x64 takes 154 seconds.

              gcc -Os -march=pentium4 test1.c

              Routine mult64x64_k5 takes 20 seconds.
              Routine mult64x64_k5h takes 13 seconds.
              Routine mult64x64_new2 takes 38 seconds.
              Routine mult64x64_mmx takes 116 seconds.
              Routine mult64x64 takes 161 seconds.

              3) Machine-3. AMD Athlon 64, NewCastle, 2.2 GHz, L1 64 KB, L2 512 KB

              Visual Studio 2005, x64 configuration (Win64),
              Full Optimizations (/Ox)

              #define ITERATOR 0x100000*14
              //#define _ASMx86_32 /* Please disable it for 64 bit mode binary. */
              //#define _MMX_REGS

              Routine mult64x64_k5 takes 16 seconds.
              Routine mult64x64_k5h takes 15 seconds.
              Routine mult64x64_new2 takes 151 seconds.
              Routine mult64x64_mmx failed sanity check
              Routine mult64x64 takes 84 seconds.

              --
              Best regards,
              Sten                          mailto:stenri@...
            • Brian Gladman
Message 6 of 26, Aug 2, 2006
                Anton Korobeynikov wrote:

                > Hello Everyone
                >
                > On Tue, 01 Aug 2006 15:26:23 -0000, you wrote:
                >
                >> Yes, your algorithm is the best;
                >> I simply made an implementation for MMX registers ;-)
                >> File test1.c updated, ITERATOR = x*8
                > Well, OK. Let's do a final comparison of mult64x64 routine speed on
                > different CPUs.
                >
                > So, I'm asking everyone to submit results obtained from running test1.c
                > source file.

                It looks as if mult64x64_k5h is the best in my environments:

                1.8GHz P4 (MSVC v8):
                Routine mult64x64_k5 takes 27 seconds.
                Routine mult64x64_k5h takes 19 seconds.
                Routine mult64x64_new2 takes 48 seconds.
                Routine mult64x64_mmx takes 152 seconds.
                Routine mult64x64 takes 162 seconds.

                1.8GHz P4 (Intel v9.1):
                Routine mult64x64_k5 takes 23 seconds.
                Routine mult64x64_k5h takes 15 seconds.
                Routine mult64x64_new2 takes 228 seconds.
                Routine mult64x64_mmx takes 162 seconds.
                Routine mult64x64 takes 182 seconds.

                AMD64 4800 X2 32bit mode (MSVC v8):
                Routine mult64x64_k5 takes 12 seconds.
                Routine mult64x64_k5h takes 8 seconds.
                Routine mult64x64_new2 takes 80 seconds.
                Routine mult64x64_mmx takes 44 seconds.
                Routine mult64x64 takes 46 seconds.

                AMD64 4800 X2 64bit mode (MSVC v8):
                Routine mult64x64_k5 takes 8 seconds.
                Routine mult64x64_k5h takes 8 seconds.
                Routine mult64x64_new2 takes 88 seconds.
                Routine mult64x64_mmx failed sanity check
                Routine mult64x64 takes 45 seconds.

                AMD64 4800 X2 32bit mode (Intel v9.1):
                Routine mult64x64_k5 takes 11 seconds.
                Routine mult64x64_k5h takes 6 seconds.
                Routine mult64x64_new2 takes 181 seconds.
                Routine mult64x64_mmx takes 42 seconds.
                Routine mult64x64 takes 44 seconds.

                AMD64 4800 X2 64bit mode (Intel v9.1):
                Routine mult64x64_k5 takes 9 seconds.
                Routine mult64x64_k5h takes 8 seconds.
                Routine mult64x64_new2 takes 178 seconds.
                Routine mult64x64_mmx failed sanity check
                Routine mult64x64 takes 37 seconds.

                Brian Gladman
              • KAMADA Makoto
Message 7 of 26, Aug 5, 2006
                  Hello,

                  Results of test1.c, including mult64x64_k6 (which uses SSE2), on a
                  Pentium 4 3.06 GHz, Windows XP, Cygwin, and gcc-3.4.4:

                  #define ITERATOR 0x100000*32

                  #define _ASMx86_32
                  #define _MMX_REGS
                  #define _HW_BSFL

                  >gcc -march=pentium4 -O3 -ffast-math -funroll-loops -finline-functions -fomit-frame-pointer -o test1 test1.c
                  >test1
                  Routine mult64x64_k5 takes 51 seconds.
                  Routine mult64x64_k5h takes 35 seconds.
                  Routine mult64x64_new2 takes 119 seconds.
                  Routine mult64x64_mmx takes 366 seconds.
                  Routine mult64x64 takes 427 seconds.
                  Routine mult64x64_k6 takes 28 seconds.

                  ----------------8<----------------8<----------------8<----------------
                  void mult64x64_k6 (u64 *c, const u64 *a, const u64 *b)
                  {
                    m64 w[16][16];  /* 16 nibble tables of 16 entries each */
                    {
                  #if defined (_MMX_REGS) && defined (__SSE2__)
                      /* SSE2 path: build each table two 64-bit entries per
                         128-bit store, walking the entry pairs in Gray-code
                         order so every store costs one XOR. */
                      __m128i t, b1, b2, b3;
                      int i;
                      for (i = 0; i < 16; i++)
                        {
                          /* t = {0, b[4i]}: entries 0 and 1 in one register */
                          t  = _mm_shuffle_epi32 (_mm_movpi64_epi64 (*(m64 *)&b[i * 4    ]),
                                                  _MM_SHUFFLE (1, 0, 3, 2));
                          /* b1..b3: the other rows duplicated into both lanes */
                          b1 = _mm_shuffle_epi32 (_mm_movpi64_epi64 (*(m64 *)&b[i * 4 + 1]),
                                                  _MM_SHUFFLE (1, 0, 1, 0));
                          b2 = _mm_shuffle_epi32 (_mm_movpi64_epi64 (*(m64 *)&b[i * 4 + 2]),
                                                  _MM_SHUFFLE (1, 0, 1, 0));
                          b3 = _mm_shuffle_epi32 (_mm_movpi64_epi64 (*(m64 *)&b[i * 4 + 3]),
                                                  _MM_SHUFFLE (1, 0, 1, 0));
                          *(__m128i *)&w[i][ 0] = t;
                          *(__m128i *)&w[i][ 2] = t = _mm_xor_si128 (t, b1);
                          *(__m128i *)&w[i][ 6] = t = _mm_xor_si128 (t, b2);
                          *(__m128i *)&w[i][ 4] = t = _mm_xor_si128 (t, b1);
                          *(__m128i *)&w[i][12] = t = _mm_xor_si128 (t, b3);
                          *(__m128i *)&w[i][14] = t = _mm_xor_si128 (t, b1);
                          *(__m128i *)&w[i][10] = t = _mm_xor_si128 (t, b2);
                          *(__m128i *)&w[i][ 8] = _mm_xor_si128 (t, b1);
                        }
                  #else
                      /* Fallback: build the same tables 64 bits at a time as two
                         interleaved Gray-code chains (tl even entries, th odd). */
                      const m64 z = m_clear ();
                      m64 tl, th, b1, b2, b3;
                      int i;
                      for (i = 0; i < 16; i++)
                        {
                          w[i][ 0] = z;
                          w[i][ 1] = th = *(m64 *)&b[i * 4    ];
                          w[i][ 2] = tl = b1 = *(m64 *)&b[i * 4 + 1];
                          w[i][ 3] = th = var64xor (th, b1);
                          w[i][ 6] = tl = var64xor (tl, b2 = *(m64 *)&b[i * 4 + 2]);
                          w[i][ 7] = th = var64xor (th, b2);
                          w[i][ 4] = tl = var64xor (tl, b1);
                          w[i][ 5] = th = var64xor (th, b1);
                          w[i][12] = tl = var64xor (tl, b3 = *(m64 *)&b[i * 4 + 3]);
                          w[i][13] = th = var64xor (th, b3);
                          w[i][14] = tl = var64xor (tl, b1);
                          w[i][15] = th = var64xor (th, b1);
                          w[i][10] = tl = var64xor (tl, b2);
                          w[i][11] = th = var64xor (th, b2);
                          w[i][ 8] = var64xor (tl, b1);
                          w[i][ 9] = var64xor (th, b1);
                        }
                  #endif
                    }
                    {
                      /* Combine: each of the 16 nibbles of a[i] selects one table
                         entry; mem64xor XORs an entry from memory into t. */
                      u32 al, ah;
                      m64 t;
                      int i;
                      for (i = 0; i < 64; i++)
                        {
                          al = (u32) a[i];
                          ah = (u32)(a[i] >> 32);
                          t = w[ 0][ al        & 15];
                          t = mem64xor (t, &w[ 1][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 2][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 3][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 4][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 5][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 6][(al >>= 4) & 15]);
                          t = mem64xor (t, &w[ 7][ al >>  4      ]);
                          t = mem64xor (t, &w[ 8][ ah        & 15]);
                          t = mem64xor (t, &w[ 9][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[10][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[11][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[12][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[13][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[14][(ah >>= 4) & 15]);
                          t = mem64xor (t, &w[15][ ah >>  4      ]);
                          c[i] = (u64)t;
                        }
                    }
                    m_empty ();  /* leave MMX state (emms) */
                  }
                  ----------------8<----------------8<----------------8<----------------

                  Regards,
                  Makoto Kamada
                • ivan_seculab
Message 8 of 26, Aug 5, 2006
                    Hello KAMADA Makoto!

                    test1.c updated.

                    Please see my small bugfixes; if I am not right, please correct me.
                    Added 16-byte alignment (ALIGNED16) to all memory blocks, because
                    SSE2 requires it; a sketch of such a macro is below.
                    Corrected a type cast for MSC.
                    Changed:
                    var64xor to vv64xor
                    mem64xor to vm64xor
                    Added:
                    mm64xor (memory-to-memory XOR, reserved for future use).
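
                    A macro of roughly this shape covers both compilers (an
                    assumption for illustration; the exact macro now in test1.c
                    may differ):

                    ----------------8<----------------8<----------------8<----------------
                    /* Assumed shape of the 16-byte alignment macro; SSE2 loads
                       and stores through __m128i pointers require 16-byte-
                       aligned memory. */
                    #if defined(_MSC_VER)
                    #define ALIGNED16(decl) __declspec(align(16)) decl
                    #else
                    #define ALIGNED16(decl) decl __attribute__((aligned(16)))
                    #endif
                    ----------------8<----------------8<----------------8<----------------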

                    I think your implementation is final; I will now try to optimize
                    the other functions.

                    My benchmarks:
                    + CPU: Opteron 1.6, Linux Slackware 10.x, 64bit mode.
                    + GCC: 3.4.3
                    # _ASMx86_32 - Disabled
                    # _MMX_REGS - Enabled
                    # _HW_BSFL - Disabled
                    $> gcc -O9 -s -m64 -march=opteron -otest test1.c
                    $> ./test
                    Routine mult64x64_k6 takes 45 seconds.
                    Routine mult64x64_k5h takes 47 seconds.
                    Routine mult64x64_k5 takes 50 seconds.

                    # _ASMx86_32 - Disabled
                    # _MMX_REGS - Disabled
                    # _HW_BSFL - Disabled
                    $> gcc -O9 -s -m64 -march=opteron -otest test1.c
                    $> ./test
                    Routine mult64x64_k6 takes 43 seconds.
                    Routine mult64x64_k5h takes 52 seconds.
                    Routine mult64x64_k5 takes 50 seconds.

                    # _ASMx86_32 - Enabled
                    # _MMX_REGS - Enabled
                    # _HW_BSFL - Disabled
                    $> gcc -O9 -s -m32 -march=opteron -otest test1.c
                    $> ./test
                    Routine mult64x64_k6 takes 42 seconds.
                    Routine mult64x64_k5h takes 45 seconds.
                    Routine mult64x64_k5 takes 94 seconds.

                    + CPU: P4-Centrino 1.6, WinXP HE
                    + MSC: Version 14.00.50727.42 for 80x86
                    # _ASMx86_32 - Enabled
                    # _MMX_REGS - Enabled
                    # _HW_BSFL - Enabled
                    $> cl -Oxt -arch:SSE2 -GFLy test1.c
                    $> test.exe
                    Routine mult64x64_k6 takes 67 seconds.
                    Routine mult64x64_k5h takes 58 seconds.
                    Routine mult64x64_k5 takes 88 seconds.

                    # _ASMx86_32 - Enabled
                    # _MMX_REGS - Enabled
                    # _HW_BSFL - Enabled
                    $> cl -Oxt -GFLy test1.c
                    $> test.exe
                    Routine mult64x64_k6 takes 66 seconds.
                    Routine mult64x64_k5h takes 59 seconds.
                    Routine mult64x64_k5 takes 88 seconds.

                    + CPU: P4-Centrino 1.6, (Linux VMWare)
                    + GCC 3.3.6
                    # _ASMx86_32 - Enabled
                    # _MMX_REGS - Enabled
                    # _HW_BSFL - Enabled
                    $> gcc -O3 -march=pentium4 -ffast-math -funroll-loops -finline-functions -fomit-frame-pointer -otest test1.c
                    $> ./test
                    Routine mult64x64_k6 takes 58 seconds.
                    Routine mult64x64_k5h takes 53 seconds.
                    Routine mult64x64_k5 takes 109 seconds.


                    --
                    With best Regards, Ivan.