philosophical musing

  • Campbell, John
    Message 1 of 10, Oct 25, 2002
      Hi

      One way to look at RISC is that if you can compile to microcode, your
      program will run faster. So if the tools hide the intricacy of
      programming at that level, that's all you need.

      A CPU programmed in an FPGA is always going to be handicapped in clock
      speed relative to a conventional microprocessor. What's the best we can
      do currently? 50 MHz or so? Pretty dismal against 2 GHz for a current
      high-end Pentium.

      But what if instead of compiling to a pre-determined machine language,
      you generate a custom processor targeted at a single application? If
      the logic for large chunks of C code became the instructions of this
      single-use processor, the competitive tables might be turned.

      Thoughts anyone?
      -jc
    • ben franchuk
      Message 2 of 10, Oct 25, 2002
        Campbell, John wrote:
        > Hi
        >
        > One way to look at RISC is that if you can compile to microcode,
        > your program will run faster. So if the tools hide the intricacy
        > of programming at that level, that's all you need.
        >
        > A CPU programmed in an FPGA is always going to be handicapped in
        > clock speed relative to a conventional microprocessor. What's the
        > best we can do currently? 50 MHz or so? Pretty dismal against
        > 2 GHz for a current high-end Pentium.
        See Below.

        > But what if instead of compiling to a pre-determined machine
        > language, you generate a custom processor targeted at a single
        > application? If the logic for large chunks of C code became the
        > instructions of this single-use processor, the competitive tables
        > might be turned.
        Funny, I thought that was what CISC computers were about.
        The fact that RISC machines seem faster is because
        serial access of memory is faster than random access.
        I have yet to find out the REAL bus speed of my computer.
        A 2 GHz machine still accesses a bus at the same speed as
        a 0.5 GHz machine (ignoring cache access speed).
        A slower FPGA CPU using the same-speed bus could be faster
        if you have specialized instructions that are used more often.
        An FPGA is a way to evaluate CPU designs before you 'burn' your
        idea into silicon.
      • Jan Gray
        Message 3 of 10, Oct 25, 2002
          > A CPU programmed in an FPGA is always going to be handicapped
          > in clock speed relative to a conventional microprocessor.
          > What's the best we can do currently? 50 MHz or so? Pretty
          > dismal against 2 GHz for a current high-end Pentium.

          The high end Pentium 4 approaches 3 GHz now. The ALUs are double
          pumped, so each one can do up to 6 Gops. And there are multiple ALUs.
          In practice, you won't see anything like that. For a single cache miss
          that goes all the way out to main memory and is an open-page-miss in the
          DRAM, the latency could easily be 100 ns. That's 100000 ps / 333 ps =
          300 clock cycles or nearly a thousand potential issue slots. They don't
          call it the "memory wall" for nothing.

          The high end FPGA CPU is only ~150 MHz. But you can multiply
          instantiate them. I have an unfinished 16-bit design in 4x8 V-II CLBs
          that does about 167 MHz and includes a pipelined single-cycle
          multiply-accumulate. You can put 40 of them in a 2V1000 for a peak
          16-bit computation rate (never to exceed) of 333 Mops * 40 = ~12 Gops.
          In a monster 2VP100 or 2VP125 you're looking at up to 10X that -- over
          50 Gmacs (100 Gops). (Whether your problem can exploit that degree of
          parallelism, or whether the part can handle the power dissipation of
          such a design, I just don't know.)
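          A quick sanity check of the arithmetic above, sketched in Python
          (all numbers are this message's own estimates, not measurements):

```python
# Back-of-the-envelope check of the figures above.
CYCLE_PS = 333            # ~3 GHz Pentium 4 clock period, in picoseconds
MISS_NS = 100             # open-page DRAM miss latency estimate

miss_cycles = MISS_NS * 1000 // CYCLE_PS
print(miss_cycles)        # ~300 stalled cycles per miss; at ~3 issue
                          # slots per cycle, nearly a thousand slots

CORES = 40                # 16-bit FPGA CPU cores in a 2V1000
MOPS_PER_CORE = 333       # 167 MHz core doing a MAC (2 ops) per cycle
peak_gops = CORES * MOPS_PER_CORE / 1000
print(peak_gops)          # ~13 Gops peak, never to exceed
```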

          When the Pentium 4 goes to main memory, it takes 50-150 ns. When the
          FPGA CPU multiprocessor goes to main memory, it also takes 50-150 ns.
          If the problem doesn't fit in cache, the P4 does not look so good.

          Each P4 offers (with the help of a northbridge chipset) external
          bandwidth of 3.2 GB/s (64 bits at 100 MHz, quad-pumped). Each
          2V1000 offers external bandwidth of at least 8 GB/s (e.g. go
          configure yourself four 133-MHz 64-bit (~105-pin) DDR-DRAM
          channels).
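          The bandwidth comparison, spelled out the same way (the bus widths
          and clocks are the message's own numbers; the per-channel math is
          an assumption about how the 8 GB/s figure is reached):

```python
# P4 front-side bus: 64 bits wide, 100 MHz, quad-pumped.
p4_gbs = (64 // 8) * 100e6 * 4 / 1e9
print(p4_gbs)             # 3.2 GB/s

# FPGA: four 64-bit DDR-DRAM channels at 133 MHz (double data rate).
fpga_gbs = 4 * (64 // 8) * 133e6 * 2 / 1e9
print(fpga_gbs)           # ~8.5 GB/s, i.e. "at least 8 GB/s"
```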

          When the Pentium 4 mispredicts a branch, it takes many, many (up to ~20)
          cycles to recover. When the FPGA CPU core takes a branch (or not), it
          wastes 0 or 1 cycles. If you are spending cycles parsing text, the
          random nature of the data can eliminate many of the benefits of a
          deeeeeeeeeeeeeeeeeep pipeline.

          If I had to run Office, I'd rather have a P4.

          If I had to classify XML data on the wire at wire speed, I'd rather have
          an FPGA MPSoC or a mesh of same.

          I think most of you will enjoy this lecture:
          http://abp.lcs.mit.edu/6.823/lectures/lecture21.pdf.



          > But what if instead of compiling to a pre-determined machine
          > language, you generate a custom processor targeted at a
          > single application? If the logic for large chunks of C code
          > became the instructions of this single-use processor, the
          > competitive tables might be turned.

          This isn't strictly to the question, but a long time ago when we were
          all writing and tuning p-code interpreters, the question of instruction
          set compression came up. Hey, why not tune the p-code instruction set
          for the application? If this application uses particular constants a
          lot, or a lot of this kind of function call, or even
          multi-syllable-instructions like push0-push0-call, then you could encode
          those sorts of things more efficiently in a single one-byte opcode.

          Back in those days memory was king, believe me. If you didn't fit
          into the 60K or 100K or 200K budget, you were toast. Ever swapped
          in overlays from floppy disks?

          So anyway, you would have to look at total memory footprint. To the
          extent you optimized the instruction set to get the interpreted image
          down, you might unintentionally grow the p-code interpreter itself. At
          the very extreme would be a p-code interpreter optimized for running
          (your favorite application) with one 0-bit instruction -- "run app".
          However in that case the interpreter was just the application written in
          native code...
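          The superinstruction idea can be sketched in a few lines; the
          opcode names, the fused push0-push0-call encoding, and the toy
          program below are all invented for illustration:

```python
# Tiny p-code interpreter with one application-tuned superinstruction.
PUSH0, CALL, HALT, PUSH0_PUSH0_CALL = range(4)

def run(code, functions):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH0:                    # push the constant 0
            stack.append(0)
        elif op == CALL:                   # call functions[n] on two args
            fn = functions[code[pc]]; pc += 1
            b, a = stack.pop(), stack.pop()
            stack.append(fn(a, b))
        elif op == PUSH0_PUSH0_CALL:       # fused superinstruction:
            fn = functions[code[pc]]; pc += 1
            stack.append(fn(0, 0))         # push0, push0, call in one opcode
        elif op == HALT:
            return stack

fns = [lambda a, b: a + b]
plain = [PUSH0, PUSH0, CALL, 0, HALT]      # 5 bytes of p-code
tuned = [PUSH0_PUSH0_CALL, 0, HALT]        # 3 bytes: image shrinks, but the
                                           # interpreter grew by one case
```

          Same result either way; the tuned image is smaller at the cost of
          a slightly bigger interpreter, which is exactly the total-footprint
          tradeoff described above.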

          So the moral is to consider the total footprint of the thing to be
          hosted PLUS the footprint of the host itself. If speed is an issue,
          I think a scalar RISC is a good match for an FPGA. If speed is not
          an issue, a stack machine backed by BRAM is also a good match for
          an FPGA. If power is an issue ... .

          Jan Gray, Gray Research LLC
        • Campbell, John
          Message 4 of 10, Oct 25, 2002
            > > One way to look at RISC is that if you can compile to
            > > microcode, your program will run faster. So if the
            > > tools hide the intricacy of programming at that level,
            > > that's all you need.
            > Funny, I thought that was what CISC computers were about.

            In a way you're right, in the sense of more powerful
            instructions provided by CISC.

            > The fact that RISC machines seem faster is because
            > serial access of memory is faster than random access.

            Partly. Also because it takes more engineering time to
            optimize a larger circuit (CISC CPU). And because RISC
            exposes the "microcode" level to optimization during
            compilation. And because shorter clock cycles and smaller
            chunks of useful work make for more uniform pipeline
            stages.

            > A slower FPGA CPU using the same-speed bus could be faster
            > if you have specialized instructions that are used more often.

            A few months ago it was pointed out on this list that
            cascading operators can be significantly faster than
            interspersing latches between the operators, because
            the latch has to wait for all of the bits to settle.

            So I'm thinking that maybe whole chunks of code could be
            cascaded, input to output. One "instruction" of the
            single-purpose CPU could latch the end result of the
            cascade.

            The hard part would be slicing things up in uniform pieces
            so that you could pipeline it. Otherwise, even implemented
            in hardware, it would be slower. (Seems to me.)
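            One way to see that tradeoff is a toy timing model (every delay
            number and the register-overhead term below are made up for
            illustration): a long cascade pays the register overhead once,
            while a pipeline pays it per stage and its clock is set by the
            slowest slice.

```python
# Toy timing model of cascading vs. pipelining, in nanoseconds.
op_delays = [2.0, 3.0, 1.0, 2.0]   # combinational delay of each operator
REG_OVERHEAD = 1.5                 # register setup + clock-to-out per latch

# Fully cascaded: one long combinational path, latched once at the end.
cascade_latency = sum(op_delays) + REG_OVERHEAD            # 9.5 ns

# Fully pipelined: one register per operator; the clock period is the
# slowest stage plus overhead, and a result takes one period per stage.
period = max(op_delays) + REG_OVERHEAD                     # 4.5 ns
pipeline_latency = period * len(op_delays)                 # 18.0 ns
```

            The cascade finishes a single result sooner; the pipeline only
            wins on throughput, and only to the extent the work is sliced
            into uniform pieces.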
            -jc
          • Jason Watkins
            Message 5 of 10, Oct 26, 2002
              > A few months ago it was pointed out on this list that
              > cascading operators can be significantly faster than
              > interspersing latches between the operators, because
              > the latch has to wait for all of the bits to settle.

              Indeed... I've been thinking lately about how people see async
              logic as a frightening thing, but potentially useful for power
              consumption, reduced pin count, etc. It seems to me that it
              might be useful from a performance standpoint at reclaiming
              some performance lost to the unused clock window. After all,
              not everything on the critical path is going to be the same
              delay between latches, even if you're clocking as narrow as
              four fanout-of-4s. I don't know enough about implementation,
              especially of async logic, to do the math and figure out if it
              really does work that way, because the advantage would have to
              overcome the overhead of the handshake protocol.
            • Ben Franchuk
              Message 6 of 10, Oct 26, 2002
                Jason Watkins wrote:

                > Indeed... I've been thinking lately about how people see
                > async logic as a frightening thing, but potentially useful
                > for power consumption, reduced pin count, etc. It seems to
                > me that it might be useful from a performance standpoint at
                > reclaiming some performance lost to the unused clock
                > window. After all, not everything on the critical path is
                > going to be the same delay between latches, even if you're
                > clocking as narrow as four fanout-of-4s. I don't know
                > enough about implementation, especially of async logic, to
                > do the math and figure out if it really does work that way,
                > because the advantage would have to overcome the overhead
                > of the handshake protocol.

                I want big hungry power use :)
                Where is a TUBE logic FPGA when you need it?

                If I was designing a system I'd want constant power use
                rather than power ranging from small to big. Any guess what
                the peak power of a modern CPU is? 30 amps for a few
                picoseconds?

                The advantages of async logic, I think, are:
                A) you can run at TOP SPEED, whatever the current speed is
                at the moment. B) You have more error detection built in.
                From what little I have seen of modern CPUs, most of the die
                is memory rather than ALUs.
              • Eric Smith
                Message 7 of 10, Oct 28, 2002
                  > If I was designing a system I'd want constant power use
                  > rather than power ranging from small to big. Any guess
                  > what the peak power of a modern CPU is? 30 amps for a
                  > few picoseconds?

                  Not even close. Some of them draw 30 amps or more *continuously*.
                • Mike Butts
                  Message 8 of 10, Oct 28, 2002
                    My ISP is changing their setup so my xr16 in JHDL (xr16vx)
                    page will change. After Dec. 1 it's only at:
                    http://users.easystreet.com/mbutts/xr16vx_jhdl.html

                    Thanks everyone!

                    --Mike
                  • ben franchuk
                    Message 9 of 10, Oct 28, 2002
                      Eric Smith wrote:
                      >>If I was designing a system I'd want constant power
                      >>use rather than power ranging from small to big. Any
                      >>guess what the peak power of a modern CPU is? 30 amps
                      >>for a few picoseconds?
                      >
                      >
                      > Not even close. Some of them draw 30 amps or more *continuously*.

                      OK, I am off by a BIG amount. I had better stop looking
                      at Z80, 6502, and 6809 data sheets and go with
                      something newer. :)
                    • Ben Franchuk
                      Message 10 of 10, Jan 29, 2003
                        Campbell, John wrote:

                        > Partly. Also because it takes more engineering time to
                        > optimize a larger circuit (CISC CPU). And because RISC
                        > exposes the "microcode" level to optimization during
                        > compilation. And because shorter clock cycles and smaller
                        > chunks of useful work make for more uniform pipeline
                        > stages.
                        But has anybody designed an architecture that is easy
                        to build user instructions from? I am thinking of
                        something like plumbing, where you have a bunch of
                        pipes, corners, and tees, and you configure them
                        together to give a custom instruction for your needs.
                        This is not microcoding, but more like a bunch of
                        instruction parts that fit together.
                        Ben.