
VLIW processors anyone ?

  • rtfinch35
    Message 1 of 14 , Mar 11, 2002
      I want to do a VLIW processor in an FPGA, but I'm not confident what
      I'm doing. Does anyone have links to sample (educational) VLIW
      processor designs ? I've searched the net but can't seem to find
      anything.

      Current design: http://www.birdcomputer.ca/blackbird.htm

      Thanks
      Rob
    • Ben Franchuk
      Message 2 of 14 , Mar 11, 2002
        rtfinch35 wrote:
        >
        > I want to do a VLIW processor in an FPGA, but I'm not confident what
        > I'm doing. Does anyone have links to sample (educational) VLIW
        > processor designs ? I've searched the net but can't seem to find
        > anything.
        >
        > Current design: http://www.birdcomputer.ca/blackbird.htm

        Why not a micro-code designed cpu? Since internal memory is faster than
        external memory why not go back to CISC's style architecture rather than
        say a RISC emulator. The instruction set could be raw micro-code (like
        risc)
        or one of 3 user configured tables. A fourth table is used for MMU/OS
        stuff.
        Here would be a nice CPU to emulate.
        http://www.spies.com/~aek/alto/index.html
        --
        Ben Franchuk - Dawn * 12/24 bit cpu *
        www.jetnet.ab.ca/users/bfranchuk/index.html
      • Jan Gray
        Message 3 of 14 , Mar 11, 2002
          > I want to do a VLIW processor in an FPGA, but I'm not confident what
          > I'm doing. Does anyone have links to sample (educational) VLIW
          > processor designs ? I've searched the net but can't seem to find
          > anything.

          INTRODUCTION TO VLIW

          VLIW ::= very long instruction word: a machine with one instruction
          pointer (how's that, Reinoud? :-) ) that selects a long instruction word
          that issues a plurality of operations to a plurality of function units
          each cycle. In a VLIW system, the compiler schedules the multiple
          operations into instructions; in contrast, in a superscalar RISC, the
          instruction issue hardware schedules one or more scalar instructions to
          issue at the same time. You may also see the term LIW vs. VLIW. I
          think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine.
          In some VLIWs, to improve code density, a variable number of operations
          can be issued each cycle.

          There's a rich literature on VLIWs (that I don't purport to have read),
          but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW
          Architectures.

          The IA64 EPIC "explicitly parallel instruction computing" architecture
          is the most notable VLIW derivative. Several DSPs like the TI C6x are
          LIWs. A famous early VLIW was the Multiflow Trace family, a
          supercomputer startup that lived approximately 1984-1990.

          The challenging part of a VLIW project is the compiler. Unless you're a
          glutton for tricky assembly programming, or have a very small and
          specific problem, it's hardly worth designing the hardware if you don't
          have a way to compile code to use the parallelism presented by the
          hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow
          you to pour C code through a compiler and simulator and configure the
          number and kind of function units to best trade off performance and area
          against the specific computation kernel you care about.



          Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.

          To summarize my opinions: a 2-issue machine is straightforward and
          even worthwhile, whereas (depending upon computation kernel) a 4+
          issue machine may well bog down in its multiplexers and therefore
          may not make the best use of FPGA resources.

          INSTRUCTION FETCH AND ISSUE

          No sweat. Store instructions in a block RAM based instruction cache.
          Each BRAM in Virtex derived architectures can read 2x16=32 bits each
          cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4
          BRAMs and you can easily fetch 100-200 bits of instruction word each
          cycle.

          Keeping the I-cache fed is another matter, left as an exercise.

          REGISTER FILE DESIGN

          The key design problem is the register file organization. If you are
          going to issue n operations each cycle, you must fetch 1-2*n operands
          and retire up to n results each cycle. That implies a 1-2*n-read
          n-write per cycle register file.

          As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU
          operation requires about T ns, then you can read or write to a single
          port LUT RAM (or read AND write to a dual port LUT RAM) in about T/2 ns.
          (In Virtex-II, T is about 5 ns).

          Let's discuss an n=4 operation machine. It is a challenge to retire 4
          results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons
          to 2T, you can sequentially retire all four results in 4*T/2 ns, and
          read operands in parallel using dual port LUT RAM.

          Thus in a fully general organization, where any of the four results can
          retire to any four registers, "it can't be done" in T ns as defined
          above.
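The arithmetic behind that claim is worth spelling out (a sketch using the rule of thumb above, with T assumed to be 5 ns as on Virtex-II):

```python
# Timing arithmetic from the rule of thumb above (T = 5 ns assumed, Virtex-II).
T = 5.0
lut_ram_access = T / 2             # one LUT RAM read or write takes ~T/2
retire_4_sequential = 4 * lut_ram_access
print(retire_4_sequential)         # 10.0 ns = 2T: the cycle time balloons to 2T
```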

          Instead, consider a partitioned register file design. For instance,
          imagine a 64-register machine partitioned into 4 banks of 16 registers
          each. Each cycle, one result can be retired into each bank. That is
          easily targeted to a LUT RAM implementation. Indeed, you can fix the
          machine such that each issue slot is associated with one write port on
          the register file.

          Say bank m is assigned registers i, i div 16 == m, with 0<=i<64, 0<=m<4.
          We can easily issue four independent instructions such as
          r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50
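The write-port rule can be made concrete with a toy Python check (a sketch; the bank(i) = i // 16 numbering follows the 4x16 partitioning above, not any real machine):

```python
# Toy model of the partitioned write-port rule: 64 registers,
# 4 banks of 16, one write port per bank (assumed numbering).

def bank(reg):
    return reg // 16

def conflict_free(dests):
    # An instruction word may retire at most one result into each bank.
    banks = [bank(d) for d in dests]
    return len(set(banks)) == len(banks)

print(conflict_free([0, 16, 32, 48]))  # True: one write into each bank
print(conflict_free([0, 1, 32, 48]))   # False: r0 and r1 both live in bank 0
```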

          One $64 design question, properly answerable only with a good compiler
          at your side, is how to then arrange the read ports. At one extreme, if
          each of the four banks is entirely isolated from the other, (like a
          left-brain/right-brain patient with their corpus callosum cut), then you
          only need 2 read ports on each bank. In this organization, if you need
          a register or two from another bank, you would typically burn an issue
          slot in that sourcing bank to read the registers, and maybe burn another
          slot in the destination bank to save the registers. (Alternately in the
          destination bank you can directly mux it into the dependent operation
          register).

          At the other extreme, if any of the four banks can with full generality
          read operands from any of the other banks, e.g.

          r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8 // cycle 1
          r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24; // cycle 2
          r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40; // cycle 3
          r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56 // cycle 4

          Then each of the four banks would need to provide 8 read ports, and each
          of the 8 operand multiplexers (in front of the four ALUs) would need at
          least a 4-1 mux.
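A quick way to see the port cost is to count, per cycle, how many operands each bank must supply in the worst case (a Python sketch; the (dest, src1, src2) tuples are an assumed encoding, not a real instruction format):

```python
# Count read ports each bank must provide for one issue group
# (a sketch; 64 regs in 4 banks of 16, assumed numbering).

def bank(reg):
    return reg // 16

def reads_per_bank(slots):
    # slots: one (dest, src1, src2) tuple per issue slot.
    counts = [0, 0, 0, 0]
    for _dest, s1, s2 in slots:
        counts[bank(s1)] += 1
        counts[bank(s2)] += 1
    return counts

# Cycle 1 above: r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8
cycle1 = [(0, 1, 2), (16, 3, 4), (32, 5, 6), (48, 7, 8)]
print(reads_per_bank(cycle1))  # [8, 0, 0, 0]: bank 0 must supply all 8 operands
```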

          You can of course provide all these ports by replicating or
          multi-cycling the register file LUT RAMs, but it won't be pretty.

          In practice, I believe that the expected degree of cross-bank register
          reading is limited. Maybe you typically only need 3 read ports per
          bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other
          slots' accesses. But you have to have a scheduling compiler to help you
          make these tradeoffs.

          By the way, in my 1994 brain dump
          (http://fpgacpu.org/usenet/homebrew.html), with the sketch of the NVLIW1
          2-issue LIW, I used exactly this partitioned register file technique,
          combining two banks of 3r-1w register files. For each issue slot, one
          operand was required to be a register in that bank; but the other
          operand could be a register in either bank.

          Another option is to combine multibanked registers with some heavily
          multiported registers. For instance, assume each issue slot can read or
          write 16 bank registers and 16 shared registers, subject across issue
          slots to some maximum number of shared register file read and write port
          accesses each cycle. The designers of the Sun MAJC used a similar
          approach.

          EXECUTION UNIT

          Assume a design with 4 ALUs, one branch unit, one memory port.

          Loads/stores: Perhaps you limit generality and only allow loads/stores
          to issue from a given slot (and then shuttle those results to other
          slots' register banks using the above multiported reg file
          description). The general alternative, allowing any slot to issue
          loads/stores, requires you to mux any ALU output to be the effective
          address, and a per-bank mux to allow the MDR (memory data register)
          input to be the result.

          Result forwarding (register file bypass): if an operation in slot i,
          cycle 1, is permitted with full generality, to consume the result of an
          operation from slot j, cycle 0, then you need n copies of an n:1 result
          forwarding mux. Again, this is very expensive, so you will be sorely
          tempted to reduce the generality of result forwarding, or eliminate it
          entirely. Again, this is a design tradeoff your compiler must help you
          select.

          CONTROL FLOW

          You want to reduce the number of branches. In average C code you can
          expect a branch every 5 (rule of thumb) generic RISC instructions or so.
          In a 4 issue machine you will spend your life branching.

          Instead, you will want to use some form of predicated execution. Some
          computations will set predicate bits that represent the outcomes of
          conditional tests. Then other instructions' execution (result
          write-back and other side effects) will be predicated upon these
          specified predicate bits.

          In this organization, it seems reasonable to allow any issue slot to
          establish predicates that can be used by any other issue slot; but for
          simplicity you will only need and want one, or perhaps two, of the issue
          slots to be able to issue (predicated) branch and jump instructions.

          You will want a compiler to help you design the specific details of the
          predication system.
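The predicate mechanism can be sketched in a few lines of Python (an illustrative model only; the predicate registers and operation names are assumptions, not any real ISA):

```python
# Minimal predicated-execution model (a sketch, not a real ISA):
# a compare sets a predicate bit; a later op commits its write-back
# only if its guard predicate is true, so both arms of an if/else
# can issue in the same long instruction without branching.

regs = {i: 0 for i in range(8)}
preds = {0: True, 1: False}      # p0 is hard-wired true (assumed convention)

def cmp_lt(p, ra, rb):
    preds[p] = regs[ra] < regs[rb]

def add(guard, rd, ra, rb):
    if preds[guard]:             # write-back predicated on the guard bit
        regs[rd] = regs[ra] + regs[rb]

regs[1], regs[2] = 3, 5
cmp_lt(1, 1, 2)          # p1 = (r1 < r2) = True
add(1, 3, 1, 2)          # commits: r3 = 3 + 5 = 8
cmp_lt(1, 2, 1)          # p1 = (r2 < r1) = False
add(1, 4, 1, 2)          # squashed: r4 stays 0
print(regs[3], regs[4])  # 8 0
```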

          IN SUMMARY

          The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW
          in an FPGA. That's just stamping out more instances of the i-cache +
          execution unit core slice. Whether the resulting machine runs much
          faster than a much simpler 2-issue LIW is critically dependent upon your
          compiler technology.


          ASIDE: WM

          As an aside, I'll just briefly mention Wulf's WM machine
          (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
          many interesting features, but of particular interest here, each execute
          instruction is of the form
          rd = (rs1 op rs2) op rs3 ,
          an explicitly parallel representation that affords some additional
          parallelism over the canonical scalar RISC, but which does not increase
          the number of write ports required on the register file.
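The two-ops-one-write property is easy to see in a toy model (a Python sketch; the operator set and calling convention are assumptions for illustration):

```python
import operator

# Sketch of the WM execute form rd = (rs1 op1 rs2) op2 rs3:
# two ALU operations per instruction, but only one register write.

OPS = {'+': operator.add, '-': operator.sub, '&': operator.and_}

def wm_execute(regs, rd, rs1, op1, rs2, op2, rs3):
    # One write port used per instruction, despite two operations.
    regs[rd] = OPS[op2](OPS[op1](regs[rs1], regs[rs2]), regs[rs3])

r = [0, 2, 3, 4]
wm_execute(r, 0, 1, '+', 2, '-', 3)   # r0 = (r1 + r2) - r3 = (2 + 3) - 4
print(r[0])  # 1
```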

          Jan Gray
          Gray Research LLC
        • Jan Gray
          Message 4 of 14 , Mar 11, 2002
            > Current design: http://www.birdcomputer.ca/blackbird.htm

            I see this design reflects many of the issues in my last message. I
            should read first, write second!

            Jan Gray
            Gray Research LLC
          • rtfinch35
            Message 5 of 14 , Mar 11, 2002
              > The challenging part of a VLIW project is the compiler. Unless you're a
              > glutton for tricky assembly programming, or have a very small and
              > specific problem, it's hardly worth designing the hardware if you don't

              I'm planning on using the Transmeta approach. I can write a simple
              virtual instruction set to real instruction set translator. At first
              I'm aiming for a direct 1:1 translation that won't provide any
              speedup (in fact it'll slow things down because of the wider
              instruction word). As the translator gets better, performance will
              improve. (Perhaps I could use the xr16 ISA as the virtual instruction
              set? :)

              >
              > Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.
              >
              > To summarize my opinions: a 2-issue machine is straightforward and
              > even worthwhile, whereas (depending upon computation kernel) a 4+
              > issue machine may well bog down in its multiplexers and therefore
              > may not make the best use of FPGA resources.

              I quickly put together a datapath and it looks like the datapath
              takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
              course this is without all the other extras needed to make it work,
              and assuming I got it right (a pretty big assumption). The thing that
              appears to slow it down is actually the barrel shifter which I may
              get rid of by limiting shifts to 0 to 3 bits at once. I think I can
              get a peak instruction completion rate of five instructions per cycle
              (I admit a little contrived) but it makes for great marketing. (eg.
              1) a compare (target = condition codes), 2) a store (no target reg),
              3) an effective address calc (target = special addr reg) 4) a branch,
              5) some alu operation 6) maybe ? Completing five instructions per
              cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
              can't I ?


              > INSTRUCTION FETCH AND ISSUE
              >
              > No sweat. Store instructions in a block RAM based instruction cache.

              Planning on using a 96 bit (or possibly 128 bit) wide i-cache, with a
              32 bit memory interface.


              > and retire up to n results each cycle. That implies a 1-2*n-read
              > n-write per cycle register file.

              Um, I think you need about half as many write ports because only
              about 50% of instructions actually update registers.


              > CONTROL FLOW
              >
              > Instead, you will want to use some form of predicated execution.

              I was thinking of using branch prediction.

              > ASIDE: WM
              >
              > As an aside, I'll just briefly mention Wulf's WM machine
              > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with

              I tried to access some of the docs but the links seem to be broken :(

              > instruction is of the form
              > rd = (rs1 op rs2) op rs3 ,

              If the operations are stacked like this, wouldn't it increase the
              cycle time of the processor ?

              Rob
            • Jan Gray
              Message 6 of 14 , Mar 12, 2002
                > I'm planning on using the Transmeta approach. I can write a simple
                > virtual instruction set to real instruction set translator. At first
                > I'm aiming for a direct 1:1 translation that won't provide any
                > speedup (in fact it'll slow things down because of the wider
                > instruction word). As the translator gets better, performance will
                > improve. (Perhaps I could use the xr16 ISA as the virtual instruction
                > set? :)

                Be our guest...

                > I quickly put together a datapath and it looks like the datapath
                > takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
                > course this is without all the other extras needed to make it work,
                > and assuming I got it right (a pretty big assumption). The thing that
                > appears to slow it down is actually the barrel shifter which I may
                > get rid of by limiting shifts to 0 to 3 bits at once. I think I can
                > get a peak instruction completion rate of five instructions per cycle
                > (I admit a little contrived) but it makes for great marketing. (eg.
                > 1) a compare (target = condition codes), 2) a store (no target reg),
                > 3) an effective address calc (target = special addr reg) 4) a branch,
                > 5) some alu operation 6) maybe ? Completing five instructions per
                > cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
                > can't I ?

                What matters:
                * wall clock time
                * total energy consumed by the computation
                * FPGA area
                * dev tools ease of use.

                What doesn't matter:
                * aggregate MIPS

                :-) See e.g. http://fpgacpu.org/usenet/array.html for a (useless) 10^13
                ops/s machine.

                > > and retire up to n results each cycle. That implies a 1-2*n-read
                > > n-write per cycle register file.

                > Um, I think you need about half as many write ports because only
                > about 50% of instructions actually update registers.

                Certainly, but it depends upon whether and how you handle the worst
                cases, whether you have multiplexers between ALUs and write port(s),
                etc.

                > > CONTROL FLOW
                > >
                > > Instead, you will want to use some form of predicated execution.
                >
                > I was thinking of using branch prediction.

                That will help. So you only fetch useless instructions on a mispredict?

                > > ASIDE: WM
                > >
                > > As an aside, I'll just briefly mention Wulf's WM machine
                > > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
                >
                > I tried to access some of the docs but the links seem to be broken :(

                Hmm. I have the ACM SIGARCH paper on it around here somewhere. The two
                big architectural ideas are the two ops per instruction and the implicit
                streaming memory ports exposed as special purpose registers.

                > > instruction is of the form
                > > rd = (rs1 op rs2) op rs3 ,
                >
                > If the operations are stacked like this, wouldn't it increase the
                > cycle time of the processor ?

                No and no.

                First, IIRC, WM does the second op in a second pipeline stage.

                Second, as I first learned during the Sun SuperSPARC presentation at Hot
                Chips III (1991) (I think it was), the delay through two ALUs (a+b)+c is
                not necessarily the same as two times the delay through one ALU (a+b),
                since there is concurrency -- lower bits of (a+b) can be added to lower
                bits of c even as upper bits of (a+b) are still ripple-carry-settling.
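That overlap can be seen in a unit-delay toy model (a Python sketch; it charges one time unit per full-adder stage and ignores the much faster dedicated carry chain, so the numbers are illustrative only):

```python
# Unit-delay model of chained ripple-carry adders (a sketch: each
# full-adder stage costs 1 unit; real FPGA carry chains are faster).

W = 32

def ripple(ready_x, ready_y):
    # Settle time of each sum bit of x + y, given per-bit input settle times.
    out, carry = [], 0
    for i in range(W):
        t = max(ready_x[i], ready_y[i], carry) + 1
        out.append(t)
        carry = t
    return out

ab = ripple([0] * W, [0] * W)   # a + b: inputs all ready at t = 0
abc = ripple(ab, [0] * W)       # (a + b) + c: second adder consumes sum bits
print(ab[-1], abc[-1])  # 32 vs 33: nowhere near 2x, because the carries overlap
```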

                <Pause while I go and run some tests.>

                OK, here are two quick and dirty test circuits, sans floorplanning:

                module onealu(clk, i, o);
                  input clk, i;
                  output o;
                  reg o;

                  reg [31:0] a, b, sum;
                  reg z;

                  always @(posedge clk) begin
                    {a,b} <= {a,b,i};      // shift i in so a, b aren't optimized away
                    sum <= a+b;
                    z <= sum == 0;
                    o <= z;
                  end
                endmodule


                module twoalus(clk, i, o);
                  input clk, i;
                  output o;
                  reg o;

                  reg [31:0] a, b, c, sum;
                  reg z;

                  always @(posedge clk) begin
                    {a,b,c} <= {a,b,c,i};  // same trick, three operands
                    sum <= (a+b)+c;
                    z <= sum == 0;
                    o <= z;
                  end
                endmodule

                'onealu' approximates the delay from pipeline registers, through an
                adder, and back into pipeline registers.
                'twoalus' approximates the same thing, but through two adders.

                Here are the TRCE numbers for Virtex-II -5:

                onealu:
                Slack: -0.237ns (requirement - (data path - negative clock skew))
                Source: b[0]
                Destination: sum[31]
                Requirement: 4.500ns
                Data Path Delay: 4.737ns (Levels of Logic = 18)
                Negative Clock Skew: 0.000ns
                Source Clock: clk_c rising at 0.000ns
                Destination Clock: clk_c rising at 4.500ns
                Timing Improvement Wizard
                Data Path: b[0] to sum[31]
                Location          Delay type  Delay(ns)  Physical Resource
                                                         Logical Resource(s)
                --------------------------------------------------------------
                L3.IQ1 Tiockiq 0.598 i
                b[0]
                SLICE_X0Y0.F1 net (fanout=2) 0.516 b[0]
                SLICE_X0Y0.COUT Topcyf 0.755 sum[0]
                sum_1_axb_0
                sum_1_cry_0
                sum_1_cry_1
                SLICE_X0Y1.CIN net (fanout=1) 0.000 sum_1_cry_1/O
                SLICE_X0Y1.COUT Tbyp 0.092 sum[2]
                SLICE_X0Y2.CIN net (fanout=1) 0.000 sum_1_cry_3/O
                SLICE_X0Y2.COUT Tbyp 0.092 sum[4]
                SLICE_X0Y3.CIN net (fanout=1) 0.000 sum_1_cry_5/O
                SLICE_X0Y3.COUT Tbyp 0.092 sum[6]
                SLICE_X0Y4.CIN net (fanout=1) 0.000 sum_1_cry_7/O
                SLICE_X0Y4.COUT Tbyp 0.092 sum[8]
                SLICE_X0Y5.CIN net (fanout=1) 0.000 sum_1_cry_9/O
                SLICE_X0Y5.COUT Tbyp 0.092 sum[10]
                SLICE_X0Y6.CIN net (fanout=1) 0.000 sum_1_cry_11/O
                SLICE_X0Y6.COUT Tbyp 0.092 sum[12]
                SLICE_X0Y7.CIN net (fanout=1) 0.000 sum_1_cry_13/O
                SLICE_X0Y7.COUT Tbyp 0.092 sum[14]
                SLICE_X0Y8.CIN net (fanout=1) 0.000 sum_1_cry_15/O
                SLICE_X0Y8.COUT Tbyp 0.092 sum[16]
                SLICE_X0Y9.CIN net (fanout=1) 0.000 sum_1_cry_17/O
                SLICE_X0Y9.COUT Tbyp 0.092 sum[18]
                SLICE_X0Y10.CIN net (fanout=1) 0.000 sum_1_cry_19/O
                SLICE_X0Y10.COUT Tbyp 0.092 sum[20]
                SLICE_X0Y11.CIN net (fanout=1) 0.000 sum_1_cry_21/O
                SLICE_X0Y11.COUT Tbyp 0.092 sum[22]
                SLICE_X0Y12.CIN net (fanout=1) 0.000 sum_1_cry_23/O
                SLICE_X0Y12.COUT Tbyp 0.092 sum[24]
                SLICE_X0Y13.CIN net (fanout=1) 0.000 sum_1_cry_25/O
                SLICE_X0Y13.COUT Tbyp 0.092 sum[26]
                SLICE_X0Y14.CIN net (fanout=1) 0.000 sum_1_cry_27/O
                SLICE_X0Y14.COUT Tbyp 0.092 sum[28]
                SLICE_X0Y15.CIN net (fanout=1) 0.000 sum_1_cry_29/O
                SLICE_X0Y15.Y Tciny 1.257 sum[30]
                sum_1_cry_30
                sum_1_s_31
                SLICE_X0Y15.DY net (fanout=1) 0.001 sum_1[31]
                SLICE_X0Y15.CLK Tdyck 0.322 sum[30]
                sum[31]
                --------------------------------------------------------------
                Total                         4.737ns (4.220ns logic, 0.517ns route)
                                                      (89.1% logic, 10.9% route)

                twoalus:
                Destination: sum[31]
                Requirement: 6.000ns
                Data Path Delay: 6.773ns (Levels of Logic = 18)
                Negative Clock Skew: 0.000ns
                Source Clock: clk_c rising at 0.000ns
                Destination Clock: clk_c rising at 6.000ns
                Timing Improvement Wizard
                Data Path: a[2] to sum[31]
                Location          Delay type  Delay(ns)  Physical Resource
                                                         Logical Resource(s)
                --------------------------------------------------------------
                SLICE_X11Y0.YQ Tcko 0.493 a[3]
                a[2]
                SLICE_X8Y1.F1 net (fanout=2) 0.749 a[2]
                SLICE_X8Y1.COUT Topcyf 0.668 sum_1_0[2]
                sum_1_0_axb_2
                sum_1_0_cry_2
                sum_1_0_cry_3
                SLICE_X8Y2.CIN net (fanout=1) 0.000 sum_1_0_cry_3/O
                SLICE_X8Y2.COUT Tbyp 0.092 sum_1_0[4]
                SLICE_X8Y3.CIN net (fanout=1) 0.000 sum_1_0_cry_5/O
                SLICE_X8Y3.COUT Tbyp 0.092 sum_1_0[6]
                SLICE_X8Y4.CIN net (fanout=1) 0.000 sum_1_0_cry_7/O
                SLICE_X8Y4.COUT Tbyp 0.092 sum_1_0[8]
                SLICE_X8Y5.CIN net (fanout=1) 0.000 sum_1_0_cry_9/O
                SLICE_X8Y5.COUT Tbyp 0.092 sum_1_0[10]
                SLICE_X8Y6.CIN net (fanout=1) 0.000 sum_1_0_cry_11/O
                SLICE_X8Y6.COUT Tbyp 0.092 sum_1_0[12]
                SLICE_X8Y7.CIN net (fanout=1) 0.000 sum_1_0_cry_13/O
                SLICE_X8Y7.COUT Tbyp 0.092 sum_1_0[14]
                SLICE_X8Y8.CIN net (fanout=1) 0.000 sum_1_0_cry_15/O
                SLICE_X8Y8.COUT Tbyp 0.092 sum_1_0[16]
                SLICE_X8Y9.CIN net (fanout=1) 0.000 sum_1_0_cry_17/O
                SLICE_X8Y9.COUT Tbyp 0.092 sum_1_0[18]
                SLICE_X8Y10.CIN net (fanout=1) 0.000 sum_1_0_cry_19/O
                SLICE_X8Y10.COUT Tbyp 0.092 sum_1_0[20]
                SLICE_X8Y11.CIN net (fanout=1) 0.000 sum_1_0_cry_21/O
                SLICE_X8Y11.COUT Tbyp 0.092 sum_1_0[22]
                SLICE_X8Y12.CIN net (fanout=1) 0.000 sum_1_0_cry_23/O
                SLICE_X8Y12.COUT Tbyp 0.092 sum_1_0[24]
                SLICE_X8Y13.CIN net (fanout=1) 0.000 sum_1_0_cry_25/O
                SLICE_X8Y13.COUT Tbyp 0.092 sum_1_0[26]
                SLICE_X8Y14.CIN net (fanout=1) 0.000 sum_1_0_cry_27/O
                SLICE_X8Y14.Y Tciny 1.257 sum_1_0[28]
                sum_1_0_cry_28
                sum_1_0_s_29
                SLICE_X9Y14.G2 net (fanout=1) 0.403 sum_1_0[29]
                SLICE_X9Y14.COUT Topcyg 0.519 sum[28]
                sum_1_1_axb_29
                sum_1_1_cry_29
                SLICE_X9Y15.CIN net (fanout=1) 0.000 sum_1_1_cry_29/O
                SLICE_X9Y15.Y Tciny 1.257 sum[30]
                sum_1_1_cry_30
                sum_1_1_s_31
                SLICE_X9Y15.DY net (fanout=1) 0.001 sum_1[31]
                SLICE_X9Y15.CLK Tdyck 0.322 sum[30]
                sum[31]
                --------------------------------------------------------------
                Total                         6.773ns (5.620ns logic, 1.153ns route)
                                                      (83.0% logic, 17.0% route)


                4.7 ns for one ALU in one pipeline stage
                6.8 ns for two ALUs in one pipeline stage

                Hmm, that's worse than I had expected. Looking above, the expensive bit
                is the extra Topcyg and especially Tciny of the second adder. All we
                are saving with the concurrency is the 15*Tbyp of about 1.4 ns.
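A quick arithmetic check of those figures (values copied from the TRCE reports above):

```python
# Sanity-check the delay bookkeeping quoted from the TRCE reports.
Tbyp = 0.092                       # per 2-bit carry clump, from the reports
onealu, twoalus = 4.737, 6.773     # data path delays, ns

print(round(15 * Tbyp, 2))         # 1.38 ns: the "about 1.4 ns" saved by overlap
print(round(twoalus - onealu, 2))  # 2.04 ns: the cost of the second adder
```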

                Jan Gray, Gray Research LLC
              • Campbell, John
                Message 7 of 14 , Mar 12, 2002
                  Interesting.
                  But why do both onealu and twoalus have 18 levels of
                  logic?


                  > Data Path Delay: 4.737ns (Levels of Logic = 18)

                  > Data Path Delay: 6.773ns (Levels of Logic = 18)

                  I would have expected (probably naively) them to be 32
                  and 64, respectively. So what gives?
                  -jc
                • Jan Gray
                  Message 8 of 14 , Mar 12, 2002
                    > Interesting.
                    > But why do both onealu and twoalus have 18 levels of
                    > logic?
                    >
                    >
                    > > Data Path Delay: 4.737ns (Levels of Logic = 18)
                    >
                    > > Data Path Delay: 6.773ns (Levels of Logic = 18)
                    >
                    > I would have expected (probably naively) them to be 32
                    > and 64, respectively. So what gives?
                    > -jc

                    Ah, that illustrates my point.

                    First of all, they're not 32/64, because they're not strictly "levels of
                    logic" but more precisely "levels of clumps of primitives": the Tbyp is
                    per 2-bit clump (an adder slice), so your question might equivalently
                    be: why not 15 Tbyp's for one adder, and 30 Tbyp's for the 2-adder
                    version?

                    Second, contemplate that there is no path from the MSB (a+b)[31] to the
                    LSB of ((a+b)+c)[0]. Contemplate that ((a+b)+c)[15] depends upon
                    (a+b)[15] and c[15] and therefore upon a[15:0] and b[15:0], but not
                    upon (a+b)[31:16] nor upon a[31:16] nor b[31:16].

                    To put it prosaically, carries ripple up the first adder, sums shoot
                    over to the second adder, and then ripple up there some more. But
                    carries do NOT ripple up the first adder, leave it in sum bits, and
                    then somehow enter the second adder at a lower bit index than the
                    carry left the first adder.

                    So if you consider the MSB of the second adder, ((a+b)+c)[31], it is
                    indeed dependent upon all of a[31:0], b[31:0], c[31:0] but the longest
                    path from any of those registers to ((a+b)+c)[31] is only 15 or so
                    Tbyp's.

                    (If all this doesn't help, draw a picture of two side by side adders and
                    observe how the carry and sum bits propagate only upwards and over, and
                    since the two adder structure is never taller than 16 slices tall, there
                    are at most 15 Tbyp delays even across the two adders.)
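[Editor's note: the depth argument above can be checked with a toy model. This sketch is mine, not from the post; it assumes a plain ripple model of one level per carry stage and one for the final sum XOR, ignoring Xilinx's 2-bit Tbyp clumping, so the absolute numbers differ from TRCE's but the scaling is the point: chaining a second adder adds roughly one level, not a second full carry chain.]

```python
# Levels-of-logic model for (a+b)+c vs a+b, 32 bits wide.
N = 32

# First adder: carry1[i] is ready after ~i levels; each sum bit one level later.
carry1 = [0] * (N + 1)
for i in range(N):
    carry1[i + 1] = carry1[i] + 1          # each carry stage adds one level
sum1 = [carry1[i] + 1 for i in range(N)]   # sum = a ^ b ^ carry, one more level

# Second adder: its carry chain consumes sum1[] bits as they become ready,
# so carries never re-enter at a lower bit index than they left the first adder.
carry2 = [0] * (N + 1)
for i in range(N):
    carry2[i + 1] = max(sum1[i], carry2[i]) + 1
sum2 = [max(sum1[i], carry2[i]) + 1 for i in range(N)]

depth_one = max(sum1)    # critical path of a+b alone
depth_two = max(sum2)    # critical path of (a+b)+c
print(depth_one, depth_two)                # -> 32 33
```

One adder costs ~32 levels in this model; chaining the second costs one more level, not 32 more, which is why TRCE reports nearly identical depth for both circuits.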

                    Third, note that TRCE output is sorted by delay and not by levels of
                    logic. If there was a circuit with 30+ "levels of logic", such as a
                    64-bit adder, that was faster, it would not be considered the critical
                    path and would not be reported by TRCE (if you perform the "report
                    timing violation paths" filter).

                    Jan Gray, Gray Research LLC
                  • Perez Ramas, Javier Basilio
                    Message 9 of 14 , Mar 14, 2002
                      Hi,

                      > -----Original Message-----
                      > From: Alessandro Noriaki Ide [mailto:noriaki@...]
                      > Sent: Thursday, March 14, 2002 14:26
                      > To: fpga-cpu@yahoogroups.com
                      > Subject: [fpga-cpu] news
                      >
                      >
                      >
                      > Hi people,
                      >
                      >
                      > Does anybody know any site, continuously up to
                      > date,containing news about
                      > FPGAs and reconfigurable computing ?
                      >
                      >

                      Try this:

                      http://www.optimagic.com



                      Bye,
                      Jaba
                    • Alessandro Noriaki Ide
                      Message 10 of 14 , Mar 14, 2002
                        Hi people,


                        Does anybody know any site, continuously up to date, containing news about
                        FPGAs and reconfigurable computing ?


                        thanks

                        =========================================

                        Alessandro Noriaki Ide
                        Mestrado em Ciencia da Computacao
                        Universidade Federal de Sao Carlos
                        http://www.dc.ufscar.br/~noriaki
                        e-mail: noriaki@...
                        uin: 46269985

                        ==========================================
                      • Reinoud
                        Message 11 of 14 , Mar 17, 2002
                          Jan,

                          Excellent writeup; here's too late and too short a reply...

                          > The challenging part of a VLIW project is the compiler.

                          ...and making compiler and hardware work together to keep hardware
                          bottlenecks like bypass (i.e. forwarding) network and register
                          file port scaling in check (for >4 ILP).

                          > To summarize my opinions, is a 2-issue machine is straightforward and
                          > even worthwhile, whereas (depending upon computation kernel) a 4+ issue
                          > machine may well bog down in its multiplexers and therefore may not make
                          > the best use of FPGA resources.

                          There are several ways around this. One obvious approach is
                          clustering (a cluster is basically a small VLIW section with full
                          bypassing and register ports, that is only loosely coupled with
                          other sections). Another less obvious approach is to explicitly
                          program the bypass busses (which allows for a huge reduction in
                          bypass _and_ register file port cost!). Hint, see:

                          http://ce-serv.et.tudelft.nl/MOVE/

                          (Disclaimer: that's where I work, although on an interesting and
                          mostly unpublished new arch variety.)
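[Editor's note: a toy interpreter can make the explicit-bypass idea concrete. This is my own illustration of the move-machine concept, not code from the TU Delft MOVE project. One long instruction is a set of data transports; writing to a function unit's trigger port fires it, and a result can move straight to the next consumer over the result bus without ever spending a register file write port.]

```python
# Minimal transport-triggered ("move") machine sketch.
regs = {"r1": 5, "r2": 7, "r3": 0}
fu = {"add.a": 0, "add.r": 0}              # adder operand and result ports

def src(name):
    return regs[name] if name in regs else fu[name]

def execute(instruction):
    # One long instruction = several moves "src -> dst". (Parallel in
    # hardware; this toy applies them in listed order, operand before trigger.)
    for move in instruction:
        s, d = [x.strip() for x in move.split("->")]
        v = src(s)
        if d == "add.t":                   # trigger port: fires the adder
            fu["add.r"] = fu["add.a"] + v
        elif d in fu:
            fu[d] = v
        else:
            regs[d] = v

# r3 = (r1 + r2) + r1, with the intermediate sum bypassed on the result bus:
execute(["r1 -> add.a", "r2 -> add.t"])    # add.r = 12, no RF write
execute(["add.r -> add.a", "r1 -> add.t"]) # consume the result directly
execute(["add.r -> r3"])                   # only now spend an RF write port
print(regs["r3"])                          # -> 17
```

The intermediate sum (12) never touches the register file, which is where the reduction in bypass and register file port cost comes from: the compiler, not a forwarding network, decides which transports happen.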

                          > REGISTER FILE DESIGN
                          >
                          > ...
                          >
                          > One $64 design question, properly answerable only with a good compiler
                          > at your side, is how to then arrange the read ports. At one extreme, if
                          > each of the four banks is entirely isolated from the other, (like a
                          > left-brain/right-brain patient with their corpus callosum cut), then you
                          > only need 2 read ports on each bank.

                          This probably won't work out well. Such isolation _does_ work quite
                          well for clusters within a wide VLIW, but within the cluster you'll
                          need more communication (we have an upcoming paper about this;).

                          > Result forwarding (register file bypass): if an operation in slot i,
                          > cycle 1, is permitted with full generality, to consume the result of an
                          > operation from slot j, cycle 0, then you need n copies of an n:1 result
                          > forwarding mux. Again, this is very expensive, so you will be sorely
                          > tempted to reduce the generality of result forwarding, or eliminate it
                          > entirely. Again, this is a design tradeoff your compiler must help you
                          > select.

                          Generality makes scheduling easier, but not much, and is certainly
                          not needed. The effects of clustering are fairly well documented,
                          and we've had excellent results with scheduling for more general
                          limited interconnect VLIWs.
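[Editor's note: a back-of-envelope count makes the quoted "n copies of an n:1 mux" cost concrete. The assumptions here are mine: a plain 4-LUT fabric where a 2:1 mux per bit costs one LUT, so an n:1 mux costs about (n-1) LUTs per bit (a simple mux tree, no dedicated MUXF5/F6 help), a 32-bit datapath, and two operand muxes per issue slot.]

```python
# Rough LUT cost of fully general result forwarding vs issue width.
def forwarding_luts(n_slots, width=32):
    per_operand_mux = (n_slots - 1) * width    # n:1 mux, (n-1) 2:1 stages/bit
    return n_slots * 2 * per_operand_mux       # 2 operand muxes per slot

for n in (2, 4, 8):
    print(n, forwarding_luts(n))
# -> 2 128
#    4 768
#    8 3584
```

Cost grows as n*(n-1): doubling issue width roughly quadruples the forwarding mux fabric, which is the "bog down in its multiplexers" effect for 4+ issue, and why clustering or explicit bypass programming pays off.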

                          > Instead, you will want to use some form of predicated execution. Some
                          > computations will set predicate bits that represent the outcomes of
                          > conditional tests. Then other instructions' execution (result
                          > write-back and other side effects) will be predicated upon these
                          > specified predicate bits.

                          This won't work well with your proposed single-write-port register
                          files. When selecting between two potential producers of some value,
                          with a single write port, you'll have to serialize the predicated
                          register writes... Which tends to hurt on a VLIW. Predicates on a
                          VLIW make most sense with multiple write ports.

                          - Reinoud

                          PS. Did you ever consider writing a book on FPGA-CPU's?
                        • Jan Gray
                          Message 12 of 14 , Mar 18, 2002
                            > Hint, see:
                            > http://ce-serv.et.tudelft.nl/MOVE/

                            Ah, move machines. I will take a look, thank you.

                            > At one extreme,
                            > > if each of the four banks is entirely isolated from the
                            > other, (like a
                            > > left-brain/right-brain patient with their corpus callosum
                            > cut), then
                            > > you only need 2 read ports on each bank.
                            >
                            > This probably won't work out well. Such isolation _does_
                            > work quite well for clusters within a wide VLIW, but within
                            > the cluster you'll need more communication (we have an
                            > upcoming paper about this;).

                            I agree, I also doubt it will work well, but I was just trying to sketch
                            the extremes of possibility.

                            > > Instead, you will want to use some form of predicated
                            > execution. Some
                            > > computations will set predicate bits that represent the outcomes of
                            > > conditional tests. Then other instructions' execution (result
                            > > write-back and other side effects) will be predicated upon these
                            > > specified predicate bits.
                            >
                            > This won't work well with your proposed single-write-port
                            > register files. When selecting between two potential
                            > producers of some value, with a single write port, you'll
                            > have to serialize the predicated register writes... Which
                            > tends to hurt on a VLIW. Predicates on a VLIW make most
                            > sense with multiple write ports.

                            I advocate 1- or 2-write *data* register files for LUT RAM
                            implementations. This is not a constraint for predicate register files.
                            N-read m-write predicate register files can easily be made from
                            assemblies of flip-flops, logic to drive CEs, and logic for output
                            multiplexers.
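[Editor's note: a minimal cycle model of that flip-flop predicate file, with Python standing in for the HDL. This is my own sketch, not Jan's design; it is sized 8 x 1-bit with 2 write ports and per-flop CE logic, and each read port is just an n:1 mux over the flop outputs.]

```python
# n-read, m-write predicate register file built from flip-flops.
class PredFile:
    def __init__(self, nregs=8):
        self.flops = [0] * nregs            # one flip-flop per predicate bit

    def clock(self, writes):
        # writes: list of (addr, value, enable), one tuple per write port.
        # The decoded write enables drive per-flop CEs, so all ports with
        # distinct addresses update in the same cycle.
        for addr, value, enable in writes:
            if enable:
                self.flops[addr] = value & 1

    def read(self, addr):                   # a read port: output mux on flops
        return self.flops[addr]

pf = PredFile()
# Two compare results retired in one cycle through two write ports:
pf.clock([(1, 1, True), (2, 0, True)])
print(pf.read(1), pf.read(2))               # -> 1 0
```

Since each register is one bit, the flop + CE + mux cost stays tiny even with many ports, unlike the 32-bit data register file where LUT RAM limits you to 1 or 2 writes.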

                            By the way, on my trip down to SF for ESC, I took along "Readings in
                            Computer Architecture", ed. by Mark Hill et al. It's a wonderful book,
                            most highly recommended, a compilation of 5 dozen important modern
                            computer architecture papers, well organized and set in context.
                            Apropos to this discussion is the paper "A Comparison of Full and
                            Partial Predicated Execution Support for ILP Processors" by Mahlke et
                            al, which compares fully predicated architectures to traditional
                            non-predicated superscalars retrofitted with CMOVE (if (pred) rdest =
                            rsrc) and SELECT (rdest = pred ? rsrc1 : rsrc2). This paper provides a
                            nice introduction to ILP compiler support (superblocks and if-conversion
                            etc.) and how to use that to take e.g. a 14 cycle grep inner loop down
                            to 6 cycles in a 4-issue 1-branch VLIW.

                            By the way^2, this is the kind of compilation IA64 must do well to
                            compete with out-of-order superscalars. I suspect one of IA64's
                            greatest challenges will be the social engineering required to make
                            software developers change from
                            compile -optimized *.c
                            to
                            compile -profiling *.c
                            profile app against representative workloads
                            compile -profile-feedback-optimizing *.c
                            The latter is more difficult to get right and to institutionalize in
                            makefiles or what-have-you.

                            > PS. Did you ever consider writing a book on FPGA-CPU's?

                            Thank you very much for the compliment. '96-'00 I wrote a few book
                            outlines, and discussed the project with a leading publisher, but never
                            pushed the button.

                            "Writing is an adventure. To begin with, it is a toy and an amusement.
                            Then it becomes a mistress, then it becomes a master, then it becomes a
                            tyrant. The last phase is that just as you are about to be reconciled to
                            your servitude, you kill the monster and fling him to the public." --
                            Churchill.

                            The Circuit Cellar articles were an attempt to get the book bug out of
                            my system. Didn't work.

                            Jan Gray, Gray Research LLC
                          • Reinoud
                            Message 13 of 14 , Apr 1, 2002
                              Another rather late reply (just back from vacation :-).

                              "Jan Gray" <jsgray@...> wrote:
                              > > > Instead, you will want to use some form of predicated execution. Some
                              > > > computations will set predicate bits that represent the outcomes of
                              > > > conditional tests. Then other instructions' execution (result
                              > > > write-back and other side effects) will be predicated upon these
                              > > > specified predicate bits.
                              > >
                              > > This won't work well with your proposed single-write-port
                              > > register files. When selecting between two potential
                              > > producers of some value, with a single write port, you'll
                              > > have to serialize the predicated register writes... Which
                              > > tends to hurt on a VLIW. Predicates on a VLIW make most
                              > > sense with multiple write ports.
                              >
                              > I advocate 1- or 2-write *data* register files for LUT RAM
                              > implementations. This is not a constraint for predicate register files.

                              But I _was_ talking about data register files, not predicate register
                              files. Using predicates in combination with data register files with a
                              single write port will often force serialization of execution. For
                              example, try if-converting x=(c)?a:b;. With a single write port you
                              are forced to place two assignments to x in separate instruction
                              words. Select operations may be more effective than predicates for
                              such an architecture.

                              > "Writing is an adventure. To begin with, it is a toy and an amusement.
                              > Then it becomes a mistress, then it becomes a master, then it becomes a
                              > tyrant. The last phase is that just as you are about to be reconciled to
                              > your servitude, you kill the monster and fling him to the public." --
                              > Churchill.

                              Yeah, that's why I asked if you would consider writing one. I sure
                              wouldn't :-).

                              - Reinoud
                            • Jan Gray
                              Message 14 of 14 , Apr 1, 2002
                                > But I _was_ talking about data register files, not predicate
                                > register files. Using predicates in combination with data
                                > register files with a single write port will often force
                                > serialization of execution. For example, try if-converting
                                > x=(c)?a:b;. With a single write port you are forced to place
                                > two assignments to x in separate instruction words. Select
                                > operations may be more effective than predicates for such an
                                > architecture.

                                Ah, I see what you mean (I think):
                                if (c)
                                x = a;
                                else
                                x = b;
                                if-converts to (predicated)
                                x = a when c
                                x = b when !c

                                In a VLIW with a general register file, you can schedule these two issue
                                slots in the same wide instruction. In a VLIW with a partitioned
                                register file as we've been discussing, we would have to schedule it in
                                two successive instructions.

                                Now we could certainly extend the ALU to do SELECT (x = pred ? a : b)
                                and still get by with 2-read 1-write register file slices.

                                Why not both? All issue slots predicated + add SELECT to the ALU
                                operation repertoire...?

                                Jan Gray, Gray Research LLC
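[Editor's note: the serialization argument in this exchange can be checked with a toy scheduler. This sketch is entirely mine, not from the thread: ops are (dest_bank, text) tuples, not any real ISA, and the only resource modeled is one register-file write port per bank per long instruction.]

```python
# With 1 write port/bank, two predicated writes to x serialize; SELECT doesn't.
def schedule(ops, write_ports_per_bank=1):
    """Greedily pack ops into long instruction words, limited by writes/bank."""
    words, current, used = [], [], {}
    for dest_bank, kind in ops:
        if used.get(dest_bank, 0) >= write_ports_per_bank:
            words.append(current)            # bank's write port already taken:
            current, used = [], {}           # start a new instruction word
        current.append(kind)
        used[dest_bank] = used.get(dest_bank, 0) + 1
    words.append(current)
    return words

predicated = [(0, "x = a when c"), (0, "x = b when !c")]
with_select = [(0, "x = c ? a : b")]

print(len(schedule(predicated)))             # -> 2 instruction words
print(len(schedule(with_select)))            # -> 1 instruction word
```

Both predicated assignments target the same bank, so the if-converted pair needs two successive words, while the single SELECT retires in one; predication still earns its keep elsewhere (suppressing side effects, stores, multi-bank schedules), which is the case for supporting both.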