RE: [fpga-cpu] VLIW processors anyone ?

  • Jan Gray
    Message 1 of 14 , Mar 11, 2002
      > I want to do a VLIW processor in an FPGA, but I'm not confident what
      > I'm doing. Does anyone have links to sample (educational) VLIW
      > processor designs ? I've searched the net but can't seem to find
      > anything.

      INTRODUCTION TO VLIW

      VLIW ::= very long instruction word: a machine with one instruction
      pointer (how's that, Reinoud? :-) ) that selects a long instruction word
      that issues a plurality of operations to a plurality of function units
      each cycle. In a VLIW system, the compiler schedules the multiple
      operations into instructions; in contrast, in a superscalar RISC, the
      instruction issue hardware schedules one or more scalar instructions to
      issue at the same time. You may also see the term LIW vs. VLIW. I
      think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine.
      In some VLIWs, to improve code density, a variable number of operations
      can be issued each cycle.
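To make the definition concrete, here is a toy Python model (all names invented, not from any real ISA): each long instruction word is a tuple of operations, one per issue slot, all issued in the same cycle according to a schedule fixed entirely ahead of time.

```python
# Toy model of VLIW issue: each long instruction word is a fixed tuple of
# operations, one per function unit, scheduled entirely by the "compiler".

def run(program, regs):
    """Execute a list of long instruction words; each word is a tuple of
    (dest, src1, src2) add operations, one per issue slot, all of which
    issue in the same cycle."""
    for word in program:
        # All operands are read before any result is written back,
        # as they would be within a single hardware cycle.
        results = [(d, regs[s1] + regs[s2]) for (d, s1, s2) in word]
        for d, v in results:
            regs[d] = v
    return regs

regs = {f"r{i}": i for i in range(8)}
# One 2-op long instruction word: r6 = r1+r2 and r7 = r3+r4 issue together.
run([(("r6", "r1", "r2"), ("r7", "r3", "r4"))], regs)
print(regs["r6"], regs["r7"])  # 3 7
```

The point is that nothing at "run time" decides what issues together; the grouping is baked into the instruction word.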

      There's a rich literature on VLIWs (that I don't purport to have read),
      but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW
      Architectures.

      The IA64 EPIC "explicitly parallel instruction computing" architecture
      is the most notable VLIW derivative. Several DSPs like the TI C6x are
      LIWs. A famous early VLIW was the Multiflow Trace family, a
      supercomputer startup that lived approximately 1984-1990.

      The challenging part of a VLIW project is the compiler. Unless you're a
      glutton for tricky assembly programming, or have a very small and
      specific problem, it's hardly worth designing the hardware if you don't
      have a way to compile code to use the parallelism presented by the
      hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow
      you to pour C code through a compiler and simulator and configure the
      number and kind of function units to best trade off performance and area
      against the specific computation kernel you care about.



      Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.

      To summarize my opinions: a 2-issue machine is straightforward and
      even worthwhile, whereas (depending upon the computation kernel) a 4+
      issue machine may well bog down in its multiplexers and therefore may
      not make the best use of FPGA resources.

      INSTRUCTION FETCH AND ISSUE

      No sweat. Store instructions in a block RAM based instruction cache.
      Each BRAM in Virtex derived architectures can read 2x16=32 bits each
      cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4
      BRAMs and you can easily fetch 100-200 bits of instruction word each
      cycle.

      Keeping the I-cache fed is another matter, left as an exercise.

      REGISTER FILE DESIGN

      The key design problem is the register file organization. If you are
      going to issue n operations each cycle, you must fetch between n and
      2*n operands and retire up to n results each cycle. That implies a
      1-2*n-read, n-write per cycle register file.

      As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU
      operation requires about T ns, then you can read or write to a single
      port LUT RAM (or read AND write to a dual port LUT RAM) in about T/2 ns.
      (In Virtex-II, T is about 5 ns).

      Let's discuss an n=4 operation machine. It is a challenge to retire 4
      results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons
      to 2T, you can sequentially retire all four results in 4*T/2 ns, and
      read operands in parallel using dual port LUT RAM.

      Thus in a fully general organization, where any of the four results can
      retire to any four registers, "it can't be done" in T ns as defined
      above.

      Instead, consider a partitioned register file design. For instance,
      imagine a 64-register machine partitioned into 4 banks of 16 registers
      each. Each cycle, one result can be retired into each bank. That is
      easily targeted to a LUT RAM implementation. Indeed, you can fix the
      machine such that each issue slot is associated with one write port on
      the register file.

      Say bank m is assigned registers i, i div 16 == m, with 0<=i<64, 0<=m<4.
      We can easily issue four independent instructions such as
      r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50
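As an illustrative sketch (register numbering as above, everything else invented), a toy Python checker for this partitioned organization: bank m holds registers i with i div 16 == m, and a group of results can retire in one cycle only if no bank needs more than its single write port.

```python
# Toy model of the partitioned register file: 64 registers in 4 banks of 16
# (bank m holds registers i with i div 16 == m), one write port per bank.

NBANKS, BANKSIZE = 4, 16

def bank(reg):
    """Bank index of register number reg."""
    return reg // BANKSIZE

def writeable(dest_regs):
    """dest_regs: destination register numbers retiring this cycle.
    True iff no bank receives more than one result (one write port each)."""
    banks = [bank(r) for r in dest_regs]
    return len(banks) == len(set(banks))

print(writeable([0, 16, 32, 48]))  # True: one result per bank
print(writeable([0, 1, 32, 48]))   # False: r0 and r1 both live in bank 0
```

A scheduling compiler would use exactly this kind of check when packing operations into a long instruction word.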

      One $64 design question, properly answerable only with a good compiler
      at your side, is how to then arrange the read ports. At one extreme, if
      each of the four banks is entirely isolated from the others (like a
      left-brain/right-brain patient with their corpus callosum cut), then you
      only need 2 read ports on each bank. In this organization, if you need
      a register or two from another bank, you would typically burn an issue
      slot in that sourcing bank to read the registers, and maybe burn another
      slot in the destination bank to save the registers. (Alternately in the
      destination bank you can directly mux it into the dependent operation
      register).

      At the other extreme, if any of the four banks can, with full
      generality, read operands from any of the other banks, e.g.

      r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8 // cycle 1
      r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24; // cycle 2
      r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40; // cycle 3
      r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56 // cycle 4

      Then each of the four banks would need to provide 8 read ports, and each
      of the 8 operand multiplexers (in front of the four ALUs) would need at
      least a 4-1 mux.
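The port arithmetic behind that claim can be written down directly (a back-of-envelope model, not a netlist): with full cross-bank generality, every bank must be able to feed both operands of every slot.

```python
# Back-of-envelope port costs for an n-issue machine with b banks, assuming
# full cross-bank generality: in the worst case all 2*n operands come from
# one bank, so each bank needs 2*n read ports, and each of the 2*n operand
# multiplexers (in front of the ALUs) must select among the b banks.

def port_costs(n_issue, n_banks):
    read_ports_per_bank = 2 * n_issue   # worst case: all operands, one bank
    operand_muxes = 2 * n_issue         # one mux per ALU operand
    mux_inputs = n_banks                # each mux picks among the banks
    return read_ports_per_bank, operand_muxes, mux_inputs

print(port_costs(4, 4))  # (8, 8, 4): the 8 read ports and 4-1 muxes above
```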

      You can of course provide all these ports by replicating or
      multi-cycling the register file LUT RAMs, but it won't be pretty.

      In practice, I believe that the expected degree of cross-bank register
      reading is limited. Maybe you typically only need 3 read ports per
      bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other
      slots' accesses. But you have to have a scheduling compiler to help you
      make these tradeoffs.

      By the way, in my 1994 brain dump
      (http://fpgacpu.org/usenet/homebrew.html), with the sketch of the NVLIW1
      2-issue LIW, I used exactly this partitioned register file technique,
      combining two banks of 3r-1w register files. For each issue slot, one
      operand was required to be a register in that bank; but the other
      operand could be a register in either bank.

      Another option is to combine multibanked registers with some heavily
      multiported registers. For instance, assume each issue slot can read or
      write 16 bank registers and 16 shared registers, subject across issue
      slots to some maximum number of shared register file read and write port
      accesses each cycle. The designers of the Sun MAJC used a similar
      approach.

      EXECUTION UNIT

      Assume a design with 4 ALUs, one branch unit, one memory port.

      Loads/stores: Perhaps you limit generality and only allow loads/stores
      to issue from a given slot (and then shuttle those results to other slot
      register banks using the above multiported reg file description). The
      general alternative, allowing any slot to issue loads/stores, requires
      you to mux any ALU output to be the effective address, and a per-bank
      mux to allow the memory data register (MDR) input to be the result.

      Result forwarding (register file bypass): if an operation in slot i,
      cycle 1, is permitted with full generality, to consume the result of an
      operation from slot j, cycle 0, then you need n copies of an n:1 result
      forwarding mux. Again, this is very expensive, so you will be sorely
      tempted to reduce the generality of result forwarding, or eliminate it
      entirely. Again, this is a design tradeoff your compiler must help you
      select.
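Behaviorally, each forwarding mux makes a simple choice (a toy Python model, names invented): take the just-produced result from last cycle if one exists, else the stale register-file copy.

```python
# Toy model of one operand's result-forwarding (bypass) mux: prefer a result
# produced last cycle, not yet visible in the register file, over the
# register file's copy. In an n-issue machine with full generality, each of
# the n slots needs such a mux selecting among n possible producers.

def bypass(operand_reg, regfile, last_results):
    """Read operand_reg; last_results maps registers written last cycle
    to their in-flight values (the bypass network)."""
    if operand_reg in last_results:   # the n:1 mux selects the bypass path
        return last_results[operand_reg]
    return regfile[operand_reg]

regfile = {"r1": 10}
print(bypass("r1", regfile, {"r1": 99}))  # 99: forwarded from last cycle
print(bypass("r1", regfile, {}))          # 10: from the register file
```

In hardware the cost is not the choice itself but the n copies of the n:1 mux and the wiring feeding them, which is why reducing generality here is so tempting.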

      CONTROL FLOW

      You want to reduce the number of branches. In average C code you can
      expect a branch every 5 (rule of thumb) generic RISC instructions or so.
      In a 4 issue machine you will spend your life branching.

      Instead, you will want to use some form of predicated execution. Some
      computations will set predicate bits that represent the outcomes of
      conditional tests. Then other instructions' execution (result
      write-back and other side effects) will be predicated upon these
      specified predicate bits.
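A toy Python model of that scheme (opcode names and encoding invented for illustration): a compare sets a predicate bit, and a predicated operation's write-back is squashed when its predicate is false, so a simple if/else needs no branch at all.

```python
# Toy model of predicated execution: "cmplt" sets a predicate bit from a
# conditional test; "padd" names a predicate, and its write-back is
# squashed when that predicate is false.

def execute(instrs, regs, preds):
    for ins in instrs:
        op = ins[0]
        if op == "cmplt":                  # p = (ra < rb)
            _, p, ra, rb = ins
            preds[p] = regs[ra] < regs[rb]
        elif op == "padd":                 # if (p) rd = ra + rb
            _, p, rd, ra, rb = ins
            if preds[p]:                   # squash write-back if p is false
                regs[rd] = regs[ra] + regs[rb]
    return regs

regs = {"r1": 1, "r2": 2, "r3": 0}
preds = {}
execute([("cmplt", "p0", "r1", "r2"),      # p0 = (r1 < r2) = True
         ("padd",  "p0", "r3", "r1", "r2") # executes: r3 = r1 + r2
         ], regs, preds)
print(regs["r3"])  # 3
```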

      In this organization, it seems reasonable to allow any issue slot to
      establish predicates that can be used by any other issue slot; but for
      simplicity you will only need and want one, or perhaps two, of the issue
      slots to be able to issue (predicated) branch and jump instructions.

      You will want a compiler to help you design the specific details of the
      predication system.

      IN SUMMARY

      The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW
      in an FPGA. That's just stamping out more instances of the i-cache +
      execution unit core slice. Whether the resulting machine runs much
      faster than a much simpler 2-issue LIW is critically dependent upon your
      compiler technology.


      ASIDE: WM

      As an aside, I'll just briefly mention Wulf's WM machine
      (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
      many interesting features, but of particular interest here, each execute
      instruction is of the form
      rd = (rs1 op rs2) op rs3 ,
      an explicitly parallel representation that affords some additional
      parallelism over the canonical scalar RISC, but which does not increase
      the number of write ports required on the register file.
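A toy Python rendering of that instruction form (operators and register names illustrative): two dependent operations execute per instruction, yet only one result retires, so the register file still needs just one write port per issue slot.

```python
# Toy model of a WM-style execute instruction, rd = (rs1 op1 rs2) op2 rs3:
# two chained operations, but a single destination register, so no extra
# register file write port is required over a plain scalar RISC.
import operator

OPS = {"+": operator.add, "-": operator.sub}

def wm_exec(regs, rd, rs1, op1, rs2, op2, rs3):
    regs[rd] = OPS[op2](OPS[op1](regs[rs1], regs[rs2]), regs[rs3])

regs = {"r0": 0, "r1": 5, "r2": 3, "r3": 2}
wm_exec(regs, "r0", "r1", "+", "r2", "-", "r3")  # r0 = (5 + 3) - 2
print(regs["r0"])  # 6
```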

      Jan Gray
      Gray Research LLC
    • Jan Gray
      Message 2 of 14 , Mar 11, 2002
        > Current design: http://www.birdcomputer.ca/blackbird.htm

        I see this design reflects many of the issues in my last message. I
        should read first, write second!

        Jan Gray
        Gray Research LLC
      • rtfinch35
        Message 3 of 14 , Mar 11, 2002
          >
          > The challenging part of a VLIW project is the compiler. Unless
          you're a
          > glutton for tricky assembly programming, or have a very small and
          > specific problem, it's hardly worth designing the hardware if you
          don't
          >

          I'm planning on using the Transmeta approach. I can write a simple
          virtual instruction set to real instruction set translator. At first
          I'm aiming for a direct 1:1 translation that won't provide any
          speedup (in fact it'll slow things down because of the wider
          instruction word). As the translator gets better, performance will
          improve. (Perhaps I could use the xr16 ISA as the virtual instruction
          set? :)

          >
          > Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.
          >
          > To summarize my opinions, is a 2-issue machine is straightforward
          and
          > even worthwhile, whereas (depending upon computation kernel) a 4+
          issue
          > machine may well bog down in its multiplexers and therefore may not
          make
          > the best use of FPGA resources.
          >

          I quickly put together a datapath and it looks like the datapath
          takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
          course this is without all the other extras needed to make it work,
          and assuming I got it right (a pretty big assumption). The thing that
          appears to slow it down is actually the barrel shifter which I may
          get rid of by limiting shifts to 0 to 3 bits at once. I think I can
          get a peak instruction completion rate of five instructions per cycle
          (I admit a little contrived) but it makes for great marketing. (e.g.
          1) a compare (target = condition codes), 2) a store (no target reg),
          3) an effective address calc (target = special addr reg) 4) a branch,
          5) some alu operation 6) maybe ? Completing five instructions per
          cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
          can't I ?


          > INSTRUCTION FETCH AND ISSUE
          >
          > No sweat. Store instructions in a block RAM based instruction cache.

          Planning on using a 96 bit (or possibly 128 bit) wide i-cache, with a
          32 bit memory interface.


          > and retire up to n results each cycle. That implies a 1-2*n-read
          > n-write per cycle register file.

          Um, I think you need about half as many write ports because only
          about 50% of instructions actually update registers.


          > CONTROL FLOW
          >
          > Instead, you will want to use some form of predicated execution.

          I was thinking of using branch prediction.

          > ASIDE: WM
          >
          > As an aside, I'll just briefly mention Wulf's WM machine
          > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with

          I tried to access some of the docs but the links seem to be broken :(

          > instruction is of the form
          > rd = (rs1 op rs2) op rs3 ,

          If the operations are stacked like this, wouldn't it increase the
          cycle time of the processor ?

          Rob
        • Jan Gray
          Message 4 of 14 , Mar 12, 2002
            > I'm planning on using the Transmeta approach. I can write a simple
            > virtual instruction set to real instruction set translater. At first
            > I'm aiming for a direct 1:1 translation that won't provide any
            > speedup (in fact it'll slow things down because of the wider
            > instruction word). As the translater gets better, performance will
            > improve. (Perhaps I could use the xr16 ISA as the virtual instruction
            > set? :)

            Be our guest...

            > I quickly put together a datapath and it looks like the datapath
            > takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
            > course this is without all the other extras needed to make it work,
            > and assuming I got it right (a pretty big assumption). The thing that
            > appears to slow it down is actually the barrel shifter which I may
            > get rid of by limiting shifts to 0 to 3 bits at once. I think I can
            > get a peak instruction completion rate of five instructions per cycle
            > (I admit a little contrived) but it makes for great marketing. (eg.
            > 1) a compare (target = condition codes), 2) a store (no target reg),
            > 3) an effective address calc (target = special addr reg) 4) a branch,
            > 5) some alu operation 6) maybe ? Completing five instructions per
            > cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
            > can't I ?

            What matters:
            * wall clock time
            * total energy consumed by the computation
            * FPGA area
            * dev tools ease of use.

            What doesn't matter:
            * aggregate MIPS

            :-) See e.g. http://fpgacpu.org/usenet/array.html for a (useless) 10^13
            ops/s machine.

            > > and retire up to n results each cycle. That implies a 1-2*n-read
            > > n-write per cycle register file.

            > Um, I think you need about half as many write ports because only
            > about 50% of instructions actually update registers.

            Certainly, but it depends upon whether and how you handle the worst
            cases, whether you have multiplexers between ALUs and write port(s),
            etc.

            > > CONTROL FLOW
            > >
            > > Instead, you will want to use some form of predicated execution.
            >
            > I was thinking of using branch prediction.

            That will help. So you only fetch useless instructions on a mispredict?

            > > ASIDE: WM
            > >
            > > As an aside, I'll just briefly mention Wulf's WM machine
            > > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
            >
            > I tried to access some of the docs but the links seem to be broken :(

            Hmm. I have the ACM SIGARCH paper on it around here somewhere. The two
            big architectural ideas are the two ops per instruction and the implicit
            streaming memory ports exposed as special purpose registers.

            > > instruction is of the form
            > > rd = (rs1 op rs2) op rs3 ,
            >
            > If the operations are stacked like this, wouldn't it increase the
            > cycle time of the processor ?

            No and no.

            First, IIRC, WM does the second op in a second pipeline stage.

            Second, as I first learned during the Sun SuperSPARC presentation at Hot
            Chips III (1991) (I think it was), the delay through two ALUs (a+b)+c is
            not necessarily the same as two times the delay through one ALU (a+b),
            since there is concurrency -- lower bits of (a+b) can be added to lower
            bits of c even as upper bits of (a+b) are still ripple-carry-settling.

            <Pause while I go and run some tests.>

            OK, here are two quick and dirty test circuits, sans floorplanning:

            module onealu(clk, i, o);
              input clk, i;
              output o;
              reg o;

              reg [31:0] a, b, sum;
              reg z;

              always @(posedge clk) begin
                // Shift i through {a,b} (top bit truncated) so the
                // inputs can't be optimized away.
                {a,b} <= {a,b,i};
                sum <= a+b;      // one 32-bit adder between pipeline regs
                z <= sum == 0;
                o <= z;
              end
            endmodule


            module twoalus(clk, i, o);
              input clk, i;
              output o;
              reg o;

              reg [31:0] a, b, c, sum;
              reg z;

              always @(posedge clk) begin
                // Shift i through {a,b,c} (top bit truncated) so the
                // inputs can't be optimized away.
                {a,b,c} <= {a,b,c,i};
                sum <= (a+b)+c;  // two chained 32-bit adders in one stage
                z <= sum == 0;
                o <= z;
              end
            endmodule

            'onealu' approximates the delay from pipeline registers, through an
            adder, and back into pipeline registers.
            'twoalus' approximates the same thing, but through two adders.

            Here are the TRCE numbers for Virtex-II -5:

            onealu:
              Slack:               -0.237ns (requirement - (data path - negative clock skew))
              Source:              b[0]
              Destination:         sum[31]
              Requirement:         4.500ns
              Data Path Delay:     4.737ns (Levels of Logic = 18)
              Negative Clock Skew: 0.000ns
              Source Clock:        clk_c rising at 0.000ns
              Destination Clock:   clk_c rising at 4.500ns

            Data Path: b[0] to sum[31]
            Location          Delay type  Delay(ns)  Physical Resource / Logical Resource(s)
            --------------------------------------------------------------------------
            L3.IQ1 Tiockiq 0.598 i
            b[0]
            SLICE_X0Y0.F1 net (fanout=2) 0.516 b[0]
            SLICE_X0Y0.COUT Topcyf 0.755 sum[0]
            sum_1_axb_0
            sum_1_cry_0
            sum_1_cry_1
            SLICE_X0Y1.CIN net (fanout=1) 0.000 sum_1_cry_1/O
            SLICE_X0Y1.COUT Tbyp 0.092 sum[2]
            SLICE_X0Y2.CIN net (fanout=1) 0.000 sum_1_cry_3/O
            SLICE_X0Y2.COUT Tbyp 0.092 sum[4]
            SLICE_X0Y3.CIN net (fanout=1) 0.000 sum_1_cry_5/O
            SLICE_X0Y3.COUT Tbyp 0.092 sum[6]
            SLICE_X0Y4.CIN net (fanout=1) 0.000 sum_1_cry_7/O
            SLICE_X0Y4.COUT Tbyp 0.092 sum[8]
            SLICE_X0Y5.CIN net (fanout=1) 0.000 sum_1_cry_9/O
            SLICE_X0Y5.COUT Tbyp 0.092 sum[10]
            SLICE_X0Y6.CIN net (fanout=1) 0.000 sum_1_cry_11/O
            SLICE_X0Y6.COUT Tbyp 0.092 sum[12]
            SLICE_X0Y7.CIN net (fanout=1) 0.000 sum_1_cry_13/O
            SLICE_X0Y7.COUT Tbyp 0.092 sum[14]
            SLICE_X0Y8.CIN net (fanout=1) 0.000 sum_1_cry_15/O
            SLICE_X0Y8.COUT Tbyp 0.092 sum[16]
            SLICE_X0Y9.CIN net (fanout=1) 0.000 sum_1_cry_17/O
            SLICE_X0Y9.COUT Tbyp 0.092 sum[18]
            SLICE_X0Y10.CIN net (fanout=1) 0.000 sum_1_cry_19/O
            SLICE_X0Y10.COUT Tbyp 0.092 sum[20]
            SLICE_X0Y11.CIN net (fanout=1) 0.000 sum_1_cry_21/O
            SLICE_X0Y11.COUT Tbyp 0.092 sum[22]
            SLICE_X0Y12.CIN net (fanout=1) 0.000 sum_1_cry_23/O
            SLICE_X0Y12.COUT Tbyp 0.092 sum[24]
            SLICE_X0Y13.CIN net (fanout=1) 0.000 sum_1_cry_25/O
            SLICE_X0Y13.COUT Tbyp 0.092 sum[26]
            SLICE_X0Y14.CIN net (fanout=1) 0.000 sum_1_cry_27/O
            SLICE_X0Y14.COUT Tbyp 0.092 sum[28]
            SLICE_X0Y15.CIN net (fanout=1) 0.000 sum_1_cry_29/O
            SLICE_X0Y15.Y Tciny 1.257 sum[30]
            sum_1_cry_30
            sum_1_s_31
            SLICE_X0Y15.DY net (fanout=1) 0.001 sum_1[31]
            SLICE_X0Y15.CLK Tdyck 0.322 sum[30]
            sum[31]
            --------------------------------------------------------------------------
            Total                          4.737ns (4.220ns logic, 0.517ns route)
                                                   (89.1% logic, 10.9% route)

            twoalus:
              Destination:         sum[31]
              Requirement:         6.000ns
              Data Path Delay:     6.773ns (Levels of Logic = 18)
              Negative Clock Skew: 0.000ns
              Source Clock:        clk_c rising at 0.000ns
              Destination Clock:   clk_c rising at 6.000ns

            Data Path: a[2] to sum[31]
            Location          Delay type  Delay(ns)  Physical Resource / Logical Resource(s)
            --------------------------------------------------------------------------
            SLICE_X11Y0.YQ Tcko 0.493 a[3]
            a[2]
            SLICE_X8Y1.F1 net (fanout=2) 0.749 a[2]
            SLICE_X8Y1.COUT Topcyf 0.668 sum_1_0[2]
            sum_1_0_axb_2
            sum_1_0_cry_2
            sum_1_0_cry_3
            SLICE_X8Y2.CIN net (fanout=1) 0.000 sum_1_0_cry_3/O
            SLICE_X8Y2.COUT Tbyp 0.092 sum_1_0[4]
            SLICE_X8Y3.CIN net (fanout=1) 0.000 sum_1_0_cry_5/O
            SLICE_X8Y3.COUT Tbyp 0.092 sum_1_0[6]
            SLICE_X8Y4.CIN net (fanout=1) 0.000 sum_1_0_cry_7/O
            SLICE_X8Y4.COUT Tbyp 0.092 sum_1_0[8]
            SLICE_X8Y5.CIN net (fanout=1) 0.000 sum_1_0_cry_9/O
            SLICE_X8Y5.COUT Tbyp 0.092 sum_1_0[10]
            SLICE_X8Y6.CIN net (fanout=1) 0.000 sum_1_0_cry_11/O
            SLICE_X8Y6.COUT Tbyp 0.092 sum_1_0[12]
            SLICE_X8Y7.CIN net (fanout=1) 0.000 sum_1_0_cry_13/O
            SLICE_X8Y7.COUT Tbyp 0.092 sum_1_0[14]
            SLICE_X8Y8.CIN net (fanout=1) 0.000 sum_1_0_cry_15/O
            SLICE_X8Y8.COUT Tbyp 0.092 sum_1_0[16]
            SLICE_X8Y9.CIN net (fanout=1) 0.000 sum_1_0_cry_17/O
            SLICE_X8Y9.COUT Tbyp 0.092 sum_1_0[18]
            SLICE_X8Y10.CIN net (fanout=1) 0.000 sum_1_0_cry_19/O
            SLICE_X8Y10.COUT Tbyp 0.092 sum_1_0[20]
            SLICE_X8Y11.CIN net (fanout=1) 0.000 sum_1_0_cry_21/O
            SLICE_X8Y11.COUT Tbyp 0.092 sum_1_0[22]
            SLICE_X8Y12.CIN net (fanout=1) 0.000 sum_1_0_cry_23/O
            SLICE_X8Y12.COUT Tbyp 0.092 sum_1_0[24]
            SLICE_X8Y13.CIN net (fanout=1) 0.000 sum_1_0_cry_25/O
            SLICE_X8Y13.COUT Tbyp 0.092 sum_1_0[26]
            SLICE_X8Y14.CIN net (fanout=1) 0.000 sum_1_0_cry_27/O
            SLICE_X8Y14.Y Tciny 1.257 sum_1_0[28]
            sum_1_0_cry_28
            sum_1_0_s_29
            SLICE_X9Y14.G2 net (fanout=1) 0.403 sum_1_0[29]
            SLICE_X9Y14.COUT Topcyg 0.519 sum[28]
            sum_1_1_axb_29
            sum_1_1_cry_29
            SLICE_X9Y15.CIN net (fanout=1) 0.000 sum_1_1_cry_29/O
            SLICE_X9Y15.Y Tciny 1.257 sum[30]
            sum_1_1_cry_30
            sum_1_1_s_31
            SLICE_X9Y15.DY net (fanout=1) 0.001 sum_1[31]
            SLICE_X9Y15.CLK Tdyck 0.322 sum[30]
            sum[31]
            --------------------------------------------------------------------------
            Total                          6.773ns (5.620ns logic, 1.153ns route)
                                                   (83.0% logic, 17.0% route)


            4.7 ns for one ALU in one pipeline stage
            6.8 ns for two ALUs in one pipeline stage

            Hmm, that's worse than I had expected. Looking above, the expensive bit
            is the extra Topcyg and especially Tciny of the second adder. All we
            are saving with the concurrency is the 15*Tbyp of about 1.4 ns.

            Jan Gray, Gray Research LLC
          • Campbell, John
            Message 5 of 14 , Mar 12, 2002
              Interesting.
              But why do both onealu and twoalus have 18 levels of
              logic?


              > Data Path Delay: 4.737ns (Levels of Logic = 18)

              > Data Path Delay: 6.773ns (Levels of Logic = 18)

              I would have expected (probably naively) them to be 32
              and 64, respectively. So what gives?
              -jc
            • Jan Gray
              Message 6 of 14 , Mar 12, 2002
                > Interesting.
                > But why do both onealu and twoalus have 18 levels of
                > logic?
                >
                >
                > > Data Path Delay: 4.737ns (Levels of Logic = 18)
                >
                > > Data Path Delay: 6.773ns (Levels of Logic = 18)
                >
                > I would have expected (probably naively) them to be 32
                > and 64, respectively. So what gives?
                > -jc

                Ah, that illustrates my point.

                First of all, they're not 32/64, because they're not strictly "levels
                of logic" but more precisely "levels of clumps of primitives": the
                Tbyp is per 2-bit clump, an adder slice, so your question might
                equivalently be: why not 15 Tbyps for one adder, and 30 Tbyps for
                the 2-adder version?

                Second, contemplate that there is no path from the MSB (a+b)[31] to
                the LSB of ((a+b)+c)[0]. Contemplate that ((a+b)+c)[15] depends upon
                (a+b)[15] and c[15] and therefore upon a[15:0] and b[15:0], but not
                upon (a+b)[31:16] nor upon a[31:16] nor b[31:16].

                To put it prosaically, carries ripple up the first adder, sums shoot over
                to the second adder, and then ripple up there some more. But carries do
                NOT ripple up the first adder, leave it in sum bits, and then somehow
                enter the second adder at a lower bit index than the carry left the
                first adder.

                So if you consider the MSB of the second adder, ((a+b)+c)[31], it is
                indeed dependent upon all of a[31:0], b[31:0], c[31:0] but the longest
                path from any of those registers to ((a+b)+c)[31] is only 15 or so
                Tbyp's.

                (If all this doesn't help, draw a picture of two side by side adders and
                observe how the carry and sum bits propagate only upwards and over, and
                since the two adder structure is never taller than 16 slices tall, there
                are at most 15 Tbyp delays even across the two adders.)

                Third, note that TRCE output is sorted by delay and not by levels of
                logic. If there were a circuit with 30+ "levels of logic", such as a
                64-bit adder, that was faster, it would not be considered the critical
                path and would not be reported by TRCE (if you perform the "report
                timing violation paths" filter).

                Jan Gray, Gray Research LLC
              • Perez Ramas, Javier Basilio
                Message 7 of 14 , Mar 14, 2002
                  Hi,

                  > -----Mensaje original-----
                  > De: Alessandro Noriaki Ide [mailto:noriaki@...]
                  > Enviado el: jueves 14 de marzo de 2002 14:26
                  > Para: fpga-cpu@yahoogroups.com
                  > Asunto: [fpga-cpu] news
                  >
                  >
                  >
                  > Hi people,
                  >
                  >
                  > Does anybody know any site, continuously up to
                  > date,containing news about
                  > FPGAs and reconfigurable computing ?
                  >
                  >

                  Try this:

                  http://www.optimagic.com



                  Bye,
                  Jaba
                • Alessandro Noriaki Ide
                  Message 8 of 14 , Mar 14, 2002
                    Hi people,


                    Does anybody know any site, continuously up to date, containing news about
                    FPGAs and reconfigurable computing ?


                    thanks

                    =========================================

                    Alessandro Noriaki Ide
                    Mestrado em Ciencia da Computacao
                    Universidade Federal de Sao Carlos
                    http://www.dc.ufscar.br/~noriaki
                    e-mail: noriaki@...
                    uin: 46269985

                    ==========================================
                  • Reinoud
                    Message 9 of 14 , Mar 17, 2002
                      Jan,

                      Excellent writeup; here's too late and too short a reply...

                      > The challenging part of a VLIW project is the compiler.

                      ...and making compiler and hardware work together to keep hardware
                      bottlenecks like bypass (i.e. forwarding) network and register
                      file port scaling in check (for >4 ILP).

                      > To summarize my opinions, is a 2-issue machine is straightforward and
                      > even worthwhile, whereas (depending upon computation kernel) a 4+ issue
                      > machine may well bog down in its multiplexers and therefore may not make
                      > the best use of FPGA resources.

                      There are several ways around this. One obvious approach is
                      clustering (a cluster is basically a small VLIW section with full
                      bypassing and register ports, that is only loosely coupled with
                      other sections). Another less obvious approach is to explicitly
                      program the bypass busses (which allows for a huge reduction in
                      bypass _and_ register file port cost!). Hint, see:

                      http://ce-serv.et.tudelft.nl/MOVE/

                      (Disclaimer: that's where I work, although on an interesting and
                      mostly unpublished new arch variety.)

                      > REGISTER FILE DESIGN
                      >
                      > ...
                      >
                      > One $64 design question, properly answerable only with a good compiler
                      > at your side, is how to then arrange the read ports. At one extreme, if
                      > each of the four banks is entirely isolated from the other, (like a
                      > left-brain/right-brain patient with their corpus callosum cut), then you
                      > only need 2 read ports on each bank.

                      This probably won't work out well. Such isolation _does_ work quite
                      well for clusters within a wide VLIW, but within the cluster you'll
                      need more communication (we have an upcoming paper about this;).

                      > Result forwarding (register file bypass): if an operation in slot i,
                      > cycle 1, is permitted with full generality, to consume the result of an
                      > operation from slot j, cycle 0, then you need n copies of an n:1 result
                      > forwarding mux. Again, this is very expensive, so you will be sorely
                      > tempted to reduce the generality of result forwarding, or eliminate it
                      > entirely. Again, this is a design tradeoff your compiler must help you
                      > select.

                      Generality makes scheduling easier, but not much, and is certainly
                      not needed. The effects of clustering are fairly well documented,
                      and we've had excellent results with scheduling for more general
                      limited interconnect VLIWs.
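To put rough numbers on the mux cost quoted above, here's a back-of-envelope C sketch (my own illustration; the function name and the two-operands-per-slot assumption are mine). With full forwarding, every consumer operand selects among the n forwarded results plus one register file read, so total mux inputs grow roughly as n^2:

```c
/* Count total forwarding-mux inputs for an n-slot VLIW with full
 * result forwarding: each operand of each slot needs an (n+1):1
 * mux selecting among n forwarded results plus 1 register-file
 * read -- the "n copies of an n:1 mux" cost, roughly O(n^2). */
static int bypass_mux_inputs(int n_slots, int operands_per_slot)
{
    return n_slots * operands_per_slot * (n_slots + 1);
}
```

Going from 2-issue to 4-issue more than triples the total (12 vs. 40 mux inputs at two operands per slot), which is exactly the "bog down in its multiplexers" effect.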

                      > Instead, you will want to use some form of predicated execution. Some
                      > computations will set predicate bits that represent the outcomes of
                      > conditional tests. Then other instructions' execution (result
                      > write-back and other side effects) will be predicated upon these
                      > specified predicate bits.

                      This won't work well with your proposed single-write-port register
                      files. When selecting between two potential producers of some value,
                      with a single write port, you'll have to serialize the predicated
                      register writes... Which tends to hurt on a VLIW. Predicates on a
                      VLIW make most sense with multiple write ports.

                      - Reinoud

                      PS. Did you ever consider writing a book on FPGA-CPU's?
                    • Jan Gray
                      Message 10 of 14, Mar 18, 2002
                        > Hint, see:
                        > http://ce-serv.et.tudelft.nl/MOVE/

                        Ah, move machines. I will take a look, thank you.

                        > At one extreme,
                        > > if each of the four banks is entirely isolated from the
                        > other, (like a
                        > > left-brain/right-brain patient with their corpus callosum
                        > cut), then
                        > > you only need 2 read ports on each bank.
                        >
                        > This probably won't work out well. Such isolation _does_
                        > work quite well for clusters within a wide VLIW, but within
                        > the cluster you'll need more communication (we have an
                        > upcoming paper about this;).

                        I agree, I also doubt it will work well, but I was just trying to sketch
                        the extremes of possibility.

                        > > Instead, you will want to use some form of predicated
                        > execution. Some
                        > > computations will set predicate bits that represent the outcomes of
                        > > conditional tests. Then other instructions' execution (result
                        > > write-back and other side effects) will be predicated upon these
                        > > specified predicate bits.
                        >
                        > This won't work well with your proposed single-write-port
                        > register files. When selecting between two potential
                        > producers of some value, with a single write port, you'll
                        > have to serialize the predicated register writes... Which
                        > tends to hurt on a VLIW. Predicates on a VLIW make most
                        > sense with multiple write ports.

                        I advocate 1- or 2-write *data* register files for LUT RAM
                        implementations. This is not a constraint for predicate register files.
                        N-read m-write predicate register files can easily be made from
                        assemblies of flip-flops, logic to drive CEs, and logic for output
                        multiplexers.
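That flip-flop assembly can be sketched behaviorally. Here's a minimal C model (my own sketch, all names hypothetical), where the CE-driving logic becomes a per-port write-enable test and each read port is an output multiplexer:

```c
#include <stdint.h>

#define NPREGS 8  /* number of 1-bit predicate registers */

typedef struct {
    uint8_t bits[NPREGS];   /* one flip-flop per predicate */
} pred_file;

/* One write port: decode logic drives the target flip-flop's CE.
 * m write ports are just m such calls per cycle -- cheap in
 * flip-flops, unlike LUT RAM with its single native write port. */
static void pred_write(pred_file *pf, int we, unsigned idx, int val)
{
    if (we && idx < NPREGS)
        pf->bits[idx] = (uint8_t)(val != 0);
}

/* One read port: an NPREGS:1 output multiplexer. */
static int pred_read(const pred_file *pf, unsigned idx)
{
    return idx < NPREGS ? pf->bits[idx] : 0;
}
```

Two writes in the same cycle go to two independent flip-flops, so there is no port conflict to arbitrate.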

                        By the way, on my trip down to SF for ESC, I took along "Readings in
                        Computer Architecture", ed. by Mark Hill et al. It's a wonderful book,
                        most highly recommended, a compilation of 5 dozen important modern
                        computer architecture papers, well organized and set in context.
                        Apropos to this discussion is the paper "A Comparison of Full and
                        Partial Predicated Execution Support for ILP Processors" by Mahlke et
                        al, which compares fully predicated architectures to traditional
                        non-predicated superscalars retrofitted with CMOVE (if (pred) rdest =
                        rsrc) and SELECT (rdest = pred ? rsrc1 : rsrc2). This paper provides a
                        nice introduction to ILP compiler support (superblocks and if-conversion
                        etc.) and how to use that to take e.g. a 14 cycle grep inner loop down
                        to 6 cycles in a 4-issue 1-branch VLIW.
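For reference, the two retrofit primitives have simple scalar semantics. A hedged C sketch (function names are mine) following the definitions quoted above:

```c
#include <stdint.h>

/* CMOVE: if (pred) rdest = rsrc; -- write-back happens only when
 * the predicate holds, so the old destination value is kept
 * (returned) when pred is false. */
static uint32_t cmove(uint32_t rdest, uint32_t rsrc, int pred)
{
    return pred ? rsrc : rdest;
}

/* SELECT: rdest = pred ? rsrc1 : rsrc2; -- always writes,
 * choosing one of two sources. */
static uint32_t select_op(uint32_t rsrc1, uint32_t rsrc2, int pred)
{
    return pred ? rsrc1 : rsrc2;
}
```

Note CMOVE needs the old destination as an extra read operand if implemented as a plain ALU op, whereas SELECT needs three source operands; either way the register file feels the cost.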

                        By the way^2, this is the kind of compilation IA64 must do well to
                        compete with out-of-order superscalars. I suspect one of IA64's
                        greatest challenges will be the social engineering required to make
                        software developers change from
                            compile -optimized *.c
                        to
                            compile -profiling *.c
                            profile app against representative workloads
                            compile -profile-feedback-optimizing *.c
                        The latter is more difficult to get right and to institutionalize in
                        makefiles or what-have-you.

                        > PS. Did you ever consider writing a book on FPGA-CPU's?

                        Thank you very much for the compliment. '96-'00 I wrote a few book
                        outlines, and discussed the project with a leading publisher, but never
                        pushed the button.

                        "Writing is an adventure. To begin with, it is a toy and an amusement.
                        Then it becomes a mistress, then it becomes a master, then it becomes a
                        tyrant. The last phase is that just as you are about to be reconciled to
                        your servitude, you kill the monster and fling him to the public." --
                        Churchill.

                        The Circuit Cellar articles were an attempt to get the book bug out of
                        my system. Didn't work.

                        Jan Gray, Gray Research LLC
                      • Reinoud
                        Message 11 of 14, Apr 1, 2002
                          Another rather late reply (just back from vacation :-).

                          "Jan Gray" <jsgray@...> wrote:
                          > > > Instead, you will want to use some form of predicated execution. Some
                          > > > computations will set predicate bits that represent the outcomes of
                          > > > conditional tests. Then other instructions' execution (result
                          > > > write-back and other side effects) will be predicated upon these
                          > > > specified predicate bits.
                          > >
                          > > This won't work well with your proposed single-write-port
                          > > register files. When selecting between two potential
                          > > producers of some value, with a single write port, you'll
                          > > have to serialize the predicated register writes... Which
                          > > tends to hurt on a VLIW. Predicates on a VLIW make most
                          > > sense with multiple write ports.
                          >
                          > I advocate 1- or 2-write *data* register files for LUT RAM
                          > implementations. This is not a constraint for predicate register files.

                          But I _was_ talking about data register files, not predicate register
                          files. Using predicates in combination with data register files with a
                          single write port will often force serialization of execution. For
                          example, try if-converting x=(c)?a:b;. With a single write port you
                          are forced to place two assignments to x in separate instruction
                          words. Select operations may be more effective than predicates for
                          such an architecture.
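To make the serialization concrete, here's a small C sketch (my own illustration) of the two schedules for x=(c)?a:b: the predicated form issues two writes to x, which a single write port forces into two successive instruction words, while SELECT needs only one:

```c
#include <stdint.h>

/* Predicated if-conversion of x = c ? a : b: two predicated
 * writes to x. With one write port per register file slice,
 * they must land in two successive instruction words. */
static uint32_t x_predicated(uint32_t a, uint32_t b, int c, int *words)
{
    uint32_t x = 0;
    *words = 0;
    if (c)  x = a;   /* word 1: x = a when c  */
    (*words)++;
    if (!c) x = b;   /* word 2: x = b when !c */
    (*words)++;
    return x;
}

/* SELECT form: one unconditional write, one instruction word. */
static uint32_t x_select(uint32_t a, uint32_t b, int c, int *words)
{
    *words = 1;
    return c ? a : b;
}
```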

                          > "Writing is an adventure. To begin with, it is a toy and an amusement.
                          > Then it becomes a mistress, then it becomes a master, then it becomes a
                          > tyrant. The last phase is that just as you are about to be reconciled to
                          > your servitude, you kill the monster and fling him to the public." --
                          > Churchill.

                          Yeah, that's why I asked if you would consider writing one. I sure
                          wouldn't :-).

                          - Reinoud
                        • Jan Gray
                      Message 12 of 14, Apr 1, 2002
                            > But I _was_ talking about data register files, not predicate
                            > register files. Using predicates in combination with data
                            > register files with a single write port will often force
                            > serialization of execution. For example, try if-converting
                            > x=(c)?a:b;. With a single write port you are forced to place
                            > two assignments to x in separate instruction words. Select
                            > operations may be more effective than predicates for such an
                            > architecture.

                            Ah, I see what you mean (I think):
                                if (c)
                                    x = a;
                                else
                                    x = b;

                            if-converts to (predicated)

                                x = a when c
                                x = b when !c

                            In a VLIW with a general register file, you can schedule these two issue
                            slots in the same wide instruction. In a VLIW with a partitioned
                            register file as we've been discussing, we would have to schedule it in
                            two successive instructions.

                            Now we could certainly extend the ALU to do SELECT (x = pred ? a : b)
                            and still get by with 2-read 1-write register file slices.

                            Why not both? All issue slots predicated + add SELECT to the ALU
                            operation repertoire...?

                            Jan Gray, Gray Research LLC