Re: [fpga-cpu] Multiprocessors, Jan's Razor, resource sharing, and all that

  • Ben Franchuk
    Message 1 of 6, Mar 5, 2002
      Jan Gray wrote:
      >
      <snip>
      > A more thoughtful approach, achieving fractional function units by
      > careful resource sharing, would appear to yield significant PE area
      > reductions, maximizing overall multiprocessor throughput, and providing
      > the system designer with a new dimension of design flexibility.

      Remember too that the use of RISC machines over CISC processors was
      due to better effective-address calculation and simple integer
      operations over the larger CISC set. A feature like 32-bit shifts
      was an advantage not because a person often shifts, say, 27 bits,
      but rather 1, 2, or 4 bits up or 1 bit down for simple shifts and
      effective-address calculations. Also, as you get more processors on
      a chip, memory bandwidth needs to increase, so a super-fast module
      may not be needed if it has to wait for memory.
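
      For example, the common case is the small constant shift that scales
      an array index into a byte offset -- a toy sketch, names made up:

          # effective address of a[i] for 4-byte elements: a 2-bit shift,
          # not a 27-bit one
          def ea(base, i):
              return base + (i << 2)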

      --
      Ben Franchuk - Dawn * 12/24 bit cpu *
      www.jetnet.ab.ca/users/bfranchuk/index.html
    • Jan Gray
      Message 2 of 6, Mar 6, 2002
        Reinoud, thank you for your superb comments.

        > The limitations you claim for uniprocessor design exist only
        > if you restrict yourself to scalar processors. For
        > superscalar and VLIW designs, you have a similar freedom to
        > ratio function unit types.
        > For example, a typical superscalar design with three integer
        > issue slots will support multiply on only one of those.

        You're right, and I should have addressed multiple issue PEs -- but (I
        think) so am I. Even in a three issue VLIW, if you only need multiply
        one out of every ten instruction issue slots (3 or 4 cycles), maybe you
        should be sharing your multiply unit with other instances of your PE.

        > Also, I have to doubt your claim of full utilization of
        > shared function units (and cores) in a cluster, even if you
        > provide them in a statistically perfect ratio for some
        > application.

        Yes, I agree. My math was sloppy -- it was a concept I was trying to
        convey.

        > One problem is that the optimal ratio varies
        > over time. Another problem is that in order to use them
        > fully, their use needs to be perfectly scheduled, in your
        > case even over multiple independent instruction streams...
        > That is a particularly hard problem, it's often impossible to
        > get near full utilization even with just a single instruction stream.

        It is not so much the perfect scheduling or utilization I was after as
        the illusion that (most of the time) you have a dedicated unit even
        though you are sharing it with other PEs. I agree that my description
        overpromises the concept. Put aside the scheduling notion; that's too
        hairy. Using statistics or queuing theory you can model how many shared
        resources belong with each cluster so as to bound the expected waiting
        time for a shared resource to a certain threshold.

        [By the way, I write "you can model" because *I can't* anymore -- that
        RAM went unrefreshed for too long. :-)]
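
        Concretely, here is the sort of model I mean -- an M/M/c (Erlang-C)
        sketch, every number an illustrative assumption:

            from math import factorial

            def erlang_c(c, a):
                # Probability a request must wait, for c shared units
                # offered a total traffic of a Erlangs (a = lambda/mu).
                rho = a / c  # per-unit utilization; requires rho < 1
                top = (a ** c / factorial(c)) / (1 - rho)
                return top / (sum(a ** k / factorial(k)
                                  for k in range(c)) + top)

            def mean_wait(c, lam, mu):
                # Mean queuing delay (cycles) in an M/M/c queue.
                return erlang_c(c, lam / mu) / (c * mu - lam)

            # Say 8 PEs each issue a multiply every 10 cycles (total
            # lambda = 0.8/cycle) and a multiplier retires one request
            # per 4 cycles (mu = 0.25): how many shared multipliers
            # bound the mean wait to under a cycle?
            for c in range(4, 9):
                print(c, round(mean_wait(c, 0.8, 0.25), 2))

        With those assumed rates, five shared multipliers already hold the
        mean wait under a cycle for all eight PEs.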

        > IMHO your argument gets a bit stressed when you suggest that
        > things like cache and load-store resources might be shared.
        > I think you'll find that sharing such resources among
        > multiple instruction streams is both complex and costly, but
        > that isn't even my point.

        We'll see. :-) To some extent, it comes down to this. Some useful
        resource, done right, is too expensive to assign to each PE. Either we
        build a stripped-down, limited subset of the resource that is *just*
        affordable per PE, or we leave out the resource, and then *share* an
        instance, done right, amongst 3 or 5 or 10 PEs.

        I know from staring at it that a proper data cache, one that also does
        byte and halfword alignment and so forth, rivals the size of an austere
        processor data path, and only gets used every 2 or 3 instructions.
        *** Particularly in an implementation fabric that is block RAM port
        constrained ***, it is very tempting to me to share one data cache
        between 2 or 3 PEs. The implementation cost of the sharing is some
        muxing and some arbitration logic, and perhaps some address-space-ID
        tags and tag checking. And as I note in the article, once you have paid
        for the muxing and the arbitration logic per PE, maybe you don't have to
        pay for it over and over again as you share additional resources.
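
        To give a feel for how small that arbitration is, here is a
        behavioral sketch -- Python standing in for a few LUTs of logic,
        names made up:

            def arbitrate(requests, last_grant):
                # Round-robin: grant at most one PE per cycle, rotating
                # priority starting just after last cycle's winner.
                n = len(requests)
                for i in range(1, n + 1):
                    pe = (last_grant + i) % n
                    if requests[pe]:
                        return pe
                return None

            # Three PEs share one cache port; PE 1 won last cycle:
            print(arbitrate([True, False, True], last_grant=1))  # -> 2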

        In any event, my goal was to convey the wide applicability of the
        concept, and how deeply it may lead you to rethink architecture in a
        multiprocessor -- not to specifically champion this or that resource as
        something that must be shared.

        I understand and appreciate your skepticism, and you may well be right!

        > The point is that there is a
        > breathtakingly simple and obvious way to share all these
        > resources, and keep the scheduling problem tractable: don't
        > use multiple instruction streams! Instead, use a single
        > instruction stream to directly control a nicely ratioed set
        > of function units. This, of course, is known as VLIW...

        Good, good push back, thank you very much. I love LIWs (see last half
        of http://www.fpgacpu.org/usenet/homebrew.html), and I have a nice
        design for a three-issue, 64-bit-instruction-word (three 21-bit ops) LIW percolating
        right here (taps noggin). But I have two problems with them. 1) I
        don't think VLIWs are the best fit technology-mapping and area-wise for
        an FPGA implementation. That is, I believe, for example, that a three
        issue LIW will be less efficient (MIPS/LUT) than three instances of a
        single issue RISC, even though the latter instances incur three sets of
        instruction fetchers, PC incrementers, etc. I'll try to explain that
        more, and explore the LIW design space, in a subsequent write-up. 2)
        Past about 3-issue, it seems very hard for a compiler to keep all those
        issue slots busy with useful work, so they don't scale up enough, and
        then you are back to MPs. And wider issue VLIWs need a heroic compiler
        research program, which I don't have the resources to chase. (Since all
        I have is a hammer, everything looks like a nail to me.)
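
        (To pin down the MIPS/LUT metric, here is the shape of the
        comparison I have in mind, with purely made-up numbers -- not
        measurements of any design:

            # Illustrative figures only.
            scalar_luts, scalar_mips = 1000, 50   # one austere 1-issue RISC
            liw_luts, liw_mips = 3600, 130        # one 3-issue LIW

            print(3 * scalar_mips / (3 * scalar_luts))  # MIPS/LUT, 3 RISCs
            print(liw_mips / liw_luts)                  # MIPS/LUT, 1 LIW

        The conjecture is that the LIW's wider register file, result buses,
        and bypass muxes grow its LUTs faster than its delivered MIPS.)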

        I agree that an MP of 2- or 3-issue LIWs merits serious consideration
        compared to an MP of scalar RISCs.

        > I'm not arguing against multiprocessors in general. But the
        > function unit ratioing and resource sharing already exist at
        > the uniprocessor level, where they can be used more efficiently.

        Thank you again. In my enthusiasm for the concept, I did indeed
        overlook the points you make.

        > Finally, there may of course be applications where the
        > relatively large amount of control provided by the many
        > instruction streams in your approach (Sea of Cores - SoC?;)
        > are an advantage. The challenge will be in finding those
        > applications... Can you think of any?

        [First, let me note that most of the time, I too would prefer one
        1000-MIPS processor to ten 200-MIPS processors or a hundred 50-MIPS
        processors.
        That said ...]

        [Read along with me, to the sound of future patents flushing:]

        I confess that, looking at the V600E 60-way MP I described recently, or
        its logical follow-ons in V2 and so forth, these are paper tigers, with
        a lot of integer MIPS, in want of an application. Aggregate "D-MIPS" is
        not an application!

        I suppose my pet hand-wavy application for these concept chip-MPs is
        lexing and parsing XML and filtering that (and/or parse table
        construction for same) -- see http://www.fpgacpu.org/usenet/re.html.
        Let me set the stage for you.

        Imagine a future in which "web services" are ubiquitous -- the internet
        has evolved into a true distributed operating system, a cloud offering
        services to several billion connected devices. Imagine that the current
        leading transport candidate for internet RPC, namely SOAP (Simple
        Object Access Protocol: XML-encoded RPC arguments and return values on
        an HTTP transport, with interfaces described in WSDL, itself based upon
        XML Schema), indeed becomes the standard internet RPC. That's a *ton*
        of XML flying around. You will want your
        routers and firewalls, etc. of the future to filter, classify, route,
        etc. that XML at wire speed. That's a *ton* of ASCII lexing, parsing,
        and filtering. It's trivially parallelizable -- every second a thousand
        or a million separate HTTP sessions flash past your ports -- and
        therefore potentially a nice application for a rack full of FPGAs, most
        FPGAs implementing a 100-way parsing and classification multiprocessor.
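
        (A toy sketch of the per-session work -- the element names and the
        shallow classification are illustrative only:

            import re

            # Opening/closing element names: the shallow lexing a small
            # PE might do per HTTP session at wire speed.
            TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)")

            def classify(xml, watch=frozenset({"Envelope", "Body", "Fault"})):
                # Which watched SOAP-ish elements appear in this fragment?
                return {m.group(2) for m in TOKEN.finditer(xml)
                        if m.group(2) in watch}

            # Sessions are independent, so work distributes trivially:
            sessions = ["<Envelope><Body>...</Body></Envelope>", "<Fault/>"]
            print([classify(s) for s in sessions])

        Each session is a separate instruction stream's worth of work,
        which is exactly where a 100-way MP earns its keep.)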

        Jan Gray
        Gray Research LLC
      • Reinoud
        Message 3 of 6, Mar 6, 2002
          Jan,

          I very much agree that efficient large parallel systems have to
          provide the right ratio of resources. I also think that control is
          just one of the resources to share, and in many cases has to be
          shared among multiple function units for best results. From this
          point of view, using multiscalar PEs is just a natural extension of
          your proposal...

          > You're right, and I should have addressed multiple issue PEs -- but (I
          > think) so am I. Even in a three issue VLIW, if you only need multiply
          > one out of every ten instruction issue slots (3 or 4 cycles), maybe you
          > should be sharing your multiply unit with other instances of your PE.

          Yes, or maybe you should use VLIW PEs with an issue width of 10 in
          that case. For applications with a large amount of parallelism, PEs
          with an issue width of around 10 seem to be a sweet spot (beyond
          that width basic block sizes tend to become a problem and cycle times
          start to suffer). Also, at that width you can usually avoid the
          complexities of sharing resources between PEs (except memory of
          course, which presents a rather hairy problem all by itself).

          Note that there exist VLIW architectures that scale nicely to 10+
          issue widths, by using distributed register files and limited
          interconnect/bypassing. Also, note that if the amount of parallelism
          available is 'embarrassing' enough to keep lots of single-issue PEs
          busy, this parallelism usually maps well to (significantly cheaper)
          VLIW or even SIMD architectures. For example, consider how many apps
          vectorize well.
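
          For instance -- a toy sketch, with NumPy standing in for a
          vector unit:

              import numpy as np

              def saxpy_scalar(a, x, y):
                  # what each of many tiny PEs would chew on,
                  # element by element
                  return [a * xi + yi for xi, yi in zip(x, y)]

              def saxpy_vector(a, x, y):
                  # the same work as one wide vector operation
                  return a * x + y

              x, y = np.arange(8.0), np.ones(8)
              assert list(saxpy_vector(2.0, x, y)) == saxpy_scalar(2.0, x, y)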

          > We'll see. :-) To some extent, it comes down to this. Some useful
          > resource, done right, is too expensive to assign to each PE. Either we
          > build a stripped-down, limited subset of the resource that is *just*
          > affordable per PE, or we leave out the resource, and then *share* an
          > instance, done right, amongst 3 or 5 or 10 PEs.

          But it seems to me that the latter will nearly always be less
          efficient than sharing said resource with several others within one
          multi-issue PE. The single, centralized control will not only be
          simpler and cheaper, but also more efficient because of the
          scheduling opportunities.

          > I know from staring at it that a proper data cache, one that also
          > does byte and halfword alignment and so forth, rivals the size of an
          > austere processor data path, and only gets used every 2 or 3
          > instructions.
          > *** Particularly in an implementation fabric that is block RAM port
          > constrained ***, it is very tempting to me to share one data cache
          > between 2 or 3 PEs.

          I fully agree; however, this works just as well for a VLIW. On a
          10-issue VLIW you may have just 2 slots for load-store (and none of
          the overhead for sharing).

          > The implementation cost of the sharing is some
          > muxing and some arbitration logic, and perhaps some address-space-ID
          > tags and tag checking. And as I note in the article, once you have paid
          > for the muxing and the arbitration logic per PE, maybe you don't have to
          > pay for it over and over again as you share additional resources.

          Wouldn't you like to be able to use multiple shared resources
          independently (say, one PE accesses the multiplier, another the
          barrel shifter, and a third shared memory, in the same cycle)?
          Otherwise you'll be severely limiting the utilization of these
          expensive resources! Implementing parallel access to these is much
          simpler (and cheaper) when there is a single control source...
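
          A toy illustration of what I mean -- the slots and encodings are
          made up:

              from dataclasses import dataclass

              @dataclass
              class VliwWord:
                  # One instruction word steers all the shared units at
                  # once; the scheduler, not run-time arbitration,
                  # resolves conflicts.
                  mul_slot: str    # e.g. "mul r3, r1, r2"
                  shift_slot: str  # e.g. "sll r5, r4, 2"
                  mem_slot: str    # e.g. "lw r7, 0(r6)"

              # All three expensive units busy in the same cycle, with
              # zero arbitration logic:
              word = VliwWord("mul r3, r1, r2", "sll r5, r4, 2",
                              "lw r7, 0(r6)")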

          > In any event, my goal was to convey the wide applicability of the
          > concept, and how deeply it may lead you to rethink architecture in a
          > multiprocessor -- not to specifically champion this or that resource as
          > something that must be shared.

          And I'm trying to make you see control as yet another resource that
          can, and often should, be shared - to the point where it obviates the
          need for sharing most other resources.

          > 1) I don't think VLIWs are the best fit technology-mapping and
          > area-wise for an FPGA implementation. That is, I believe, for
          > example, that a three issue LIW will be less efficient (MIPS/LUT)
          > than three instances of a single issue RISC, even though the latter
          > instances incur three sets of instruction fetchers, PC incrementers,
          > etc. I'll try to explain that more, and explore the LIW design
          > space, in a subsequent write-up.

          I'm looking forward to rebutting that future write-up ;-). Note that
          the particular variety of VLIW that I think has the most potential
          uses explicitly programmed bypasses, which leads to a huge reduction
          in bypass (i.e. bus/selector) cost for wide configurations.

          > 2) Past about 3-issue, it seems very hard for a compiler to keep all
          > those issue slots busy with useful work, so they don't scale up
          > enough,

          This depends on how much of the parallelism can be extracted as
          instruction-level parallelism (ILP). Most applications with abundant
          parallelism map very well to 10+ issue widths (VLIW or SIMD/vector),
          at least if coded reasonably. The 3-issue ILP limit you mention is
          for essentially *sequential* code.

          For that matter, are you aware of any compiler that can keep more
          than 3 _PEs_ busy? (And if you feel it's okay to manually
          parallelize code for multiple single-issue PEs, then it also seems
          reasonable to manually schedule VLIWs...)

          > and then you are back to MPs.
          >
          > And wider issue VLIWs need a heroic compiler research program, which
          > I don't have the resources to chase. (Since all I have is a hammer,
          > everything looks like a nail to me.)

          Well... What if your PEs have to communicate/synchronize a lot?
          Many apps require that to be able to use the parallelism. Now,
          within a single instruction stream, synchronization is implicit and
          communication is easy and cheap. So with VLIW PEs, you can go a long
          way by partitioning for minimum communication, and scheduling the
          operations within a node. Yes, that's hard, but is it harder than
          targeting your massively parallel shared-memory MP? Do you have a
          compiler which can target one of your shared-memory clusters (from
          your regular sequential HLL source)? Isn't this even _harder_ than
          scheduling for VLIW? Aren't you just accepting that you'll have to
          program your intra-cluster communication explicitly, and would it not
          be fair, then, to accept manual scheduling for a VLIW PE also?

          Your approach is beautiful and simple in concept. However, I'm not
          buying your arguments that it is particularly efficient or easy to
          use :-). That's true only for apps which map trivially onto that
          particular structure, but then the same could be said for VLIWs or
          any other programmable structure...

          On a final note, there are several VLIW compilers/schedulers more or
          less freely available, each with its particular limitations of course
          (I don't have up-to-date pointers, but can ask a colleague for them
          if you like).

          > I suppose my pet hand-wavy application for these concept chip-MPs is
          > lexing and parsing XML and filtering that (and/or parse table
          > construction for same) -- see http://www.fpgacpu.org/usenet/re.html.

          Very interesting indeed!

          Best regards,

          - Reinoud