Re: [fpga-cpu] Multiprocessors, Jan's Razor, resource sharing, and all that

  • Ben Franchuk
    Message 1 of 6, Mar 5, 2002
      Jan Gray wrote:
      >
      <snip>
      > A more thoughtful approach, achieving fractional function units by
      > careful resource sharing, would appear to yield significant PE area
      > reductions, maximizing overall multiprocessor throughput, and providing
      > the system designer with a new dimension of design flexibility.

      Remember too that the use of RISC machines over CISC processors was
      due to better effective address calculation and simple integer
      operations over the larger CISC set. A feature like 32-bit shifts
      was an advantage not because a person often shifts by, say, 27 bits,
      but rather by 1, 2, or 4 bits up or 1 bit down for simple shifts and
      effective address calculations. Also, as you get more processors on
      a chip, memory bandwidth needs to increase, so a super fast module
      may not be needed if it has to wait for memory.

      --
      Ben Franchuk - Dawn * 12/24 bit cpu *
      www.jetnet.ab.ca/users/bfranchuk/index.html
    • Reinoud
      Message 2 of 6, Mar 6, 2002
        Jan,

        This is a beautiful concept! I hope you don't mind a few arguments
        against it. :-)

        The limitations you claim for uniprocessor design exist only if you
        restrict yourself to scalar processors. For superscalar and VLIW
        designs, you have a similar freedom to ratio function unit types.
        For example, a typical superscalar design with three integer issue
        slots will support multiply on only one of those.

        Also, I have to doubt your claim of full utilization of shared
        function units (and cores) in a cluster, even if you provide them in
        a statistically perfect ratio for some application. One problem is
        that the optimal ratio varies over time. Another problem is that in
        order to use them fully, their use needs to be perfectly scheduled,
        in your case even over multiple independent instruction streams...
        That is a particularly hard problem; it's often impossible to get
        near full utilization even with just a single instruction stream.

        IMHO your argument gets a bit stressed when you suggest that things
        like cache and load-store resources might be shared. I think you'll
        find that sharing such resources among multiple instruction streams
        is both complex and costly, but that isn't even my point. The point
        is that there is a breathtakingly simple and obvious way to share all
        these resources, and keep the scheduling problem tractable: don't use
        multiple instruction streams! Instead, use a single instruction
        stream to directly control a nicely ratioed set of function units.
        This, of course, is known as VLIW...

        Sure, there are limits to useful scaling of VLIW cores, so if you
        want to scale further, it makes sense to go multiprocessor or SIMD.
        I'm not arguing against multiprocessors in general. But the function
        unit ratioing and resource sharing already exist at the uniprocessor
        level, where they can be used more efficiently.

        Finally, there may of course be applications where the relatively
        large amount of control provided by the many instruction streams in
        your approach (Sea of Cores - SoC?;) is an advantage. The challenge
        will be in finding those applications... Can you think of any?

        Regards,

        - Reinoud
      • Jan Gray
        Message 3 of 6, Mar 6, 2002
          Reinoud, thank you for your superb comments.

          > The limitations you claim for uniprocessor design exist only
          > if you restrict yourself to scalar processors. For
          > superscalar and VLIW designs, you have a similar freedom to
          > ratio function unit types.
          > For example, a typical superscalar design with three integer
          > issue slots will support multiply on only one of those.

          You're right, and I should have addressed multiple issue PEs -- but (I
          think) so am I. Even in a three issue VLIW, if you only need multiply
          one out of every ten instruction issue slots (i.e., once every 3
          or 4 cycles), maybe you should be sharing your multiply unit with
          other instances of your PE.
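
          [A back-of-the-envelope sketch of that ratio, with illustrative
          numbers of my choosing rather than anything measured: a 3-issue
          PE that fills one in ten issue slots with a multiply demands 0.3
          multiplies per cycle, so one fully pipelined multiplier accepting
          an operation per cycle can feed roughly three such PEs:]

              # Back-of-the-envelope: how many PEs can share one pipelined
              # multiplier? All numbers are illustrative assumptions.
              issue_width = 3        # issue slots per PE per cycle
              mult_fraction = 0.10   # fraction of slots that are multiplies
              demand_per_pe = issue_width * mult_fraction  # 0.3 mults/cycle/PE
              mult_throughput = 1.0  # ops/cycle, fully pipelined multiplier
              print(mult_throughput / demand_per_pe)       # ~3.3 PEs per unit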

          > Also, I have to doubt your claim of full utilization of
          > shared function units (and cores) in a cluster, even if you
          > provide them in a statistically perfect ratio for some
          > application.

          Yes, I agree. My math was sloppy -- it was a concept I was trying to
          convey.

          > One problem is that the optimal ratio varies
          > over time. Another problem is that in order to use them
          > fully, their use needs to be perfectly scheduled, in your
          > case even over multiple independent instruction streams...
          > That is a particularly hard problem, it's often impossible to
          > get near full utilization even with just a single instruction stream.

          It is not so much the perfect scheduling or utilization I was after as
          the illusion that (most of the time) you have a dedicated unit even
          though you are sharing it with other PEs. I agree that my description
          overpromises the concept. Put aside the scheduling notion; that's
          too hairy. Using statistics or queuing theory you can model how
          many shared
          resources belong with each cluster so as to bound the expected waiting
          time for a shared resource to a certain threshold.

          [By the way, I write "you can model" because *I can't* anymore -- that
          RAM went unrefreshed for too long. :-)]
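
          [For concreteness, here is a minimal sketch of the kind of model
          in question, treating the cluster as an M/M/c queue and using the
          Erlang C formula for the expected wait. The arrival and service
          rates are illustrative placeholders, and real function units have
          deterministic rather than exponential service times, so treat the
          numbers as rough estimates only:]

              # Toy M/M/c model: N PEs generate requests for a shared
              # function unit; how many copies (c) bound the expected wait?
              # All rates are illustrative assumptions, not measurements.
              from math import factorial

              def erlang_c(arrival, service, servers):
                  """P(request must wait) for an M/M/c queue (Erlang C)."""
                  a = arrival / service   # offered load in Erlangs
                  rho = a / servers       # per-server utilization
                  assert rho < 1, "offered load exceeds capacity"
                  top = a**servers / (factorial(servers) * (1 - rho))
                  bottom = sum(a**k / factorial(k) for k in range(servers)) + top
                  return top / bottom

              def expected_wait(arrival, service, servers):
                  """Mean queueing delay (cycles) before service begins."""
                  return erlang_c(arrival, service, servers) / (servers * service - arrival)

              pes = 8              # PEs in the cluster
              rate_per_pe = 0.10   # requests per cycle per PE (assumed)
              service_rate = 0.5   # completions/cycle per unit (assumed 2-cycle op)
              arrival = pes * rate_per_pe
              for c in (2, 3, 4):
                  w = expected_wait(arrival, service_rate, c)
                  print(f"{c} shared units: mean wait {w:.2f} cycles")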

          > IMHO your argument gets a bit stressed when you suggest that
          > things like cache and load-store resources might be shared.
          > I think you'll find that sharing such resources among
          > multiple instruction streams is both complex and costly, but
          > that isn't even my point.

          We'll see. :-) To some extent, it comes down to this. Some useful
          resource, done right, is too expensive to assign to each PE. Either we
          do a stripped-down and limited subset of the resource that is *just*
          affordable per PE, or we leave out the resource, and then *share* an
          instance, done right, amongst 3 or 5 or 10 PEs.

          I know from staring at it that a proper data cache, that also does byte
          and halfword alignment, and so forth, rivals the size of an austere
          processor data path, and only gets used every 2 or every 3 instructions.
          *** Particularly in an implementation fabric that is block RAM port
          constrained ***, it is very tempting to me to share one data cache
          between 2 or 3 PEs. The implementation cost of the sharing is some
          muxing and some arbitration logic, and perhaps some address-space-ID
          tags and tag checking. And as I note in the article, once you have paid
          for the muxing and the arbitration logic per PE, maybe you don't have to
          pay for it over and over again as you share additional resources.
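
          [To make the sharing cost concrete, here is a toy cycle-level
          model -- mine, purely illustrative -- of N PEs round-robin
          arbitrating for one shared cache port, each PE wanting an access
          on about one in three of its cycles, per the "every 2 or every 3
          instructions" figure above:]

              # Toy cycle-level model: N PEs share one cache port through a
              # round-robin arbiter. The access probability is an assumption.
              import random

              def simulate(n_pes, cycles=100_000, access_prob=1/3, seed=1):
                  rng = random.Random(seed)
                  pending = [False] * n_pes  # PE has a request outstanding
                  grants = stalls = 0
                  rr = 0                     # round-robin pointer
                  for _ in range(cycles):
                      for pe in range(n_pes):
                          if not pending[pe]:
                              pending[pe] = rng.random() < access_prob
                      # grant the first requester at or after the pointer
                      for i in range(n_pes):
                          pe = (rr + i) % n_pes
                          if pending[pe]:
                              pending[pe] = False
                              grants += 1
                              rr = (pe + 1) % n_pes
                              break
                      stalls += sum(pending)  # requesters left waiting stall
                  return grants / cycles, stalls / (cycles * n_pes)

              for n in (2, 3, 4):
                  util, stall = simulate(n)
                  print(f"{n} PEs: port utilization {util:.2f}, "
                        f"stall fraction {stall:.2f}")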

          In any event, my goal was to convey the wide applicability of the
          concept, and how deeply it may lead you to rethink architecture in a
          multiprocessor -- not to specifically champion this or that resource as
          something that must be shared.

          I understand and appreciate your skepticism and you may well be right!

          > The point is that there is a
          > breathtakingly simple and obvious way to share all these
          > resources, and keep the scheduling problem tractable: don't
          > use multiple instruction streams! Instead, use a single
          > instruction stream to directly control a nicely ratioed set
          > of function units. This, of course, is known as VLIW...

          Good, good push back, thank you very much. I love LIWs (see last half
          of http://www.fpgacpu.org/usenet/homebrew.html), and I have a nice
          design for a three issue LIW, with a 3x21 = 63-bit instruction
          word packed in 64 bits, percolating right here (taps noggin).
          But I have two problems with them. 1) I
          don't think VLIWs are the best fit technology-mapping and area-wise for
          an FPGA implementation. That is, I believe, for example, that a three
          issue LIW will be less efficient (MIPS/LUT) than three instances of a
          single issue RISC, even though the latter instances incur three sets of
          instruction fetchers, PC incrementers, etc. I'll try to explain that
          more, and explore the LIW design space, in a subsequent write-up. 2)
          Past about 3-issue, it seems very hard for a compiler to keep all those
          issue slots busy with useful work, so they don't scale up enough, and
          then you are back to MPs. And wider issue VLIWs need a heroic compiler
          research program, which I don't have the resources to chase. (Since all
          I have is a hammer, everything looks like a nail to me.)

          I agree that an MP of 2- or 3-issue LIWs merits serious consideration
          compared to an MP of scalar RISCs.

          > I'm not arguing against multiprocessors in general. But the
          > function unit ratioing and resource sharing already exist at
          > the uniprocessor level, where they can be used more efficiently.

          Thank you again. In my enthusiasm for the concept, I did indeed
          overlook the points you make.

          > Finally, there may of course be applications where the
          > relatively large amount of control provided by the many
          > instruction streams in your approach (Sea of Cores - SoC?;)
          > is an advantage. The challenge will be in finding those
          > applications... Can you think of any?

          [First, let me note that most of the time, I too would prefer one
          1000 MIPS processor to ten 200 MIPS processors or a hundred 50
          MIPS processors.
          That said ...]

          [Read along with me, to the sound of future patents flushing:]

          I confess that, looking at the V600E 60-way MP I described
          recently, or its logical follow-ons in V2 and so forth, these are
          paper tigers, with a lot of integer MIPS, in want of an
          application.
          Aggregate "D-MIPS" is not an application!

          I suppose my pet hand-wavy application for these concept chip-MPs is
          lexing and parsing XML and filtering that (and/or parse table
          construction for same) -- see http://www.fpgacpu.org/usenet/re.html.
          Let me set the stage for you.

          Imagine a future in which "web services" are ubiquitous -- the internet
          has evolved into a true distributed operating system, a cloud offering
          services to several billion connected devices. Imagine that the current
          leading transport candidate for internet RPC, namely SOAP -- (Simple
          Object Access Protocol, e.g. XML encoded RPC arguments and return
          values, on an HTTP transport, with interfaces described in WSDL (itself
          based upon XML Schema)) -- imagine SOAP indeed becomes the standard
          internet RPC. That's a *ton* of XML flying around. You will want your
          routers and firewalls, etc. of the future to filter, classify, route,
          etc. that XML at wire speed. That's a *ton* of ASCII lexing, parsing,
          and filtering. It's trivially parallelizable -- every second a thousand
          or a million separate HTTP sessions flash past your ports -- and
          therefore potentially a nice application for a rack full of FPGAs, most
          FPGAs implementing a 100-way parsing and classification multiprocessor.

          Jan Gray
          Gray Research LLC
        • Reinoud
            Message 4 of 6, Mar 6, 2002
            Jan,

            I very much agree that efficient large parallel systems have to
            provide the right ratio of resources. I also think that control is
            just one of the resources to share, and in many cases has to be
            shared among multiple function units for best results. From this
            point of view, using multiscalar PEs is just a natural extension of
            your proposal...

            > You're right, and I should have addressed multiple issue PEs -- but (I
            > think) so am I. Even in a three issue VLIW, if you only need multiply
            > one out of every ten instruction issue slots (3 or 4 cycles), maybe you
            > should be sharing your multiply unit with other instances of your PE.

            Yes, or maybe you should use VLIW PEs with an issue width of 10 in
            that case. For applications with a large amount of parallelism, PEs
            with an issue width of around 10 seem to be a sweet spot (beyond
            that width basic block sizes tend to become a problem and cycle times
            start to suffer). Also, at that width you can usually avoid the
            complexities of sharing resources between PEs (except memory of
            course, which presents a rather hairy problem all by itself).

            Note that there exist VLIW architectures that scale nicely to 10+
            issue widths, by using distributed register files and limited
            interconnect/bypassing. Also, note that if the amount of parallelism
            available is 'embarrassing' enough to keep lots of single-issue PEs
            busy, this parallelism usually maps well to (significantly cheaper)
            VLIW or even SIMD architectures. For example, consider how many apps
            vectorize well.

            > We'll see. :-) To some extent, it comes down to this. Some useful
            > resource, done right, is too expensive to assign to each PE. Either we
            > do a stripped-down and limited subset of the resource that is *just*
            > affordable per PE, or we leave out the resource, and then *share* an
            > instance, done right, amongst 3 or 5 or 10 PEs.

            But it seems to me that the latter will nearly always be less
            efficient than sharing said resource with several others within one
            multi-issue PE. The single, centralized control will not only be
            simpler and cheaper, but also more efficient because of the
            scheduling opportunities.

            > I know from staring at it that a proper data cache, that also does byte
            > and halfword alignment, and so forth, rivals the size of an austere
            > processor data path, and only gets used every 2 or every 3 instructions.
            > *** Particularly in an implementation fabric that is block RAM port
            > constrained ***, it is very tempting to me to share one data cache
            > between 2 or 3 PEs.

            I fully agree; however, this works just as well for a VLIW. On a
            10-issue VLIW you may have just 2 slots for load-store (and none of
            the overhead for sharing).

            > The implementation cost of the sharing is some
            > muxing and some arbitration logic, and perhaps some address-space-ID
            > tags and tag checking. And as I note in the article, once you have paid
            > for the muxing and the arbitration logic per PE, maybe you don't have to
            > pay for it over and over again as you share additional resources.

            Wouldn't you like to be able to use multiple shared resources
            independently (say, one PE accesses the multiplier, another the
            barrel shifter, and a third shared memory, in the same cycle)?
            Otherwise you'll be severely limiting the utilization of these
            expensive resources! Implementing parallel access to these is much
            simpler (and cheaper) when there is a single control source...

            > In any event, my goal was to convey the wide applicability of the
            > concept, and how deeply it may lead you to rethink architecture in a
            > multiprocessor -- not to specifically champion this or that resource as
            > something that must be shared.

            And I'm trying to make you see control as yet another resource that
            can, and often should, be shared - to the point where it obviates the
            need for sharing most other resources.

            > 1) I don't think VLIWs are the best fit technology-mapping and
            > area-wise for an FPGA implementation. That is, I believe, for
            > example, that a three issue LIW will be less efficient (MIPS/LUT)
            > than three instances of a single issue RISC, even though the latter
            > instances incur three sets of instruction fetchers, PC incrementers,
            > etc. I'll try to explain that more, and explore the LIW design
            > space, in a subsequent write-up.

            I'm looking forward to rebutting that future write-up ;-). Note that
            the particular variety of VLIW that I think has the most potential
            uses explicitly programmed bypasses, which leads to a huge reduction
            in bypass (i.e. bus/selector) cost for wide configurations.

            > 2) Past about 3-issue, it seems very hard for a compiler to keep all
            > those issue slots busy with useful work, so they don't scale up
            > enough,

            This depends on how much of the parallelism can be extracted as
            instruction-level parallelism (ILP). Most applications with abundant
            parallelism map very well to 10+ issue widths (VLIW or SIMD/vector),
            at least if coded reasonably. The 3-issue ILP limit you mention is
            for essentially *sequential* code.

            For that matter, are you aware of any compiler that can keep more
            than 3 _PEs_ busy? (And if you feel it's okay to manually
            parallelize code for multiple single-issue PEs, then it also seems
            reasonable to manually schedule VLIWs...)

            > and then you are back to MPs.
            >
            > And wider issue VLIWs need a heroic compiler research program, which
            > I don't have the resources to chase. (Since all I have is a hammer,
            > everything looks like a nail to me.)

            Well... What if your PEs have to communicate/synchronize a lot?
            Many apps require that to be able to use the parallelism. Now,
            within a single instruction stream, synchronization is implicit and
            communication is easy and cheap. So with VLIW PEs, you can go a long
            way by partitioning for minimum communication, and scheduling the
            operations within a node. Yes, that's hard, but is it harder than
            targeting your massively parallel shared-memory MP? Do you have
            a compiler which can target one of your shared-memory clusters
            (from your regular sequential HLL source)? Isn't this even
            _harder_ than
            scheduling for VLIW? Aren't you just accepting that you'll have to
            program your intra-cluster communication explicitly, and would it not
            be fair, then, to accept manual scheduling for a VLIW PE also?

            Your approach is beautiful and simple in concept. However, I'm not
            buying your arguments that it is particularly efficient or easy to
            use :-). That's true only for apps which map trivially onto that
            particular structure, but then the same could be said for VLIWs or
            any other programmable structure...

            On a final note, there are several VLIW compilers/schedulers more or
            less freely available, each with its particular limitations of course
            (I don't have up-to-date pointers, but can ask a colleague for them
            if you like).

            > I suppose my pet hand-wavy application for these concept chip-MPs is
            > lexing and parsing XML and filtering that (and/or parse table
            > construction for same) -- see http://www.fpgacpu.org/usenet/re.html.

            Very interesting indeed!

            Best regards,

            - Reinoud