Re: [fpga-cpu] Multiprocessors, Jan's Razor, resource sharing, and all that
- Jan Gray wrote:
> A more thoughtful approach, achieving fractional function units by
> careful resource sharing, would appear to yield significant PE area
> reductions, maximizing overall multiprocessor throughput, and providing
> the system designer with a new dimension of design flexibility.

Remember too that the choice of RISC machines over CISC processors was
due to effective address calculation and simple integer operations
winning out over the CISC instruction set. A feature like 32-bit shifts
was an advantage not because one often shifts by, say, 27 bits, but
rather by 1, 2, or 4 bits up, or 1 bit down, for shifts and effective
address calculations. Also, as you get more processors on a chip,
bandwidth needs to increase, so a super-fast module may not be needed
if it has to wait for memory.
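A toy sketch of that point about small shifts (illustrative only; the
element sizes and names are assumptions, not anything from the post):

    # Scaled-index effective address calculation: the shifter almost
    # always shifts by small constants (log2 of the element size),
    # not by arbitrary amounts like 27 bits.
    def effective_address(base, index, elem_size):
        shift = {2: 1, 4: 2, 8: 3}[elem_size]  # 2-, 4-, 8-byte elements
        return base + (index << shift)

    print(hex(effective_address(0x1000, 5, 4)))  # 0x1000 + 5*4 = 0x1014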
Ben Franchuk - Dawn * 12/24 bit cpu *
- Renoid, thank you for your superb comments.
> The limitations you claim for uniprocessor design exist only
> if you restrict yourself to scalar processors. For
> superscalar and VLIW designs, you have a similar freedom to
> ratio function unit types.
> For example, a typical superscalar design with three integer
> issue slots will support multiply on only one of those.

You're right, and I should have addressed multiple issue PEs -- but (I
think) so am I. Even in a three issue VLIW, if you only need multiply
in one out of every ten instruction issue slots (once every 3 or 4
cycles), maybe you should be sharing your multiply unit with other
instances of your PE.
> Also, I have to doubt your claim of full utilization of
> shared function units (and cores) in a cluster, even if you
> provide them in a statistically perfect ratio for some
> application. One problem is that the optimal ratio varies
> over time. Another problem is that in order to use them
> fully, their use needs to be perfectly scheduled, in your
> case even over multiple independent instruction streams...
> That is a particularly hard problem; it's often impossible to
> get near full utilization even with just a single instruction stream.

Yes, I agree. My math was sloppy -- it was a concept I was trying to
convey. It is not so much the perfect scheduling or utilization I was
after as the illusion that (most of the time) you have a dedicated unit
even though you are sharing it with other PEs. I agree that my
description over-promises the concept. Put aside the scheduling notion;
that's too hairy. Using statistics or queuing theory you can model how
many shared resources belong with each cluster so as to bound the
expected waiting time for a shared resource to a certain threshold.

[By the way, I write "you can model" because *I can't* anymore -- that
RAM went unrefreshed for too long. :-)]
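A minimal sketch of that sizing exercise (illustrative only: the request
rate, service time, and the M/M/1 queuing model are all assumptions,
not anything from the article):

    # N PEs share one multiplier; each PE issues a multiply in 1 of 10
    # issue slots; the multiplier takes 1 cycle per op. Model the shared
    # unit as an M/M/1 queue (assumes Poisson-ish arrivals).
    def expected_wait(n_pes, req_rate_per_pe=0.1, service_rate=1.0):
        lam = n_pes * req_rate_per_pe      # total arrivals (ops/cycle)
        rho = lam / service_rate           # utilization of the shared unit
        if rho >= 1.0:
            return float("inf")            # saturated: queue grows forever
        return rho / (service_rate - lam)  # mean wait in queue (cycles)

    for n in range(1, 11):
        print(n, expected_wait(n))
    # With these made-up rates, about 5 PEs can share one multiplier
    # while keeping the expected wait at or under 1 cycle.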
> IMHO your argument gets a bit stressed when you suggest that
> things like cache and load-store resources might be shared.
> I think you'll find that sharing such resources among
> multiple instruction streams is both complex and costly, but
> that isn't even my point.

We'll see. :-) To some extent, it comes down to this. Some useful
resource, done right, is too expensive to assign to each PE. Either we
do a stripped down and limited subset of the resource that is *just*
affordable per PE, or we leave out the resource, and then *share* an
instance, done right, amongst 3 or 5 or 10 PEs.
I know from staring at it that a proper data cache, one that also does
byte and halfword alignment and so forth, rivals the size of an austere
processor datapath, and only gets used once every 2 or 3 instructions.
*** Particularly in an implementation fabric that is block RAM port
constrained ***, it is very tempting to me to share one data cache
between 2 or 3 PEs. The implementation cost of the sharing is some
muxing and some arbitration logic, and perhaps some address-space-ID
tags and tag checking. And as I note in the article, once you have paid
for the muxing and the arbitration logic per PE, maybe you don't have to
pay for it over and over again as you share additional resources.
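Here is a minimal behavioral sketch of the sort of arbitration I mean
(the round-robin policy and the names are just for illustration, not a
committed design):

    # N PEs contend for one shared cache port; one request is granted
    # per cycle, the others stall. Rotating priority avoids starvation.
    class SharedPortArbiter:
        def __init__(self, n_pes):
            self.n_pes = n_pes
            self.last = n_pes - 1              # last PE granted

        def grant(self, requests):
            """requests: one bool per PE; returns winning PE or None."""
            for i in range(1, self.n_pes + 1):
                pe = (self.last + i) % self.n_pes
                if requests[pe]:
                    self.last = pe
                    return pe
            return None

    arb = SharedPortArbiter(3)
    print(arb.grant([True, False, True]))   # grants PE 0 first...
    print(arb.grant([True, False, True]))   # ...then PE 2

The grant logic itself is tiny; the data muxes dominate the cost.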
In any event, my goal was to convey the wide applicability of the
concept, and how deeply it may lead you to rethink architecture in a
multiprocessor -- not to specifically champion this or that resource as
something that must be shared.
I understand and appreciate your skepticism and you may well be right!
> The point is that there is a
> breathtakingly simple and obvious way to share all these
> resources, and keep the scheduling problem tractable: don't
> use multiple instruction streams! Instead, use a single
> instruction stream to directly control a nicely ratioed set
> of function units. This, of course, is known as VLIW...

Good, good push back, thank you very much. I love LIWs (see the last
half of http://www.fpgacpu.org/usenet/homebrew.html), and I have a nice
design for a three issue LIW (3x21-bit operations in a 64-bit
instruction word) percolating right here (taps noggin). But I have two
problems with them. 1) I don't think VLIWs are the best fit
technology-mapping- and area-wise for an FPGA implementation. That is,
I believe, for example, that a three issue LIW will be less efficient
(MIPS/LUT) than three instances of a single issue RISC, even though the
latter instances incur three sets of instruction fetchers, PC
incrementers, etc. I'll try to explain that more, and explore the LIW
design space, in a subsequent write-up. 2) Past about 3-issue, it seems
very hard for a compiler to keep all those issue slots busy with useful
work, so they don't scale up enough, and then you are back to MPs. And
wider issue VLIWs need a heroic compiler research program, which I
don't have the resources to chase. (Since all I have is a hammer,
everything looks like a nail to me.)
I agree that an MP of 2- or 3-issue LIWs merits serious consideration
compared to an MP of scalar RISCs.
> I'm not arguing against multiprocessors in general. But the
> function unit ratioing and resource sharing already exist at
> the uniprocessor level, where they can be used more efficiently.

Thank you again. In my enthusiasm for the concept, I did indeed
overlook the points you make.
> Finally, there may of course be applications where the
> relatively large amount of control provided by the many
> instruction streams in your approach (Sea of Cores - SoC?;)
> are an advantage. The challenge will be in finding those
> applications... Can you think of any?

[First, let me note that most of the time, I too would prefer one
1000-MIPS processor to ten 200-MIPS processors or a hundred 50-MIPS
processors. That said ...]

[Read along with me, to the sound of future patents flushing:]

I confess that, looking at the V600E 60-way MP I described recently, or
its logical follow-ons in V2 and so forth, these are paper tigers, with
a lot of integer MIPS, in want of an application. Aggregate "D-MIPS" is
not an application!
I suppose my pet hand-wavy application for these concept chip-MPs is
lexing and parsing XML and filtering that (and/or parse table
construction for same) -- see http://www.fpgacpu.org/usenet/re.html.
Let me set the stage for you.
Imagine a future in which "web services" are ubiquitous -- the internet
has evolved into a true distributed operating system, a cloud offering
services to several billion connected devices. Imagine that the current
leading transport candidate for internet RPC, namely SOAP (Simple
Object Access Protocol: XML-encoded RPC arguments and return values, on
an HTTP transport, with interfaces described in WSDL, itself based upon
XML Schema) -- imagine SOAP indeed becomes the standard
internet RPC. That's a *ton* of XML flying around. You will want your
routers and firewalls, etc. of the future to filter, classify, route,
etc. that XML at wire speed. That's a *ton* of ASCII lexing, parsing,
and filtering. It's trivially parallelizable -- every second a thousand
or a million separate HTTP sessions flash past your ports -- and
therefore potentially a nice application for a rack full of FPGAs, most
FPGAs implementing a 100-way parsing and classification multiprocessor.
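To give a flavor of the per-PE work, here is a toy sketch (mine alone,
and nothing like a real wire-speed design) of a table-driven lexer,
with one independent session dispatched to each PE:

    # Toy XML-ish lexer: the kind of small, regular state machine each
    # PE could run over one HTTP session's byte stream.
    def lex(data):
        tokens, state, start = [], "TEXT", 0
        for i, ch in enumerate(data):
            if state == "TEXT" and ch == "<":
                if i > start:
                    tokens.append(("TEXT", data[start:i]))
                state, start = "TAG", i
            elif state == "TAG" and ch == ">":
                tokens.append(("TAG", data[start:i + 1]))
                state, start = "TEXT", i + 1
        return tokens

    # One session per PE: embarrassingly parallel.
    sessions = ["<a>hi</a>", "<soap:Body>...</soap:Body>"]
    for pe, s in enumerate(sessions):
        print("PE", pe, lex(s))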
Gray Research LLC
- Renoid replied:

I very much agree that efficient large parallel systems have to
provide the right ratio of resources. I also think that control is
just one of the resources to share, and in many cases has to be
shared among multiple function units for best results. From this
point of view, using multiscalar PEs is just a natural extension of
your concept.
> You're right, and I should have addressed multiple issue PEs -- but (I
> think) so am I. Even in a three issue VLIW, if you only need multiply
> in one out of every ten instruction issue slots (once every 3 or 4
> cycles), maybe you should be sharing your multiply unit with other
> instances of your PE.

Yes, or maybe you should use VLIW PEs with an issue width of 10 in
that case. For applications with a large amount of parallelism, PEs
with an issue width of around 10 seem to be a sweet spot (beyond that
width, basic block sizes tend to become a problem and cycle times start
to suffer). Also, at that width you can usually avoid the complexities
of sharing resources between PEs (except memory of course, which
presents a rather hairy problem all by itself).

Note that there exist VLIW architectures that scale nicely to 10+
issue widths, by using distributed register files and limited
interconnect/bypassing. Also, note that if the amount of parallelism
available is 'embarrassing' enough to keep lots of single-issue PEs
busy, this parallelism usually maps well to (significantly cheaper)
VLIW or even SIMD architectures. For example, consider how many apps
> We'll see. :-) To some extent, it comes down to this. Some useful
> resource, done right, is too expensive to assign to each PE. Either we
> do a stripped down and limited subset of the resource that is *just*
> affordable per PE, or we leave out the resource, and then *share* an
> instance, done right, amongst 3 or 5 or 10 PEs.

But it seems to me that the latter will nearly always be less
efficient than sharing said resource with several others within one
multi-issue PE. The single, centralized control will not only be
simpler and cheaper, but also more efficient because of the
> I know from staring at it that a proper data cache, one that also does
> byte and halfword alignment and so forth, rivals the size of an austere
> processor datapath, and only gets used once every 2 or 3 instructions.
> *** Particularly in an implementation fabric that is block RAM port
> constrained ***, it is very tempting to me to share one data cache
> between 2 or 3 PEs.

I fully agree; however, this works just as well for a VLIW. On a
10-issue VLIW you may have just 2 slots for load-store (and none of
the overhead for sharing).
> The implementation cost of the sharing is some
> muxing and some arbitration logic, and perhaps some address-space-ID
> tags and tag checking. And as I note in the article, once you have paid
> for the muxing and the arbitration logic per PE, maybe you don't have to
> pay for it over and over again as you share additional resources.

Wouldn't you like to be able to use multiple shared resources
independently (say, one PE accesses the multiplier, another the
barrel shifter, and a third shared memory, in the same cycle)?
Otherwise you'll be severely limiting the utilization of these
expensive resources! Implementing parallel access to these is much
simpler (and cheaper) when there is a single control source...
> In any event, my goal was to convey the wide applicability of the
> concept, and how deeply it may lead you to rethink architecture in a
> multiprocessor -- not to specifically champion this or that resource as
> something that must be shared.

And I'm trying to make you see control as yet another resource that
can, and often should, be shared -- to the point where it obviates the
need for sharing most other resources.
> 1) I don't think VLIWs are the best fit technology-mapping- and
> area-wise for an FPGA implementation. That is, I believe, for
> example, that a three issue LIW will be less efficient (MIPS/LUT)
> than three instances of a single issue RISC, even though the latter
> instances incur three sets of instruction fetchers, PC incrementers,
> etc. I'll try to explain that more, and explore the LIW design
> space, in a subsequent write-up.

I'm looking forward to rebutting that future write-up ;-). Note that
the particular variety of VLIW that I think has the most potential
uses explicitly programmed bypasses, which leads to a huge reduction
in bypass (i.e. bus/selector) cost for wide configurations.
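As a rough back-of-the-envelope (the counting model below is mine and
deliberately simplified): in a conventional W-issue machine, each of
the two operand inputs per slot may need to select among all W results
plus a register file port, so bypass mux inputs grow roughly with W^2;
programmed bypasses let you build only the few buses the code uses.

    # Illustrative bypass-cost arithmetic (assumed model, not data):
    # full bypass = 2 operand muxes per slot, each selecting among
    # W forwarded results + 1 register file read port.
    def full_bypass_mux_inputs(width):
        return width * 2 * (width + 1)   # grows ~O(W^2)

    for w in (3, 10):
        print(w, full_bypass_mux_inputs(w))
    # 3-issue: 24 mux inputs; 10-issue: 220 -- about 9x the muxing
    # for a bit over 3x the issue width.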
> 2) Past about 3-issue, it seems very hard for a compiler to keep all
> those issue slots busy with useful work, so they don't scale up

This depends on how much of the parallelism can be extracted as
instruction-level parallelism (ILP). Most applications with abundant
parallelism map very well to 10+ issue widths (VLIW or SIMD/vector),
at least if coded reasonably. The 3-issue ILP limit you mention is
for essentially *sequential* code.

For that matter, are you aware of any compiler that can keep more
than 3 _PEs_ busy? (And if you feel it's okay to manually
parallelize code for multiple single-issue PEs, then it also seems
reasonable to manually schedule VLIWs...)
> and then you are back to MPs.
> And wider issue VLIWs need a heroic compiler research program, which
> I don't have the resources to chase. (Since all I have is a hammer,
> everything looks like a nail to me.)

Well... What if your PEs have to communicate/synchronize a lot?
Many apps require that to be able to use the parallelism. Now,
within a single instruction stream, synchronization is implicit and
communication is easy and cheap. So with VLIW PEs, you can go a long
way by partitioning for minimum communication, and scheduling the
operations within a node. Yes, that's hard, but is it harder than
targeting your massively parallel shared-memory MP? Do you have a
compiler which can target one of your shared-memory clusters (from
your regular sequential HLL source)? Isn't this even _harder_ than
scheduling for VLIW? Aren't you just accepting that you'll have to
program your intra-cluster communication explicitly, and would it not
be fair, then, to accept manual scheduling for a VLIW PE also?
Your approach is beautiful and simple in concept. However, I'm not
buying your arguments that it is particularly efficient or easy to
use :-). That's true only for apps which map trivially onto that
particular structure, but then the same could be said for VLIWs or
any other programmable structure...
On a final note, there are several VLIW compilers/schedulers more or
less freely available, each with its particular limitations of course
(I don't have up-to-date pointers, but can ask a colleague for them
if you like).
> I suppose my pet hand-wavy application for these concept chip-MPs is
> lexing and parsing XML and filtering that (and/or parse table
> construction for same) -- see http://www.fpgacpu.org/usenet/re.html.

Very interesting indeed!