RE: [fpga-cpu] Fpga and Cpu cores
- A list of some FPGA CPU cores is at www.fpgacpu.org/links.html.
To date I have not seen a published performance comparison chart. Were
anyone to publish a performance chart it would be instructive to state
* performance on some standard set of benchmarks. SPEC or EEMBC would be
nice except these are (IIRC) not inexpensive to do. Dhrystones would be
better than nothing, I suppose. Benchmarks consisting of unavailable-source
company-written inner loops are useless except for demonstrating peak
performance.
* size of core (logic cells and block RAMs)
* time to run each of the benchmarks (what really matters)
* harmonic mean time averages over the benchmarks
* frequency (over worst-case temperature and voltage conditions) (interesting
but not comparable)
* instructions per clock (interesting but not comparable)
* host system, including detailed description of memory subsystem latencies
* measured peak power and total energy required to run the benchmarks
* whether the data are simulated or measured on real machines
* how the core was prepared -- straight compilation of shipped source or, at
the other extreme, manually tweaked in the FPGA Editor?
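To make the "time to run" and "harmonic mean" items above concrete, here is a minimal sketch of how such a chart entry might be tallied. The benchmark names and times are invented purely for illustration, not measured data:

```python
# Sketch: summarizing per-benchmark run times for a comparison chart.
# Benchmark names and times below are made up for illustration only.
times = {"dhrystone": 1.8, "fir_filter": 0.9, "crc32": 2.4}  # seconds

n = len(times)
total = sum(times.values())  # total time: what really matters
harmonic_mean = n / sum(1.0 / t for t in times.values())

print(f"total time:    {total:.2f} s")
print(f"harmonic mean: {harmonic_mean:.2f} s")
```

Note the harmonic mean weights the short runs more heavily than an arithmetic mean would, which is why benchmark summaries usually state which mean was used.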
For a while to come, expect apples-to-oranges data that warrant considerable
skepticism. Company #1 will present simulated results for their core for
their fastest speed grade parts (expensive unobtainium), running entirely
on-chip, with programs and data in on-chip block RAM, on their best case
inner loops. Company #2 will present measured results in the context of a
real low-cost system using last year's slowest-speed grade device, running
standard benchmark programs out of external RAM.
As an inadequate starting point, see
www.fpgacpu.org/usenet/a-x-announce.html (size comparisons),
www.fpgacpu.org/xsoc/xr16.html (xr16 in SpartanXL-4: "25 MHz" - "40 MHz")
www.fpgacpu.org/xsoc2/log.html ("60 MHz" so far in Virtex-4, base
case at "41 MHz" in VirtexE-8), and
http://www.altera.com/document/ds/ds_excnios.pdf (Nios at "up to 50 MIPS and
50 MHz", presumably in fastest 20KE). A while back Damjan Lampret of
opencores.org claimed >100 MHz simulated frequencies in the fastest speed
grade of VirtexE (if I recall correctly) but I haven't seen any more recent
data now that their OR1K core is further along.
All this clock frequency data is inadequate for serious comparison purposes
because almost no one is stating instructions per clock data, and more
importantly, because the instruction sets are not comparable. (#1's
instructions may well do 30% more work per instruction than #2's). For
example, some 16-bit instruction word architectures (like xr16) require 2
instructions to form a 16-bit constant whereas some 32-bit instruction word
designs need only one. As another example, I can design a stack machine (even
a Java machine) that really screams, frequency- and IPC-wise, but if it
requires four instructions to fetch 2 local variables, add them, and store
the result to another local, it could underperform a 3-operation RISC
machine with twice the cycle time.
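The stack-vs-RISC comparison above comes down to total time = instruction count x CPI x cycle time. A minimal sketch, with all numbers illustrative rather than measured (and assuming the RISC already has its operands in registers):

```python
# Sketch of the apples-to-oranges point: raw MHz says little unless you
# also account for instructions per clock and work per instruction.
# All figures below are illustrative, not measured.

def run_time_ns(instructions, cpi, clock_mhz):
    """Total time = instruction count x cycles/instruction x cycle time."""
    cycle_ns = 1000.0 / clock_mhz
    return instructions * cpi * cycle_ns

# Stack machine: screams on frequency, but needs 4 instructions for
# "fetch two locals, add, store" (push a; push b; add; pop c).
stack = run_time_ns(instructions=4, cpi=1.0, clock_mhz=100.0)

# 3-operand RISC at half the clock: one add does the whole job when the
# operands already live in registers.
risc = run_time_ns(instructions=1, cpi=1.0, clock_mhz=50.0)

print(f"stack machine: {stack:.0f} ns, RISC: {risc:.0f} ns")
```

Even at half the clock rate, the RISC finishes first here, which is exactly why frequency alone is not comparable across instruction sets.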
Gray Research LLC
- --- In firstname.lastname@example.org, "Jan Gray" <jsgray@a...> wrote:
> A list of some FPGA CPU cores is at www.fpgacpu.org/links.html.
> To date I have not seen a published performance comparison chart.
> Were anyone to publish a performance chart it would be instructive to
> state performance on some standard set of benchmarks.
Part of the problem is that the architecture design is also a large
unknown in benchmarking. So many CPUs, so few details.
For example, here is a model of what I believe to be the FASTEST CPU
possible: a THREE-stroke, single-address machine:
1) Fetch the current instruction. (Memory access)
2) Calculate the effective address of the memory operand.
3) Load/store the memory operand. (Memory access)
Since benchmark results are never looked at, there is no need for any
ALU operations to be performed on the data. Since nobody wants to admit
an instruction takes 3 cycles, there is strong pressure to make things
look like they take 1 or fewer cycles per instruction. RISK... um, RISC
machines try to make step 3 vanish by having lots of registers on chip.
Pipelining of instructions tries to make step 2 vanish. Caching makes
step 1 vanish. Harvard machines (what RISCs want to be) do steps 1 and
3 at once. Immediate operands save steps 2 and 3. Forth machines ignore
step 2. And so forth with other computer designs. CPU designs have to
guess at what is the best design for the real world, and that can be
hard. I hope this model will help with the apples and the oranges of
computing.
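The three-stroke model and the step-eliminating techniques described above can be sketched as a toy cost model. The cycle counts and the technique-to-step mapping are a deliberate simplification of the post's argument, nothing more:

```python
# Toy version of the "three stroke" model: each technique is modeled as
# removing one of the three per-instruction steps. Cycle counts are
# illustrative only.

BASE_STEPS = {"fetch": 1, "effective_address": 1, "memory_access": 1}

# Which step each technique (roughly) makes vanish, per the post.
ELIMINATES = {
    "registers": "memory_access",      # RISC register files
    "pipelining": "effective_address",
    "caching": "fetch",
}

def cycles_per_instruction(techniques):
    steps = dict(BASE_STEPS)
    for t in techniques:
        steps[ELIMINATES[t]] = 0
    return sum(steps.values())

print(cycles_per_instruction([]))                           # bare machine: 3
print(cycles_per_instruction(["registers", "pipelining"]))  # down to 1
```

The point of the model is that every "1 cycle per instruction" claim is really a claim about which of the three strokes the design has managed to hide.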
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers"