Re: A fresh look at GPUs and OpenCL

  • spoonsx21
    Message 1 of 7 , Jul 4, 2013
      I had written some code to convert a feed-forward neural network into a vertex shader (in GLSL) that calculates the CPPN queries in a massively parallel fashion. One benefit is that you can run such a process inside a browser using WebGL. Beyond that, almost every platform has access to OpenGL and can run vertex shaders (anything from OpenGL 2.x up).

      All you really need to do is compile a neural network down to the math functions the network represents. From there, you are almost as optimized as possible in software.

      I found that this boosted query speed 10x inside of JavaScript on a MacBook Pro. I also think it's probably one of the best optimizations possible on the software side. The next level of optimization would be to split up massive functions by dependencies and "layers" and run them across multiple servers, each with multiple GPUs, with some architecture to pool those queries together (similar to a map-reduce). You would need a huge network for that to be worth it.

      However, for small CPPNs, compiling to pure math functions and running a shader is one of the least complicated speedups, in my opinion. You don't have to deal with hardware or CUDA compiling. You need code to convert a network to a GLSL string, and then a program that loads the shader string and can pull out an array of outputs after the calculations.
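
A minimal sketch of the "convert a network to a GLSL string" step described above, written in Python for illustration. The node format, the `compile_to_glsl` name and the activation table are all assumptions made for this example, not the code from the post. It emits only a GLSL function; loading the shader and reading back the outputs are left out.

```python
# Hypothetical activation table: maps a name to a GLSL expression template.
ACTIVATIONS = {
    "linear": "{x}",
    "sin": "sin({x})",
    "gauss": "exp(-{x} * {x})",
    "sigmoid": "1.0 / (1.0 + exp(-{x}))",
}


def glsl_float(value):
    """Format a number as a GLSL float literal, parenthesising negatives."""
    text = repr(float(value))
    return f"({text})" if text.startswith("-") else text


def compile_to_glsl(inputs, nodes, result):
    """Emit a GLSL function for a feed-forward network.

    nodes is a list of (name, activation, [(source, weight), ...]) tuples in
    topological order, so every source is declared before it is used.
    """
    params = ", ".join(f"float {name}" for name in inputs)
    lines = [f"float cppn({params}) {{"]
    for name, activation, links in nodes:
        terms = [f"{glsl_float(weight)} * {source}" for source, weight in links]
        weighted_sum = " + ".join(terms) if terms else "0.0"
        # One temporary per node holds the weighted sum of its inputs.
        lines.append(f"    float sum_{name} = {weighted_sum};")
        expression = ACTIVATIONS[activation].format(x=f"sum_{name}")
        lines.append(f"    float {name} = {expression};")
    lines.append(f"    return {result};")
    lines.append("}")
    return "\n".join(lines)


# Example: a two-node CPPN over inputs x, y and d (an assumed distance input).
example_nodes = [
    ("n0", "sin", [("x", 2.0), ("y", -1.5)]),
    ("n1", "sigmoid", [("n0", 1.0), ("d", 0.5)]),
]
shader_source = compile_to_glsl(["x", "y", "d"], example_nodes, "n1")
print(shader_source)
```

The resulting string would still need to be embedded in a full vertex shader and compiled by whatever WebGL or OpenGL host code loads it. Node names such as `out` or `input` are avoided here because they are reserved words in GLSL.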


      --- In neat@yahoogroups.com, "Ken" <kstanley@...> wrote:
      > Hi Colin, one other interesting thing I'd throw in here is the looming bottleneck in neuroevolution research of extremely large networks. We're not really there yet today, but at some point making progress will sometimes require investigating networks with millions or more connections (maybe evolved by HyperNEAT or something HyperNEAT-like). Simply querying all the weights with the CPPN (which means activating the CPPN millions of times) could then become a big obstacle to running experiments. Even if there was some way to streamline just that one aspect of the algorithm it could make a big difference in the future.
      > Best,
      > ken
      > --- In neat@yahoogroups.com, Colin Green <colin.green1@> wrote:
      > >
      > > Hi all,
      > >
      > > I know the topic of GPU use has come up before but there have been a
      > > few recent developments in the GPU world so I thought it would be
      > > interesting to review the current situation.
      > >
      > > [CUDA]
      > > CUDA has been mentioned previously and I think Ken Lloyd did some work
      > > using CUDA in NEAT, but AFAIK none of the NEAT implementations freely
      > > available are using GPUs at all (please correct me if I'm wrong). CUDA
      > > was notable as being the first platform to provide a general computing
      > > platform/layer over GPUs rather than being graphics acceleration
      > > specific. As such it greatly lowered the difficulty of using GPUs for
      > > general computing. A notable point is that CUDA is specific to NVIDIA
      > > GPUs.
      > >
      > > [OpenCL]
      > > OpenCL is a more recent development that aims to provide an openly
      > > defined GPGPU style platform. OpenCL then is a layer of abstraction
      > > from the hardware that allows GPGPU style code to be written
      > > independently of any specific h/w and to be executed on any h/w
      > > supporting OpenCL. At this time there is already a lot of support, e.g.
      > > there is support for NVIDIA and ATI/AMD GPUs, IBM's Cell processor
      > > based accelerator 'blades', and you can also run OpenCL code on a
      > > 'normal' Intel multicore CPU (which may be more useful for
      > > testing/development than acceleration?).
      > >
      > > The main issue with OpenCL is that it is an abstraction over
      > > diverse hardware, thus although a program may run it may not run very
      > > fast without specific knowledge of the underlying h/w and what its
      > > strengths and weaknesses are. E.g. if code accesses more RAM than is
      > > available to each processor then OpenCL will simply compile in
      > > instructions to copy data between local and main RAM thus eliminating
      > > the perf gain of using local RAM. OpenCL does provide for querying the
      > > underlying h/w for some of these factors, so you could in principle
      > > perform a set of checks and report that the h/w isn't suitable for
      > > your program, or maybe even dynamically adjust the program code based
      > > on reported parameters.
      > >
      > > On the whole though I see OpenCL as a positive development and
      > > something the NEAT community can potentially benefit from. It is still
      > > a relatively young platform and therefore may present some challenges
      > > to code to as it develops, but I think it's mature and stable enough
      > > to consider experimenting with now.
      > >
      > >
      > > [Current GPU h/w]
      > > As a ballpark estimate of the sort of performance gains a GPGPU can
      > > give us, ATI/AMD's current flagship card (Radeon 7970) has a peak
      > > throughput of about 3.8 TFlops, compared to 100 GFlops for a 4th
      > > generation quad core Intel i7. So on paper we're looking at a possible
      > > 38x speedup compared to top flight CPUs. However, OpenCL does support
      > > utilising multiple GPUs, e.g. in the Bitcoin mining world it's typical
      > > to have 4 and sometimes 5 GPUs in one system (using PCI 'riser' cables
      > > to distance the GPUs from the motherboard). So for a relatively modest
      > > investment you could be looking at a possible 100x speedup compared to
      > > current best CPUs.
      > >
      > >
      > > [NEAT and GPUs]
      > > My instinct here is to modify current NEAT code to report stats on how
      > > much time proportionally is being spent in each stage of the NEAT
      > > algorithm and to target the code that takes up the most time, this
      > > will be different across problem domains and also for NEAT versus
      > > HyperNEAT.
      > >
      > > Certainly if a problem domain is known to be CPU heavy (e.g. uses a
      > > physics simulation) then it's probably a no-brainer to use OpenCL for
      > > that in isolation from the rest of the NEAT algorithm. For NEAT itself
      > > I've observed slowdown as ANNs grow in size and this is presumably
      > > mostly due to time to decode and/or 'run' the ANNs, and this is of
      > > course a greater problem in HyperNEAT where the decode stage consists
      > > of a NEAT decode and ANN activation. So there might be some scope for
      > > using OpenCL there. One can envisage multiple GPUs where one may be
      > > dedicated to problem domain physics, one to ANN activation and another
      > > to ANN genome decoding (say).
      > >
      > >
      > > [Typical GPU Architecture]
      > > Finally I'm going to briefly describe the architecture of the Radeon
      > > 7970 to give an idea of what it is capable of.
      > > [Mainly taken from
      > > http://www.techradar.com/reviews/pc-mac/pc-components/graphics-cards/amd-radeon-hd-7970-1049734/review/2]
      > >
      > > The 7970 has:
      > >
      > > 32 x Compute Units (CUs). These are completely independent of each
      > > other. If you have 2x GPUs then OpenCL will see (I think) a block of
      > > 64 compute units, hence in some cases code can be accelerated just by
      > > adding GPUs. Each CU has:
      > >
      > > 4 x Vector Units (VUs). And each VU has:
      > > 16 x Unified shaders (unified here just means they are no longer
      > > specific to a task, e.g. pixel or vertex shader; they are general
      > > purpose processors)
      > >
      > > So in total there are 32 x 4 x 16 = 2048 unified shaders.
      > >
      > > Each CU has 64kB of local RAM that all of the VUs can access
      > > (typically for reading shared state data I would guess). In addition
      > > each VU has its own 64 kB of RAM (note: you would typically control
      > > what's in these local memories in code, that is, it's not a passive
      > > CPU cache). A vector unit is basically a SIMD processor: there is one
      > > set of instructions that are executed against all 16 shaders (so e.g.
      > > you can 'shade' 16 pixels at a time). So each of the 128 vector units
      > > can execute its own instructions, and in turn those instructions are
      > > operating on 16 shaders. A shader then consists of some minimal state
      > > data specific to it and the data it is operating on, and also
      > > execution units for performing arithmetic, etc.
      > >
      > > An interesting thing about vector units is that conditional branches
      > > are allowed in OpenCL, that is, you can have some shaders executing a
      > > different path despite there being only one set of instructions and
      > > one instruction pointer. However this is merely a trick: if VU code
      > > contains a branch then both branches are executed for all shaders and
      > > the shaders are assigned the correct final result based on which
      > > branch they should have followed. Hence it's advisable to avoid
      > > branches, but it's a nice feature to have available so long as you
      > > don't abuse it.
      > >
      > > For more info see:
      > > [From Shader Code to a Teraflop: How Shader Cores Work, Kayvon
      > > Fatahalian, Stanford University]
      > > [http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf]
      > >
      > > There's obviously a heck of a lot more to this subject than I've
      > > described but I thought this might be a reasonably good intro to
      > > current possibilities around GPU use in NEAT.
      > >
      > > Colin
      > >
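
Colin's point about branches in a vector unit can be illustrated with a small software emulation. This is only a sketch in plain Python, not OpenCL: the `run_vector_unit` function and the example branch are made up for illustration, and real hardware details vary. Every lane computes both paths, and a per-lane mask then picks which result each lane keeps.

```python
def run_vector_unit(values):
    """Emulate the lanes of one vector unit running: y = 2*x if x > 0 else -x."""
    # Per-lane condition: which branch each lane "should" take.
    mask = [x > 0.0 for x in values]
    # All lanes execute the 'then' path...
    then_results = [2.0 * x for x in values]
    # ...and all lanes also execute the 'else' path.
    else_results = [-x for x in values]
    # Each lane keeps only the result of the branch it should have followed.
    return [t if m else e for m, t, e in zip(mask, then_results, else_results)]


lanes = [1.0, -2.0, 3.0, -4.0]
print(run_vector_unit(lanes))  # [2.0, 2.0, 6.0, 4.0]
```

The work done is the sum of both paths regardless of how many lanes take each one, which is why the advice in the quoted message is to keep branches rare.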