Working heavily on pixel shaders over the last few days, I came to realize some facts about the current way of treating GPUs, how that way is inefficient and impractical, and how it really doesn’t allow us to fully harness their power.
The progress of GPUs
When the first graphics accelerators came to light some years ago, the GPU was fixed wiring that rasterized pixels into video memory. You could only change the input to that rasterization pipeline. That changed with the coming of the GeForce3. That GPU gave programmers the ability to somehow alter that fixed wiring. You had access to smaller parts of the pipeline and could do some routing of the data between those components. That was fun…
Today we can actually run code on the GPUs. The rendering pipeline is no longer stiff; it is flexible. You can consider each of these pipelines a little CPU, one with only a small set of SIMD instructions. Modern GPUs have many such “CPUs” that run in parallel, all reading from a common video memory, but each writing to its own spot (its assigned pixel, for example). So what we have is a parallel computer with a specific topology of CPUs. And this is how the astronomical floating-point-operations-per-second figures are achieved.
Making use of parallel algorithms
With this functionality you can do almost whatever you like. Working the last few days on the HDRR implementation in Sylphis3D, I managed to make the GPU do all sorts of things, and I realized that I was actually trying to program a parallel computer.
Take, for example, finding the maximum among n numbers. You do that with a loop that iterates over the numbers and compares each one to the largest found so far. This is an algorithm with O(n) time complexity, which means it scales linearly with the amount of data. Suppose now that you have m CPUs; how would you take advantage of them to bring the time complexity down? Simple… you assign 2 numbers to each CPU, and each CPU outputs the larger of the two. The outputs are fed into the CPUs that become available, until one number remains. That is the maximum. What happened to the time complexity? Suppose you have n/2 CPUs: with each execution cycle you halve the data, so to get down to one number you need log n steps. The time complexity is now O(log n) (much faster than linear) at the expense of O(n) processors (as the parallel-algorithms people say).
The above algorithm comes naturally with GPU programming. You write all the values into a texture and write a program (shader) that takes, for example, 4 texels of that texture, finds the largest, and writes the output to a pixel of another texture with half the dimensions. You repeat until you end up with a 1×1 texture that contains the maximum. The more pipelines (processors) the GPU has, the faster we go.
You realize now that today we actually have a powerful parallel machine in an AGP/PCI slot in our PCs, one we can use to accelerate many things beyond graphics. What don’t we have? A decent way of programming that hardware.
The restricting programming model of textures, vertex buffers, etc.
The sad thing is that there is no good, nice way to program the GPU. Because programmable GPUs are supposed to be an extension of the fixed-function GPUs of the previous decade, we are stuck working with concepts bound to graphics, the very concepts the first GPUs were designed around.
To do the simplest thing you are forced to go through textures, vertex buffers and other irrelevant things. To use the maximum example above, just consider how a single floating-point value is represented: as a 1×1 texture. Allocating this in OpenGL, for example, is a nightmare. Lines and lines of code to set up an FBO and bind a texture to it, when at the end of the day you just want to allocate a float in memory! It’s a bit overkill, don’t you think?
What would be nicer is to finally get a memory allocation routine for video RAM. The driver should finally expose video memory management to the programmer. Raw… to let you do what you want with it. This is how GPUs work today; it’s just the API that gets in the way. The GPUs have grown so far beyond an API designed for much less programmable graphics hardware that it now seems we are trying to fit a camel through a pin’s head.
With this kind of hardware I could do physics on the GPU, but I won’t. I won’t because I’m not masochistic enough to implement it through FBOs, textures and vertex buffers, especially when I know there is no real hardware restriction forcing me to do it that way…