Here's a little thought experiment that I'm thinking of... Any ideas/solutions/considerations would be useful.
The current solution:
So I have a stack of PCs (8-10 PCs), each with its own hard drive, memory, CPU, network card, OS, etc. They each run a single real-time process (written in C/C++, currently compiled with .NET compilers) and share their memory via UDP Ethernet packets. Multi-threading is generally not used. The hard drive is only used during initialization. Each PC has its own OS, drivers, and overhead, which adds to the number of interrupts. The UDP-transferred packets are synchronously merged into each machine's shared memory at the start of each real-time cycle.
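To make the "merge at the start of each cycle" step concrete, here is a minimal C++ sketch of it. It is not the actual code: `Update`, `SharedMemory`, `merge_pending`, and the 64-byte block size are all hypothetical names and values made up for illustration, and the network receive path is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record: the payload of one received packet, already parsed.
struct Update {
    std::size_t offset;              // byte offset into the shared block
    std::vector<std::uint8_t> data;  // bytes to copy at that offset
};

// Hypothetical shared block; 64 bytes is an arbitrary size for the sketch.
struct SharedMemory {
    std::vector<std::uint8_t> bytes = std::vector<std::uint8_t>(64, 0);
};

// Apply every queued update at the start of a cycle, then clear the queue.
// Updates that would run past the end of the block are skipped.
// Returns the number of updates applied.
std::size_t merge_pending(SharedMemory& shm, std::vector<Update>& pending) {
    std::size_t applied = 0;
    for (const Update& u : pending) {
        if (u.offset > shm.bytes.size() ||
            u.data.size() > shm.bytes.size() - u.offset) {
            continue;  // out of range: ignore rather than corrupt memory
        }
        if (!u.data.empty()) {
            std::memcpy(shm.bytes.data() + u.offset, u.data.data(), u.data.size());
        }
        ++applied;
    }
    assert(applied <= pending.size());  // sanity check before dropping the queue
    pending.clear();
    return applied;
}
```

Because all updates are applied in one place before the cycle's computation starts, the process sees a consistent snapshot for the whole cycle. Any replacement for the UDP transport would need to preserve that property.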
The potential streamlined solution:
A single PC with multiple PCIe x16 buses, each running a graphics card. Each real-time application is transferred to a GPU.
- less hardware, less expensive
- memory sharing is not dependent on network performance, which gives a more reliable/predictable response when timing-dependent messages are sent.
Questions:
- Operating system: Do we need to load a small kernel on each GPU, or should we use the same one used by the CPU?
- Compilation: How can we compile the existing (platform-independent) C/C++ code for the specific GPU's architecture? Can we?
- Transferring: How do we transfer the instructions to the GPU at load time?
- Memory management: Should the shared memory be stored in each graphics card's GDDR5 memory and synced by an application on the CPU? Or should the shared memory be in the global DDR3 memory and accessed directly by the graphics card? Synchronous timing is an issue here.
- PCIe limits: The socket of the CPU will generally determine how many PCIe slots can be used at a time. What is the limit, and how high can we go? If we have multiple CPUs on one motherboard, can we increase the number of PCIe slots that are available?
Some of the other problems:
- GPUs are better at parallel computing than CPUs. Real-time applications often cannot utilize excessive parallel processing because the outcome of the real-time simulation needs to be extremely predictable. Going in this direction would require a complete rewrite of the existing software, which isn't feasible for me.
- CPUs are better at serial computing than GPUs. For that reason, I would expect a performance drop when using a GPU, but I'm not sure how significant this is or whether it is a risk to the project. Our previous configuration (an entire PC/OS per real-time process) was overkill.
Edits: Compiling: C++ AMP, Minotaurus, PGI, and CUDA looked promising until I realized that they only transfer specific parts of my code to the GPU for accelerated processing. They require an entire code rewrite and don't achieve what I want. It's looking tough to find that magic "-gpu" option that lets me transfer an entire binary to the GPU.
CPU cards: Some research showed me the TILEncore-Gx36 card. It uses a stand-alone CPU instead of a GPU and PCIe x8 instead of x16 which gives some more flexibility. This may be a more feasible solution after looking at the difficulties associated with running non-acceleration code on a GPU. The trick now is still mastering the load-time transmission of binaries to the PCIe card and the run-time synchronization of shared memory.
CUDA looks interesting, but from what I gather, it's an accelerator. If I had something like this:
    for (int i = 0; i < a; ++i)
        for (int j = 0; j < b; ++j)
            for (int k = 0; k < c; ++k)
                x[i] = y[j]*z[k]+x[j];
As a programmer, I would need to identify this as a case that can be accelerated with parallel computing, include the CUDA headers, then rewrite it using the CUDA API. That's not the same as taking an existing application, recompiling it for the GPU architecture, and running the entire application on that target.
I've also used OpenCL a few times. The syntax is based on C, but the way you program for a GPU is very different than the way you write regular code.
You can use OpenCL from many different languages, but you have to write the kernel itself in OpenCL's language.
When utilizing the power of the GPU, one of the most important things to figure out is which parts should go on the GPU and which should not. Memory transfer is a major bottleneck, and it depends not just on how much memory you transfer but also on how often you transfer chunks of memory.
You only want to run something on the GPU if you're processing large enough chunks of data to make the memory transfer worth it and the problem is highly parallelizable. You always want to keep memory on the GPU as long as possible. On the GPU you have hundreds or maybe even thousands of threads to use. It's not that easy to write a good GPU kernel. The gap between an optimized kernel and an inefficient one can be huge, depending on how well it's written.
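The trade-off above can be put into a back-of-the-envelope model. The following C++ sketch is only that: `OffloadEstimate` and its fields are hypothetical, the model assumes a fixed per-copy latency plus a bandwidth term, and every number fed into it would have to come from your own measurements.

```cpp
#include <cassert>

// All fields are placeholders to be filled in with measured values.
struct OffloadEstimate {
    double bytes_moved;        // total bytes copied to and from the device
    double transfers;          // number of separate copy operations
    double bandwidth_bytes_s;  // sustained bus bandwidth, bytes per second
    double latency_s;          // fixed cost per copy operation, seconds
    double cpu_time_s;         // time for the work on the CPU
    double gpu_time_s;         // time for the same work on the GPU, copies excluded
};

// Copy cost = per-transfer overhead + volume / bandwidth.
double transfer_time_s(const OffloadEstimate& e) {
    assert(e.bandwidth_bytes_s > 0.0);
    return e.transfers * e.latency_s + e.bytes_moved / e.bandwidth_bytes_s;
}

// Offloading only helps if GPU compute plus copies beats the CPU alone.
bool offload_pays_off(const OffloadEstimate& e) {
    return e.gpu_time_s + transfer_time_s(e) < e.cpu_time_s;
}
```

With the same data volume, moving it in two large copies versus a thousand small ones changes only the `transfers * latency_s` term, yet that alone can flip the answer. This is why transfer frequency matters as much as transfer size.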
You can't just run a regular program on the GPU. You can only find parts of a regular program that you can make parallel and use the GPU for those, similar to how you would with normal concurrent programming.
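As a rough illustration of what "finding a part you can make parallel" looks like, here is a plain C++ stand-in for the kernel/launcher split. It uses no GPU API: `scale_add_kernel` and `launch` are hypothetical names, and the loop is deliberately simpler than the one in the question, which reads `x[j]` while writing `x[i]` and so would need more care.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Kernel": computes one output element and touches no other element, so any
// two indices can be processed in any order, or at the same time.
inline void scale_add_kernel(std::size_t i,
                             const std::vector<float>& y,
                             const std::vector<float>& z,
                             std::vector<float>& out) {
    out[i] = y[i] * z[i] + out[i];
}

// Host-side launcher: on a GPU the runtime would spread the index range over
// many threads; here an ordinary loop stands in for that.
void launch(std::vector<float>& out,
            const std::vector<float>& y,
            const std::vector<float>& z) {
    assert(y.size() == out.size() && z.size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        scale_add_kernel(i, y, z, out);
    }
}
```

The rest of the program stays ordinary CPU code. Only the body that is independent per index is a candidate for the GPU.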
If your bottleneck is not something you can speed up on a GPU, then it won't really help you much.