Dynamic Parallelism in CUDA Version 5!

After using all my spare time on Blender lately, I’m now going to digress into another realm.

After reading a TechBrief at the NVIDIA CUDA Developer Society, I had to wrap my head around something other than modeling, just for a little while! 🙂

In CUDA Version 5, you can now call a CUDA kernel from within another, without going via the CPU.
The “parent” kernel launches a “child” grid, which can itself create new work to form an execution hierarchy. The parent is only considered complete once all of its children have finished processing.
How deep the nesting can go depends on your GPU's resources, and Dynamic Parallelism itself needs a GPU with compute capability 3.5 or higher.
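To make the execution hierarchy a bit more concrete, here is a minimal sketch of a kernel that launches itself a few levels deep. The names nestedKernel, countNestedLaunches and visits are my own placeholders, not anything from the TechBrief, and I'm assuming a build along the lines of nvcc -rdc=true example.cu -lcudadevrt with an -arch flag that matches your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: records a visit, then launches itself one level deeper.
__global__ void nestedKernel(int depth, int maxDepth, int *visits)
{
    atomicAdd(visits, 1);                 // one visit per nesting level
    if (depth < maxDepth) {
        // Child grid launched from the device, no CPU involved.
        nestedKernel<<<1, 1>>>(depth + 1, maxDepth, visits);
    }
}

// Returns the number of kernel invocations, or -1 on a CUDA error.
int countNestedLaunches(int maxDepth)
{
    int *dVisits = nullptr;
    int hVisits = -1;
    if (cudaMalloc(&dVisits, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dVisits, 0, sizeof(int));

    nestedKernel<<<1, 1>>>(0, maxDepth, dVisits);

    // The host-side sync returns only after the whole hierarchy is done,
    // because a parent grid is not complete until its children are.
    if (cudaDeviceSynchronize() == cudaSuccess) {
        cudaMemcpy(&hVisits, dVisits, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dVisits);
    return hVisits;
}

int main()
{
    printf("kernel invocations: %d\n", countNestedLaunches(4));
    return 0;
}
```

With maxDepth set to 4 there are five levels (0 through 4), so the count should come out as 5. The host only ever launches the top-level kernel; everything below it is created on the GPU.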

So, Dynamic Parallelism in CUDA 5 lets a CUDA kernel create (and sync on) nested work through the device runtime API: it can launch other kernels, manage memory on the device, and create streams and events, all without a round trip to the CPU!
A CUDA kernel can also call GPU libraries such as cuBLAS directly, without any CPU intervention.
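As a rough illustration of creating streams on the device, the sketch below has each parent thread launch a child grid into its own stream, so the children are free to run concurrently rather than queueing up in the block's shared default stream. The kernels scaleChunk and parentWithStreams and the helper runStreamDemo are hypothetical names of mine. As far as I know, device code has to create streams with cudaStreamCreateWithFlags and the cudaStreamNonBlocking flag.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical child kernel: scales one chunk of the data in place.
__global__ void scaleChunk(float *chunk, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= factor;
}

// Hypothetical parent kernel: thread c launches a child for chunk c.
__global__ void parentWithStreams(float *data, int chunkSize, int numChunks, float factor)
{
    int c = threadIdx.x;
    if (c >= numChunks) return;

    cudaStream_t s;
    // Streams created on the device must be non-blocking.
    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess) return;

    int blocks = (chunkSize + 127) / 128;
    scaleChunk<<<blocks, 128, 0, s>>>(data + c * chunkSize, chunkSize, factor);

    // Safe to call right away: queued work still runs before the stream goes away.
    cudaStreamDestroy(s);
}

// Returns true if every element was scaled from 1.0 to 2.0.
bool runStreamDemo()
{
    const int numChunks = 4, chunkSize = 256, n = numChunks * chunkSize;
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parentWithStreams<<<1, numChunks>>>(d, chunkSize, numChunks, 2.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    if (ok) cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);
    return ok;
}

int main()
{
    printf("stream demo %s\n", runStreamDemo() ? "passed" : "failed");
    return 0;
}
```

I left out any device-side waiting on purpose: the host's cudaDeviceSynchronize() is enough here, because the parent grid does not finish until all of its children have.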

The Device Runtime API in CUDA C/C++ is a subset of the CUDA Runtime API for the Host, keeping the same syntax for easy code reuse.

Here is an example of calling a kernel from within a kernel:

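// Needs relocatable device code, e.g.: nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt
__global__ void KernelChild(void *data) {
    // Do something
}

__global__ void KernelParent(void *data) {
    if (threadIdx.x == 0) {
        // One thread per block launches the child grid.
        KernelChild<<<1, 32>>>(data);
        // Wait for the child to finish. cudaThreadSynchronize() is deprecated
        // in favour of cudaDeviceSynchronize(). (Much later toolkits, from
        // CUDA 11.6 on, deprecate this device-side wait as well.)
        cudaDeviceSynchronize();
    }
    __syncthreads();  // the rest of the block waits for thread 0
    // Do something
}

// On Host
KernelParent<<<8, 32>>>(data);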

Reducing the traffic between the GPU and the CPU over the PCIe bus should bring a key performance boost for things like fluid dynamics simulations, or anything else that needs pre-processing passes over the data.
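Here is a small sketch of what such a multi-pass job could look like when the host launches a single kernel and the GPU sequences the passes itself. The kernels prePass, mainPass and pipeline and the helper runPipelineDemo are made-up names, and the arithmetic is just a stand-in for real pre-processing. It relies on launches made by the same thread into the same stream executing in order.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical pre-processing pass: add an offset to every element.
__global__ void prePass(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Hypothetical main pass: runs on the pre-processed data.
__global__ void mainPass(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical driver kernel: a single thread queues both passes.
__global__ void pipeline(float *d, int n)
{
    int blocks = (n + 127) / 128;
    // Same thread, same (default) stream: mainPass starts after prePass ends.
    prePass<<<blocks, 128>>>(d, n);
    mainPass<<<blocks, 128>>>(d, n);
}

// Returns true if every element went from 3.0 to (3.0 + 1.0) * 2.0 = 8.0.
bool runPipelineDemo()
{
    const int n = 1000;
    std::vector<float> h(n, 3.0f);
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    pipeline<<<1, 1>>>(d, n);   // the only launch issued by the CPU
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    if (ok) cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 8.0f);
    return ok;
}

int main()
{
    printf("pipeline demo %s\n", runPipelineDemo() ? "passed" : "failed");
    return 0;
}
```

The data is copied to the GPU once and copied back once. In a real simulation the driver kernel could also decide on the device how many passes to run, which is where skipping the CPU round trips starts to pay off.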

GPU Computing rocks!

Happy Summer Holidays!
