Parallel Computing With CUDA Extensions (Part 2)

A “kernel” in CUDA terms can be thought of as a series of instructions to be carried out
by a computation unit on a GPU. Basically, it is a regular program.

1. You write kernels as if they were regular serial (top to bottom) programs, like
those designed to run on a single thread.

2. You tell the GPU to launch this code, and it will do so on multiple threads (you define how many).

But won’t this lead to the same code being executed multiple times?
Yes, it will, but that will be explained later.
If you write code to output “Hello World” and ask the GPU to run this on 50 threads,
you’ll be greeted back 50 times, and it will all happen in parallel.
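
A minimal sketch of that “Hello World” launch could look like the code below. The names helloKernel and runHello are made up for this illustration, and the sketch assumes a GPU that supports printf in device code. The order in which the 50 lines appear is not something to rely on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every thread created by the launch runs this same function.
__global__ void helloKernel()
{
    // threadIdx.x is this thread's index inside its block.
    printf("Hello World from thread %d\n", threadIdx.x);
}

// Hypothetical helper: launches the kernel on `threadCount` threads.
// Returns 0 on success, 1 on any CUDA error.
int runHello(int threadCount)
{
    // <<<blocks, threadsPerBlock>>>: one block of `threadCount` threads.
    helloKernel<<<1, threadCount>>>();

    // Launches are asynchronous; wait for the GPU so the output is flushed.
    cudaError_t status = cudaDeviceSynchronize();
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }
    return 0;
}

int main()
{
    return runHello(50);   // greeted back 50 times
}
```

The two numbers inside the triple angle brackets are where you define how many threads the GPU should start.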

But let’s say that your program code contains a vector of 50 floats.
For each of those floats you want the same thing done, as fast as possible.
You tell the GPU to spawn 50 threads to work on this kernel (program code).

Inside the kernel, when it runs on the GPU, each thread can pick out one vector member, and you have full control over which thread works with which member of the vector.

Each thread starts doing the work as instructed in the code received from the CPU.
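
As a rough sketch of the 50-float case, the code below lets each thread use its own index to pick one vector member. The job itself (doubling the value) and the names doubleElements and doubleOnGpu are assumptions for illustration only. To keep it short, the sketch uses cudaMallocManaged (managed memory, available in later CUDA versions), so the same pointer is usable on both CPU and GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element job: double the value.
__global__ void doubleElements(float *values, int count)
{
    int i = threadIdx.x;              // this thread's own vector member
    if (i < count) {
        values[i] = values[i] * 2.0f;
    }
}

// Doubles `count` floats with one GPU thread per element.
// Assumes count fits in a single block (at most 1024 threads).
bool doubleOnGpu(float *hostValues, int count)
{
    float *managed = nullptr;
    if (cudaMallocManaged(&managed, count * sizeof(float)) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < count; ++i) managed[i] = hostValues[i];

    doubleElements<<<1, count>>>(managed, count);     // 1 block, `count` threads
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    if (ok) {
        for (int i = 0; i < count; ++i) hostValues[i] = managed[i];
    }
    cudaFree(managed);
    return ok;
}

int main()
{
    const int count = 50;
    float values[count];
    for (int i = 0; i < count; ++i) values[i] = (float)i;

    if (!doubleOnGpu(values, count)) {
        fprintf(stderr, "GPU run failed\n");
        return 1;
    }
    printf("values[49] is now %.1f\n", values[49]);
    return 0;
}
```

Changing the expression that computes i is how you control which thread works on which member.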

As an example, let’s say that when running the code on a CPU only, the CPU would have to traverse the vector members one by one, do the job needed, and continue on to the next member to do the same.
The total execution time for the task would vary based on how busy the CPU was and other factors, but let’s assume that we have 50 members that each need 10 ms to be processed.

This would take 50 x 10 ms (500 ms) to complete, as we work in a non-parallel way.
If we compare this to how the execution would be done in a GPU assisted way,
the time needed to process each element might be a bit higher, because a single GPU thread is generally not as fast as a regular CPU thread, so let’s say 20 ms per member.

The difference is that because these tasks are all started in parallel, the GPU would finish processing the whole vector of 50 members in just 20 ms (assuming it can run all 50 threads at once), while the CPU would need 10 ms x 50 members, giving us 500 ms!

To not lose focus, it might help to visualize situations in programming that could benefit from being able to do several identical tasks at the same time.

One thing that comes to my mind is image editing applications. When you have an image consisting of millions of pixels, there will be many thousands of pixels that share the same characteristics / properties, like color and brightness.
If you were to write a function to lighten or change the color of all those equal pixels, you’d basically have a job that could benefit from being executed simultaneously, rather than doing the same thing to each pixel in a linear fashion.
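
A hedged sketch of such a “lighten” function is shown below. It simplifies things by brightening every pixel of a grayscale image rather than only a selected group, and the names brighten and brightenImage, the 256 threads per block, and the use of managed memory are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds `amount` to one grayscale pixel per thread, clamping at 255.
__global__ void brighten(unsigned char *pixels, int pixelCount, int amount)
{
    // Global index: which pixel this thread owns across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < pixelCount) {             // the last block may have spare threads
        int value = pixels[i] + amount;
        pixels[i] = (unsigned char)(value > 255 ? 255 : value);
    }
}

// Brightens an image stored in managed memory; returns true on success.
bool brightenImage(unsigned char *pixels, int pixelCount, int amount)
{
    const int threadsPerBlock = 256;
    const int blocks = (pixelCount + threadsPerBlock - 1) / threadsPerBlock;
    brighten<<<blocks, threadsPerBlock>>>(pixels, pixelCount, amount);
    return cudaDeviceSynchronize() == cudaSuccess;
}

int main()
{
    const int width = 1920, height = 1080;
    const int pixelCount = width * height;   // roughly two million pixels

    unsigned char *pixels = nullptr;
    if (cudaMallocManaged(&pixels, pixelCount) != cudaSuccess) return 1;
    for (int i = 0; i < pixelCount; ++i) pixels[i] = (unsigned char)(i % 256);

    bool ok = brightenImage(pixels, pixelCount, 40);
    if (ok) printf("pixel 0: %d, pixel 250: %d\n", pixels[0], pixels[250]);

    cudaFree(pixels);
    return ok ? 0 : 1;
}
```

Because an image has far more pixels than one block can hold threads, the launch uses many blocks, and each thread combines its block index and thread index to find its pixel.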

Usually, when programming using only the CPU, launching and running threads in parallel is considered an expensive and cumbersome activity.
The whole point of using the GPU as a processing unit for “regular” tasks is that it’s very good at certain things, like these two:

1. Launching a lot of threads (and “a lot” is MANY, think thousands)
2. Actually running these threads in parallel

So GPUs make perfect candidates for doing the kind of processing that regular CPUs are lacking in.

For those learning about programming, maybe as a student or on their own, I seriously believe that there will soon be heavy demand for competent C/C++ programmers who know how to program using GPU assistance, and that this will continue into the foreseeable future.

C and C++ might be lower-level than the languages you find most comfortable to use, but the truth is that even though these statically typed compiled languages have experienced a drop in general interest the last ten years, they’re now on the rise again thanks to technologies like this and because of the importance of power consumption (performance per watt) on modern handheld devices.

C++ is one of the most efficient languages to use for low power consumption devices (if done right) compared to other high-level languages in use today, and many large companies now invest huge sums of money in the driving forces behind these languages.

The future is mobile and the future is (hopefully) green.
To achieve this, we also need to start making software that’s green and environmentally friendly.

I hope this article has made you more interested in learning about GPU assisted processing using tools such as CUDA or OpenCL.

There’s more in the world than an Apple.

Parallel Computing With CUDA Extensions (Part 1)

First, let’s see how to rate a CPU in a parallel way of thinking.

Let’s say we have an eight-core Intel CPU.

Each core can execute 8 operations at once (wide AVX vector operations),
and each core has support for running two threads in parallel via Intel “HyperThreading” technology, so you get:

8 cores * 8 operations/core * 2 threads = 128, and end up with what’s called
“128-Way Parallelism”

For more about Advanced Vector Extensions (AVX) in CPUs, check this page.

Programming without taking advantage of ANY multithreading / parallel processing
techniques means that each program you run does one operation at a time on one thread, which is
1/128 of your CPU’s total parallel capacity.

Even if you count both “HyperThreading” slots of the core it runs on, that is only 2/128 = 1/64.

In an ordinary C/C++ program you can only run code that uses the CPU as
the computing resource.
If people really took advantage of their cores and threading capabilities, this would
probably be enough for most regular applications, but for applications that do a lot of
heavy calculations, like video / image processing or 3D graphics, it’s way better if you can
offload some of these tasks to the simpler (in terms of instructions), but very capable GPU(s) in your machine.

One way to do this is through the use of CUDA extensions.

In this model, the CPU is considered the “HOST” and each GPU is a “DEVICE”
in your system that can be used for doing calculations.
When such a program is compiled, instructions for both the HOST and any DEVICE
are created.
In CUDA the GPU/DEVICE is seen as a “CO-PROCESSOR” to the CPU/HOST.
The model also assumes that the HOST and DEVICE have access to separate physical
memory where they can store data.
The DEVICE memory is typically a very high-speed block of memory, with more bandwidth than the one
on the HOST.

The HOST is “in charge” in CUDA and sends messages to the DEVICE telling it what to do.
The HOST takes care of:

1. Moving data from CPU memory -> GPU memory
2. Moving data from GPU memory -> CPU memory
(CUDA’s version of C’s memcpy() is cudaMemcpy())
3. Allocating GPU memory
(again, CUDA uses cudaMalloc() instead of malloc())
4. Launching “kernels” on the GPU (in CUDA, the HOST launches “kernels” on the DEVICE)

A typical flow in a CUDA application would be something like:

1. CPU allocates memory on the GPU with cudaMalloc
2. CPU copies input data from CPU->GPU with cudaMemcpy
3. CPU launches the “kernels” on the GPU (kernel launch)
4. CPU copies results back from GPU->CPU with cudaMemcpy
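
The four steps above can be sketched roughly as follows. The kernel addOne, the helper addOneOnGpu, and the choice of 128 threads per block are assumptions for illustration; only cudaMalloc, cudaMemcpy, and the kernel launch come from the flow itself. The sketch assumes a non-empty input vector.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to each element, one element per thread.
__global__ void addOne(float *data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1.0f;
}

// Runs the four-step flow on `host`; returns true if every CUDA call succeeded.
bool addOneOnGpu(std::vector<float> &host)
{
    const int count = (int)host.size();
    const size_t bytes = count * sizeof(float);
    float *device = nullptr;

    // 1. Allocate GPU memory.
    if (cudaMalloc(&device, bytes) != cudaSuccess) return false;

    // 2. Copy the input data CPU -> GPU.
    bool ok = cudaMemcpy(device, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // 3. Launch the kernel on the GPU.
        const int threads = 128;
        const int blocks = (count + threads - 1) / threads;
        addOne<<<blocks, threads>>>(device, count);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // 4. Copy the results GPU -> CPU (this copy waits for the kernel to finish).
    if (ok) {
        ok = cudaMemcpy(host.data(), device, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(device);
    return ok;
}

int main()
{
    std::vector<float> values(1000);
    for (int i = 0; i < 1000; ++i) values[i] = (float)i;

    if (!addOneOnGpu(values)) {
        fprintf(stderr, "GPU run failed\n");
        return 1;
    }
    printf("values[0] = %.1f, values[999] = %.1f\n", values[0], values[999]);
    return 0;
}
```

Note that the HOST pointer and the DEVICE pointer are kept separate, which mirrors the separate physical memories described earlier.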

So, what is this “Kernel” stuff all about?

Guess we’ll find out in part 2 of this series…

Dynamic Parallelism in CUDA Version 5!

After using all my spare time on Blender lately, I’m now going to digress into another realm.

After reading a TechBrief at the Nvidia Cuda Developer Society I had to wrap my head around something other than modeling, just for a little while! 🙂

In CUDA Version 5, you can now call a CUDA kernel from within another, without going via the CPU.
The “parent” kernel will launch a “child” grid, which can itself also create new work to form an execution hierarchy. The “parent” will only signal as completed once all children are done processing.
The possible recursion depth depends on your GPU resources.

So, Dynamic Parallelism in CUDA 5 lets a CUDA kernel create (and sync) nested work via the device runtime API: it can trigger other kernels, perform memory management on the device, and create streams and events, all without going back to the CPU.
A CUDA kernel can also call GPU libraries such as cuBLAS directly, without any CPU intervention.

The Device Runtime API in CUDA C/C++ is a subset of the CUDA Runtime API for the Host, keeping the same syntax for easy code reuse.

Here is an example of calling a kernel from within a kernel:

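// Child kernel: launched from the device, not from the host.
__global__ void KernelChild(void *data) {
    // Do something
}

// Parent kernel: one thread per block launches a child grid.
__global__ void KernelParent(void *data) {
    if (threadIdx.x == 0) {
        KernelChild<<<1, 32>>>(data);
        // Wait for the child grid. cudaThreadSynchronize() is deprecated;
        // cudaDeviceSynchronize() is the CUDA 5 device runtime call, and
        // newer toolkits deprecate device-side synchronization as well.
        cudaDeviceSynchronize();
    }
    __syncthreads();  // the rest of the block waits for thread 0
    // Do something
}

// On Host
KernelParent<<<8, 32>>>(data);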
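
For something you can compile and check, here is a small self-contained sketch along the same lines. The names childFill, parentLaunch, and fillViaChild are made up for the example. It avoids synchronizing inside the parent and instead relies on the rule that the parent only counts as completed once its children are done. It assumes a GPU with compute capability 3.5 or higher and a build with relocatable device code, for example: nvcc -rdc=true -arch=sm_35 example.cu -lcudadevrt (adjust -arch to your GPU).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child grid: each thread writes its own index into the buffer.
__global__ void childFill(int *out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = i;
}

// Parent grid: a single thread launches the child grid from the device.
__global__ void parentLaunch(int *out, int count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 32;
        int blocks = (count + threads - 1) / threads;
        childFill<<<blocks, threads>>>(out, count);
    }
    // No explicit wait: the parent grid is not complete until the child is.
}

// Host side: launches only the parent, then reads back the child's work.
bool fillViaChild(int *hostOut, int count)
{
    int *deviceOut = nullptr;
    const size_t bytes = count * sizeof(int);
    if (cudaMalloc(&deviceOut, bytes) != cudaSuccess) return false;

    parentLaunch<<<1, 32>>>(deviceOut, count);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;   // waits for parent and child
    if (ok) {
        ok = cudaMemcpy(hostOut, deviceOut, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(deviceOut);
    return ok;
}

int main()
{
    const int count = 100;
    int result[count];
    if (!fillViaChild(result, count)) {
        fprintf(stderr, "GPU run failed\n");
        return 1;
    }
    printf("result[0] = %d, result[99] = %d\n", result[0], result[count - 1]);
    return 0;
}
```

The host never launches childFill itself, yet the buffer it reads back was filled by the child grid.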

Reducing the traffic between the GPU and CPU over the PCI Express bus should bring a key performance boost for things like fluid dynamics simulations or similar workloads that require pre-processing passes over the data.

GPU Computing rocks!

Happy Summer Holidays!