Parallel Computing With CUDA Extensions (Part 1)


Parallel Computing With CUDA Extensions (Part 1)

First, let’s see how to rate a CPU in a parallel way of thinking.

Let’s say we have an eight Core Intel CPU.

With eight cores, you can execute 8 operations (Wide AVX vector operations) per core,
and each core has support for running two threads in parallel via Intel “HyperThreading” technology, so you get:

8 cores * 8 operations/core * 2 threads and end up with what’s called
“128-Way Parallelism”

For more about AdvancedVectoreXtentions (AVX) in CPU’s, check this page.

Programming without taking advantage of ANY multithreading / parallel processing
techniques, means that for each program you run, you use

2/128 = 1/64 of your CPU’s total resources (including the automatic “HyperThreading”).

In an ordinary C/C++ program you can only run code that uses the CPU as
the computing resource.
If people really took advantage of their cores and threading capabilities, this would
probably be enough for most regular applications, but for applications that does a lot of
heavy calculations, like video / image processing or 3D graphics it’s way better if you could
offload some of these tasks to the simpler (in terms of instructions), but well capable GPU(‘s) in your machine.

One way to do this is through the use of CUDA extensions.

In this model, the CPU is considered the “HOST” and each GPU is a “DEVICE”
in your system that can be used for doing calculations.
When such a program is compiled, instructions for both the HOST and any DEVICE
is created.
In CUDA the GPU/DEVICE is seen as a “CO-PROCESSOR” to the CPU/HOST.
The processor also assumes that the HOST and DEVICE has access to separate physical
memory where they can store data.
The DEVICE memory is typically a very high-speed block of memory, faster than the one
on the HOST.

The HOST is “In charge” in CUDA and sends messges to the DEVICE telling it what to do.
The HOST keeps track of:

Moving data:
1. From CPU memory -> GPU memory
2. Grom GPU memory -> CPU memory
CUDA’s version of C’s memcpy() is cudaMemcpy()
3. Allocating GPU memory
Again CUDA uses cudaMalloc() instead of malloc()
4. Launch “kernel” on GPU (in CUDA, the HOST launches “kernels” on the DEVICE)

A Typical flow in a CUDA Application would be something like:

1. CPU runs cudaMalloc on GPU
2. CPU copies input data from CPU->GPU with cudaMemcpy
3. CPU launches the transfered “kernels” on GPU (kernel launch)
4. CPU copies results back with cudaMemcpy

So, what is this “Kernel” stuff all about?

Guess we’ll find out in part 2 of this series…