Parallel Computing With CUDA Extensions (Part 2)

Parallel Computing With CUDA Extensions Part 2

A “kernel” in CUDA terms can be thought of as a series of instructions to be carried out
by a computation unit on a GPU. Basically a regular program.

1. You write kernels / code as if it where regular serial (top to bottom) programs, like
those designed to run on a single thread.

2. You tell the GPU to launch this code, and it will do so on multiple threads. (you can define how many)

But won’t this lead to the same code being executed multiple times?
Yes, it will, but that will be explained later.
If you write code to output “Hello World” and ask the GPU to run this on 50 threads,
you’ll be greeted back 50 times, and it will all happen in parallel.

But let’s say that your program code contains a vector of 50 floats.
For each of those floats you want something done, the same way, as fast as possible..
You tell the GPU to spawn 50 threads to work on this kernel (program code).

Inside the kernel when run on the GPU, each thread can connect to one vector member, and have full control over which thread works with what member of the vector.

Each thread starts doing the work as instructed in the code received from the CPU.

As an example, let’s say that when running the code on a CPU only, the CPU would have to traverse the vector members one by one, do the job needed, and continue on to the next member to do the same.
The total execution time for the task would vary based on how busy the CPU was and other factors, but let’s assume that we have 50 members that each needs 10 ms to be processed.

Easily this would take 50 x 10 ms (500 ms) to complete, as we work in a non-parallel way.
If we compare this to how the execution would be done in a GPU assisted way,
the time needed to process each element might be a bit higher, because of the general fact that the working unit will not be as fast as a regular CPU thread, so let’s say 20 ms per member.

The difference is that because these tasks are all started in parallel, they would finish processing the whole vector of 50 members in just 20 ms compared to the CPU, that would need to use 10 ms x 50 members, giving us 500 ms!

To not loose focus, it might help
to visualize situations in programming that could benefit from being able to do several equal tasks at the same time.

One thing that comes to my mind is in image editing applications. When you have an image consisting of millions of pixels, there will be several thousand pixels that share the same characteristics / properties, like color and brightness.
If you where to write a function to lighten or change the color of all those equal pixels, you’d basically have a job that could benefit from being executed simultaneously, rather than doing the same thing to each pixel in a linear fashion.

Usually, when programming using only the CPU, launching and running threads in parallel is considered an expensive and cumbersome activity.
The whole point of using the GPU as a processing unit for “regular” tasks is that it’s very good at certain things, like these two:

1. Launch a lot of threads (and “a lot” is MANY, think thousands)
2. To actually run these threads in parallel

So GPU’s makes perfect candidates for doing the kind of processing that’s lacking in regular CPU’s.

For those learning about programming, maybe as a student or on their own, I seriously believe that there will be heavy demand for competent C/C++ programmers that knows how to program using GPU assistance soon, and also into the unforeseeable future.

C and C++ might be lower-level than the languages you find most comfortable to use, but the truth is that even though these statically typed compiled languages has experienced a drop in general interest the last ten years, they’re now on the rise again thanks to technologies like this and because of the importance of power consumption / watts per cycle on modern handheld devices.

C++ is the most efficient language to use for low power consumption devices (if done right) compared to any other high-level language in existence today, and many large companies invests huge sums of money to the driving forces behind these languages now.

The future is mobile and the future is (hopefully) green.
To achieve this, we also need to start making software that’s green and environmentally friendly.

I hope this article has made you more interested in learning about GPU assisted processing using tools such as CUDA or OpenCL.

There’s more in the world than an Apple.

Parallel Computing With CUDA Extensions (Part 1)


Parallel Computing With CUDA Extensions (Part 1)

First, let’s see how to rate a CPU in a parallel way of thinking.

Let’s say we have an eight Core Intel CPU.

With eight cores, you can execute 8 operations (Wide AVX vector operations) per core,
and each core has support for running two threads in parallel via Intel “HyperThreading” technology, so you get:

8 cores * 8 operations/core * 2 threads and end up with what’s called
“128-Way Parallelism”

For more about AdvancedVectoreXtentions (AVX) in CPU’s, check this page.

Programming without taking advantage of ANY multithreading / parallel processing
techniques, means that for each program you run, you use

2/128 = 1/64 of your CPU’s total resources (including the automatic “HyperThreading”).

In an ordinary C/C++ program you can only run code that uses the CPU as
the computing resource.
If people really took advantage of their cores and threading capabilities, this would
probably be enough for most regular applications, but for applications that does a lot of
heavy calculations, like video / image processing or 3D graphics it’s way better if you could
offload some of these tasks to the simpler (in terms of instructions), but well capable GPU(‘s) in your machine.

One way to do this is through the use of CUDA extensions.

In this model, the CPU is considered the “HOST” and each GPU is a “DEVICE”
in your system that can be used for doing calculations.
When such a program is compiled, instructions for both the HOST and any DEVICE
is created.
In CUDA the GPU/DEVICE is seen as a “CO-PROCESSOR” to the CPU/HOST.
The processor also assumes that the HOST and DEVICE has access to separate physical
memory where they can store data.
The DEVICE memory is typically a very high-speed block of memory, faster than the one
on the HOST.

The HOST is “In charge” in CUDA and sends messges to the DEVICE telling it what to do.
The HOST keeps track of:

Moving data:
1. From CPU memory -> GPU memory
2. Grom GPU memory -> CPU memory
CUDA’s version of C’s memcpy() is cudaMemcpy()
3. Allocating GPU memory
Again CUDA uses cudaMalloc() instead of malloc()
4. Launch “kernel” on GPU (in CUDA, the HOST launches “kernels” on the DEVICE)

A Typical flow in a CUDA Application would be something like:

1. CPU runs cudaMalloc on GPU
2. CPU copies input data from CPU->GPU with cudaMemcpy
3. CPU launches the transfered “kernels” on GPU (kernel launch)
4. CPU copies results back with cudaMemcpy

So, what is this “Kernel” stuff all about?

Guess we’ll find out in part 2 of this series…

The series on Modern C programming has a new home!

Future content-posts for the series has a new home!

I’m a big fan of Evernote.
I use it for pretty much everything, from recipes, book chapter drafts, digital scanning of all my physical snail mail, project organizing and to-do lists to (from this day) blog editing!

I’ve found a incredibly useful (for me at least) service called Everblog, that lets me write and edit my blog posts straight from within the Evernote application.

This will be a huge timesaver for me, as I use Evernote for all my resource gathering and pre-press work for my planned book, “Real C, Real Value – A modern approach to C programming” and for my other writing projects.

So, all future posts to the series I plan on publishing to the group will live there, and discussions on the topics will live on LinkedIn.
I’ll use that blog and this group for gathering feedback to my articles which again will turn into chapters for the book on C.

My list on planned practical projects for the book is something in the lines of:

1. “A gentle introduction to the C language, C11 style”
2. “Introduction to Socket programming in C”
3. “Cross platform GUI programming in C”
4. “Connecting the sockets and GUI into a usable program”
5. “Introduction to graphics and image processing in C”
6. “Using a game engine”
7. “Writing a simple 2D game with what we’ve learned so far”
8. “C Programming and sound processing”
9. “Adding sound to our game”
10. “Creating a graphical Mp3 player in C”
11. “Platform specific challenges in multi-platform programming”
12. “A theory chapter on algorithms and the Standard C Library”
13. “What’s Next?”

As you can imagine, there is a lot of writing to do, and I need a place to get feedback on the way, hence the LinkedIn group and the blog.

I’m off to Copenhagen for the weekend, but I’ll post the next update on Monday”


C Programming For A Modern Era : Introduction To TCP/IP Sockets Programming

So, this is the first post in a series on programming with C, using the latest C11 standard, officially known as ISO/IEC 9899:2011.

I’ve read many books about programming. Nowadays I mostly read books on topics that don’t deal with specific languages, but with code optimization, refactoring, design patterns, team management and the likes.

But I’ve read plenty of programming language specific books on C, C++, C#, Java, Python, Ruby, Erlang and Perl and common to most of them is that the examples they provide are very arbitrary in the sense that they are not very useful for real world programming.

I believe a book is better written to cover a language using some sort of common theme/application, such as graphics, network or any other field-specific programming topic.
That way you get to introduce the language in a useful context, allowing the reader to immediately produce something useful, instead of creating a poorly implemented generic Customer/Orders based application they will hopefully never use in real life.

So, I’ve started working on a book called “Real C, Real Value” that addresses this issue. I’m prepared to use roughly two years writing it, as I do this on my spare time. I work as a C#/C++ developer and have more than enough to keep me occupied during working hours.

Please send me an e-mail if you’d like to be a technical reviewer or proofreader, as I’m planning to publish this book in electronic format to begin with, contacting a publisher only when and if the book becomes popular.

Well, now let’s get started on the topic of this post: Introduction to TCP/IP Sockets Programming using C11.

Why did I choose this topic?

First off, I’ve worked as a TCP/IP instructor, so the topic interests me. Second, networking is all around us all the time, literally speaking.
Third,  it makes for a good topic to present new language constructs and techniques in the new C11 standard. Fourth, it is of utmost importance to know TCP/IP and Sockets programming (or at least TCP/IP) if you plan on creating applications that share information with others, in other words if you plan on writing anything that does not live in complete isolation.

I’ll use this first post to introduce the TCP/IP protocol for readers not familiar with the basics of networking on a programmatic level. In a sense this post will cover the bare minimum knowledge needed to follow along with this blog series.

It will not contain any real code yet, but I promise that my next post will!

Ok, enough chatter.

Introduction to the TCP/IP networking protocol(s)

I’m not going to write about the whole history of TCP/IP. There are numerous articles on the web that does this in a remarkably good way, such as Wikipedia. I’ll explain the basics of the protocol in words I believe will serve to gain the understanding of the topic needed to follow my posts, nothing more, nothing less.

Please understand that this introduction only serves as a minimum common denominator of knowledge needed to follow my posts, and that each topic will get more substantial when we start working with the elements in code.

The protocol defines two major parts, TCP and IP.
We’ll start off in the back end with IP.

The IP part of TCP/IP

IP stands for “Internet Protocol” and defines the rules for transferring data from one computer to the next. Mark that I say from computer to computer, not from application to application. The latter is the work of TCP.

An IP adress takes the form: in version 4 of the protocol. These are fields of 4 bytes each, totalling 32 bits of actual data per address. the valid ranges for each octet is 0-255. An example IP address would be something like, which is probably the default internal address of your home router/modem used to connect to the internet. Just issue a “ifconfig” request in your *NIX/OS X shell of choice or an “ipconfig” in your cmd prompt on Windows to reveal the information. Here is a restricted sample of my current computers ifconfig:

WorkBook% ifconfig
lo0: flags=8049<up,loopback,running,multicast> mtu 16384
 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
 inet netmask 0xff000000
 inet6 ::1 prefixlen 128
en1: flags=8863<up,broadcast,smart,running,simplex,multicast> mtu 1500
 ether 00:26:bb:0b:c4:65
 inet6 fe80::226:bbff:fe0b:c465%en1 prefixlen 64 scopeid 0x5
 inet netmask 0xffffff00 broadcast
 media: autoselect
 status: active

This illustrates that my local IP address on the active interface is The 192.168.x.x  and 10.x.x.x series of addresses are part of a special range of addresses known as “private” addresses. These addresses are not routable on the internet, but works fine on a local network such as behind a router or ADSL/Cable connection.

The reason why you can connect to the internet when your computer has a private address is because your router/modem acts as a “forwarder” of the data sent from your computer.
It accomplishes this via something called “Network Address Translation” or NAT for short. Your router will have an external, valid internett address obtained from your ISP while connecting and it uses that address to forward any messages coming from its local network. Your local address is kept in a lookup table on the router, so that more than one machine on the same local network may use the same router/modem.
A router uses an IP -> MAC address mapping internally to make sure that the data is sent to the correct receiver, given that a machine on the local network could in theory change IP address from the time a request is made and to the time that the router receives the reply. Though this rarely happen, it is a possibility that the router must be prepared to deal with.

So what is this MAC address? It is NOT an Apple product. It stands for Media Access Control, and is a “unique” series of numbers identifying each physical network interface (think USB Wireless, Network Card etc) that is hard-coded into each interface. From my sample ifconfig output, you can see that my MAC address is:

ether 00:26:bb:0b:c4:65

Minus the “ether” this is a HEX address defining this interface uniquely.

So a mapping in the router may be visualized as: -> 00:26:bb:0b:c4:65

This mapping is kept in something called an ARP table, short for Adress Resolution Table.

The other address worth noting from the ifconfig output is the address.
This is the “Local Loopback” address, another special IP address that’s not routable on the internet.
This address serves internal purposes for the protocol and will be explained in greater extent when we start programming. For now remember that: ANY DATA SENT TO THE LOOPBACK ADDRESS IS IMMEDIATELY RETURNED TO YOUR LOCAL MACHINE, without involving the router.
it is present on every host and can be used even when a computer has no other interfaces (i.e. is not connected to the network). The loopback address for IPv6 is 0:0:0:0:0:0:0:1 (or just ::1).

A related class contains the link-local, or “auto-configuration” addresses. For IPv4, such
addresses begin with 169.254. For IPv6, any address whose first 16-bit chunk is FE80, FE90,
FEA0, or FEB0 is a link-local address. These addresses can only be used for communication
between hosts connected to the same network; routers will not forward packets that have such
addresses as their destination.

IP keeps data flowing from one location to another via connected routers, that again connects local networks. In general this is known as WAN’s (Wide Area Networks) and LAN’s (Local Area Networks).

Here is a visual representation of the general dataflow:


An IP data packet consists of multiple parts that identifies various properties of the data it carries and about the package itself. It has (among others) a header, body and tail part that identifies where the packet comes from (origin), what the package contains (body) and where it’s headed (destination) among other things. Here is a visual representation of the IP packet and its parts:


The protocol version described here is V4 of TCP/IP.
As you can see from my sample ifconfig output there is now a new kid in town called IP V6. It has a completely new implementation and is still in its early stages regarding widespread use.
But, every device created the last years include support for the V6 of the protocol, so it’s pretty evident that this will be the “next big thing” to happen to TCP/IP.

We’ll cover IP V6 in later posts, but a little bit of information is provided here.

The V6 version of IP has a different addressing scheme, and looks something like this:


IP V6 uses 16 bytes for addressing and is by convention represented as groups of hexadecimal digits, separated by colons.
Each group of digits represents 2 bytes (16 bits) of the address, totalling 64 bits. leading zeros may be omitted.

We’ll get much deeper into IP when we start programming, but this should serve as a sound basis.

The TCP (and UDP + SCTP) parts of TCP/IP

TCP stands for Transmission Control Protocol, and is responsible for the accurate delivery of data between applications running on different computers. UDP stands for User Datagram Protocol and is also responsible for delivering data, but does not contain any error checking, so it accepts package drops, something TCP does not. UDP is suitable for things like streaming video, as the packages received are not dependent on each other for delivering video. If some packages drop, you’ll just end up with a video image of slightly poorer quality.

The newer Stream Control Transmission Protocol (SCTP) is also a reliable, connection-oriented transport mechanism. It is message-stream-oriented (not byte-stream oriented like TCP) and provides multiple streams multiplexed over a single connection.

Both TCP and UDP use addresses, called port numbers, to identify applications within hosts. TCP, SCTP and UDP are called end-to-end transport protocols because they carry data all the way from one program to another (whereas IP only carries data from one host to another)

The main thing relating to TCP/IP you need to understand from this post is that it takes two parts for successfully communicating between computer programs, a network address and a port number (plus off course, the data to transfer).

Here’s a visual representation of the packets involved and their content placeholders:

Just to make things clear:

TCP is a lot more than what I’ve covered here, but this is the basic understanding needed to program with the C Sockets API.

For completeness I include this visual view of the TCP HeaderPacket:

I hope that you’ve found this post interesting and somewhat informative.

It may look complicated at first, but it’s really not that bad. Especially compared to 3D programming with advanced linear algebra and stuff like that.

In my next post I’ll cover some basics of Sockets and how to start using the C Sockets API.

Post Script:

I found this explanation for “What happened to TCP/IP 5” after some google’ing:

IPng, Internet Protocol next generation, was conceived in 1994 with a goal for implementations to start flooding out by 1996 (yeah, like that ever happened). IPv6 was supposed to be the “god-send” over the well-used IPv4: it increased the number of bytes used in addressing from 4 bytes to 16 bytes, it introduced anycast routing, it removed the checksum from the IP layer, and lots of other improvements. One of the fields kept, of course, was the version field — these 8 bits identify this IP header as being of version “4″ when there is a 4 in there, and presumably they would use a “5″ to identify this next gen version. Unfortunately, that “5″ was already given to something else.

In the late 1970’s, a protocol named ST — The Internet Stream Protocol — was created for the experimental transmission of voice, video, and distributed simulation. Two decades later, this protocol was revised to become ST2 and started to get implemented into commercial projects by groups like IBM, NeXT, Apple, and Sun. Wow did it differ a lot. ST and ST+ offered connections, instead of its connection-less IPv4 counterpart. It also guaranteed QoS. ST and ST+, were already given that magical “5″.