Thread Parallelism (Part 1): Don’t Write a Tasking System

My last post discussed how it’s better to avoid human inlining in order to make code clearer, with help from a talk Sean Parent (Adobe) gave at Going Native 2013. This post is the first of a 3-part series on thread parallelism with off-the-shelf tasking systems. The other parts will look at ways of making your code agnostic to any one of them and some brief performance results in OSPRay.

That same talk had multiple parts; here I’d like to focus on the section discussing the guideline to have “no synchronization primitives“. Instead of rehashing the same things that are in that talk, I’ll spin it in a way which hits closer to home…don’t write a tasking system.

…but why?

Now that may feel a bit harsh to some of you at first, and I know that there are lots of people out there capable of writing good tasking systems. However, the point I’m going to make comes from a simpler goal: there’s tremendous value in seeking out what solutions already exist for a problem and seeing whether they are sufficient. If you take that goal to heart, you’ll discover some great libraries out there ready to use, and you’ll gain insight into what others have already tried.

Don’t get me wrong, there are good reasons to write something yourself. One of the most obvious is that you may want to do it for the learning experience. Chapter 6 of Robert Martin’s book The Clean Coder makes some great points on the value of practicing the act of programming. In my eyes, writing a solution to a problem you have is great practice for seeing both the needs of the user (your code which needs the library) and the design decisions of the implementation. However, make sure that the code you produce doesn’t blind you from making better engineering decisions about the other code you work on. Value practice as a way to become a better coder; don’t let it become an excuse to use your own solution over another, especially if an existing library has obvious benefits.

You may also consider writing your own solution because of licensing issues, or because the organization you work in mandates that you can’t use outside libraries. Those situations are out of your control, so they aren’t a part of this discussion.

I’ll get to the crux of the argument by looking at it through the lens of economics. Whether you see time itself as a currency (i.e. you work on code on your own time) or view time as having a monetary value attached to it via your employer (i.e. it’s your job to write code), the economics of using other people’s code are compelling. You add maximal value to a code base per unit of time by minimizing the amount of complexity you must deal with. Thus I’m arguing that code which needs a tasking system adds maximal value when you’ve spent minimal time adopting an existing one. If you’re skeptical and think you can write one faster than you can pick up an existing one, you may be in for a surprise!

When should I write one?

If you’ve carefully considered existing solutions and ruled them out for technical reasons, then BRAVO…you’ve done a wonderful thing! The suggestion not to write a tasking system is a default position for the masses, not a rule to follow blindly with no exceptions.

To me, the best evidence against using an existing solution is concrete measurements showing an insufficiency. Too many times I’ve heard engineers make statements like “I wrote it so it’s good” or “I’ve spent so much time optimizing the code, it can’t go any faster”. Heck, I’ve said those things myself…I think we’ve all been in situations where we want to defend our own code before considering someone else’s work. At the end of the day, however, numbers don’t lie: whether you show run-time performance measurements or the lines of code saved by using a library, they are something everyone can examine together.

Defining what we are after…what is a ‘tasking system’?

Multi-threaded programming has been the way to get maximal performance out of CPU code ever since multi-core CPUs became mainstream (at least 10 years now). Hartmut Kaiser (Stellar Group @ LSU, HPX) gave a great talk at Meeting C++ 2014 discussing why threads are too low-level an abstraction for parallelism (talk on YouTube here). I think that thread abstractions, such as std::thread, can be good for some concurrency problems, but that will be a discussion for another time.

I define ‘tasking system’ as “a pool of worker threads with an interface for scheduling work on them”. There are some distinct advantages to this. First, you can better manage the system’s resources by only having as many threads as there are cores on the machine, which is much more efficient than oversubscribing the machine with too many OS threads. Second, you end up with much more expressive interfaces for saying in code where the parallelism is exploited. Just like it’s better to factor out loops and give them a name for the algorithm they implement (see last post), it’s better for code to simply say what’s going to be executed in parallel on the thread pool.
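
To make the oversubscription point concrete, here’s a minimal sketch of the anti-pattern a thread pool avoids: spawning one OS thread per work item (computeIteration() and NUM_ITERATIONS are simply stand-ins for whatever per-iteration work you have).

#include <thread>
#include <vector>
// Anti-pattern: one OS thread per work item. Once NUM_ITERATIONS exceeds the
// core count, the OS is juggling far more threads than it can actually run at once.
void oversubscribed()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < NUM_ITERATIONS; ++i)
    threads.emplace_back([i] { computeIteration(i); });
  for (auto &t : threads)
    t.join();
}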

The most fundamental abstraction that a tasking system provides is the concept of a ‘parallel-for’ loop. James Reinders (formerly Intel) said it himself in a talk on structured parallel programming: “you can build anything with a parallel-for” (talk at the time of the quote on YouTube here, I recommend watching the whole thing). If you think about it, this makes sense: most algorithms are implemented in terms of a for-loop, so parallel algorithms can be written in terms of a parallel-for.

Specifically, what a parallel-for implements is fork-join parallelism. Fork-join means that at a particular point in the code, a set of tasks is split off (forked) from the executing thread and executed on the thread pool’s worker threads. After all the tasks have completed (joined), execution continues on the main thread. There are other forms of parallelism, but this is the fundamental piece each solution provides.

The tasking systems I’ll mention here all implement a parallel-for loop abstraction, some with slightly different spellings, but all implementing the same logical construct.

Some tasking systems to try…

There are 4 tasking system solutions I’ve used with success: Intel’s Threading Building Blocks (TBB), Cilk Plus, OpenMP, and Apple’s libdispatch. I also know that Microsoft’s Parallel Patterns Library (PPL) is another solution which is built into Visual Studio, but I haven’t played around with it at all as I mostly develop on Linux. My understanding is that it is comparable to TBB as a library solution, and Windows folks should definitely have a look at it.

Again, each has the concept of a parallel-for construct, but each also has additional features for both concurrency and parallelism. The differences in supported features beyond parallel-for may give you reason to choose one solution over another, but I’ll start with their common component.

Given the for-loop:

for (int i = 0; i < NUM_ITERATIONS; ++i)
  computeIteration(i);

…each tasking system’s version of parallel-for is as follows:

// TBB
tbb::parallel_for(0, NUM_ITERATIONS, [](int i) { computeIteration(i); });

// Cilk+
cilk_for (int i = 0; i < NUM_ITERATIONS; ++i)
  computeIteration(i);

// OpenMP
#pragma omp parallel for
for (int i = 0; i < NUM_ITERATIONS; ++i)
  computeIteration(i);

// libdispatch
dispatch_apply_f(NUM_ITERATIONS,
                 dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                 nullptr,
                 [](void *, size_t i) { computeIteration(static_cast<int>(i)); });

NOTE: dispatch_apply_f() expects a plain function pointer with a (void *context, size_t index) signature rather than an arbitrary lambda, which is why the lambda above takes an unused context argument. A captureless lambda (or an ordinary free function) matching that signature is all the extra fluff it takes to do the same thing.

Hopefully none of those are terribly difficult to understand, with libdispatch having the busiest interface. In terms of using an existing tasking system, it’s really that easy to use them, and with good performance!

Once you step beyond a parallel-for, however, things diverge fairly quickly: synchronization constructs, asynchronous function execution, futures, etc. Each one of the above solutions is worth at least an entire post, but I’ll leave it up to you to explore which suits your needs best. However, I’ll give you my quick $0.02 on how I view them, which is hopefully helpful.

Threading Building Blocks

TBB is an open-source, cross-platform C++ library put out by Intel. It now uses the very permissive Apache 2.0 license, making it easy to add to existing (even proprietary) projects. Pre-built binaries are available for Linux, macOS, and Windows, and TBB can be found in most Linux distributions’ package managers and easily installed on macOS via Homebrew.

TBB uses a fairly sophisticated work-stealing scheduler to keep tasks evenly balanced across its worker threads. In my experience, TBB does a fantastic job of keeping all your cores busy, without the user having to figure out how to schedule work “just right”.

TBB has several levels of abstraction: low-level threads and synchronization primitives, thread pools and task arenas, and high-level algorithms (e.g. parallel-for). I’ve found that TBB has the most to work with, ranging from a ton of useful parallel algorithms to letting users customize tasks and task arenas to do very specific things. However, in the ray tracing project I work on for my day job (OSPRay), TBB’s parallel-for does all of our thread-based heavy lifting and does it faster than anything else I’ve tried…which includes a well-optimized tasking system we had internal to the project!
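
As a small taste of those higher-level algorithms, here’s a sketch of a parallel sum written with tbb::parallel_reduce (the parallelSum() function and the std::vector input are purely illustrative, not anything out of OSPRay):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <vector>
// Sum a vector in parallel: TBB recursively splits the range into chunks,
// reduces each chunk on a worker thread, then combines the partial sums.
float parallelSum(const std::vector<float> &data)
{
  return tbb::parallel_reduce(
      tbb::blocked_range<size_t>(0, data.size()),
      0.f,
      [&](const tbb::blocked_range<size_t> &r, float partial) {
        for (size_t i = r.begin(); i != r.end(); ++i)
          partial += data[i];
        return partial;
      },
      std::plus<float>());
}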

Cilk Plus

Cilk+ is a solution whose underlying technology originally came out of MIT and was eventually acquired by Intel. Cilk+ is implemented as an extension to C++, providing extra keywords and built-in functions to schedule work on a thread pool.

Because Cilk+ is built into the compiler, it’s very easy to start using. It was first added to gcc in 4.8, and while it’s not guaranteed to be built into the gcc shipped by a Linux distro, I’m not sure I’ve ever seen a distro gcc without it (CentOS/RHEL 7, Ubuntu, and Arch all have it). On gcc, simply add the “-fcilkplus” compiler flag and you’re good to go! There has also been work on bringing Cilk+ support to Clang/LLVM.
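
Beyond cilk_for, the extension’s keywords also let you fork and join individual function calls. Here’s a tiny sketch using the classic recursive Fibonacci toy (nothing from a real project), compiled with the same “-fcilkplus” flag:

#include <cilk/cilk.h>
// cilk_spawn forks the call onto the worker pool; cilk_sync joins before
// the results are combined.
int fib(int n)
{
  if (n < 2)
    return n;
  int a = cilk_spawn fib(n - 1); // may be stolen by another worker
  int b = fib(n - 2);            // the current thread keeps working
  cilk_sync;                     // wait for the spawned call to finish
  return a + b;
}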

It’s also important to note that Cilk+ implements a sophisticated work-stealing scheduler for load balancing, so I’ve seen it run almost as fast as TBB. Perhaps TBB has simply been getting more optimization attention, but I don’t actually know.

OpenMP

In general, OpenMP is probably one of the most popular ways to get a “free” tasking system. It relies on code annotations to tell the compiler what to parallelize, which is attractive to programmers with non-trivial code bases that were not built with parallelism in mind. OpenMP also works for C and Fortran, making its audience larger than just the C++ community; in fact, C++ support came after those languages, as OpenMP was originally made for HPC machines. Furthermore, annotations make code more portable: a compiler that doesn’t know about #pragma omp constructs can simply ignore them and still generate a correct (yet slow) program.

OpenMP is super easy to work with as it, too, is built into the compiler. It’s been in gcc for ages now, and the parallel-for was one of the first constructs in the OpenMP standard. With gcc, simply add the “-fopenmp” compiler flag and have at it!

There’s one important caveat to the OpenMP parallel-for I wrote above: if your problem needs dynamic load balancing, you must add an additional clause. To achieve this, you would write it as:

// OpenMP
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < NUM_ITERATIONS; ++i)
  computeIteration(i);

This will do automatic load balancing at run-time instead of the default behavior of statically assigning loop iterations to worker threads. Again, OpenMP is way more in-depth than this, but that’s an easy “gotcha” if you’ve never used OpenMP before.
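
Another clause worth knowing about early on is reduction, which covers the common case of combining per-iteration results without any manual locking. A quick sketch (computeValue() is a hypothetical function returning a per-iteration result):

// OpenMP reduction: each thread accumulates into a private copy of 'sum',
// and the copies are combined when the loop joins.
float sum = 0.f;
#pragma omp parallel for reduction(+:sum) schedule(dynamic)
for (int i = 0; i < NUM_ITERATIONS; ++i)
  sum += computeValue(i);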

libdispatch

I am very new to using libdispatch. In fact, while I was working on my last post, I heard Sean Parent mention libdispatch several times, so I decided to go find out what it was all about. While I haven’t played with it much, my naive use of it seems to run as fast as OpenMP in OSPRay. If you are on macOS (or Linux too) and need a worthy tasking system, it seems to do the job just fine.
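
For completeness, here’s a sketch of the block-based spelling, dispatch_apply(), which skips the function-pointer dance entirely. It assumes Apple’s toolchain, where clang enables the blocks extension by default (elsewhere you’d need -fblocks plus a blocks runtime):

#include <dispatch/dispatch.h>
// dispatch_apply() takes a block instead of a function pointer, so the
// per-iteration work can be written inline.
void runAllIterations()
{
  dispatch_apply(NUM_ITERATIONS,
                 dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                 ^(size_t i) { computeIteration(static_cast<int>(i)); });
}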

Final thoughts

I hope your interest has been piqued to investigate what tasking systems exist out in the wild. Each presents its own set of advantages, whether it be a large and expressive feature set or ease of access through a simple compiler flag. I hope the general point isn’t lost on anyone: please check out libraries which solve common problems, as they will likely be as good as, or better than, what you could reasonably write yourself. I’m happy for those of you who have carefully ruled them out for your code, but I really believe that’s a smaller percentage of us than we might think.

The next part of this series will look at how you can wrap the concept of a parallel-for in your code, where you can then substitute different tasking systems without the need to change any of your code which calls it…pretty cool, eh?

Until then, happy coding!
