Thread Parallelism (Part 2): A Simple Parallel-For Wrapper

Last week I wrote about why you should consider using an existing solution for a tasking system before you embark on building one. If you have not read that post yet, I encourage you to do so, and to watch the associated Sean Parent talk that helped inspire it (“no synchronization primitives”).

In this post, I am going to show you how to wrap all the parallel-for variants listed last week in a way that lets you substitute different tasking systems without needing to change your code…pretty cool!

Why would I wrap parallel-for?

I think the biggest reason to wrap parallel-for is flexibility in parallelized libraries, as I approach the topic as a library writer. My day job project (OSPRay) is an open source ray tracing library primarily for Intel CPUs (including Xeon Phi). The project aims specifically at interactively rendering scientific visualizations, with the ability to scale to any number of CPU cores and even across multiple nodes on HPC cluster machines. To best compose with other multi-threaded applications, it is important that OSPRay does not collide with a tasking system already used by the calling application. For example, if the application is already using OpenMP, we want the ability to “hook into” the existing OpenMP thread pool. Furthermore, OSPRay already has this problem itself, as it relies on Embree for ray intersection, which by default uses TBB.

Another, less common reason to wrap parallel-for is experimentation. I think code which is easy to “play with” exhibits good design characteristics: good abstractions enable code to be reconfigurable and reusable without “getting in the way”. A wrapper also makes it much easier to compare against existing tasking systems if you end up writing one yourself. Should you embark on writing a tasking system and your code follows the form of a parallel-for, add yours underneath the wrapper too…and please share if you get good results!

The last reason harks back to my first technical post on avoiding human inlining. The idea of a parallel-for is generic: I want each iteration of the given for-loop to run in parallel. Thus a wrapper lets the caller say “schedule these iterations in parallel” in a uniform way. Unfortunately we do not have consensus among the aforementioned tasking systems on how parallel-for is spelled, nor do we yet have a parallel-for in the C++ standard. Yes, I do know that there’s the parallel STL TS currently in progress, but it mandates the use of containers which legacy code bases may not be prepared for quite yet.
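In other words, regardless of the backend, every call site should read something like this (the names here are purely illustrative):

// Run each loop iteration as a parallel task, whatever the backend.
parallel_for(numItems, [&](int i) {
  results[i] = doWork(i); // 'results' and 'doWork' are hypothetical
});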

The parallel_for() interface

The interface to a parallel_for() is fairly straightforward, as I’ve written it in OSPRay as follows:

#include "TaskingTypeTraits.h"
#include "parallel_for.inl"

template <typename TASK_T>
inline void parallel_for(int nTasks, TASK_T&& fcn)
{
  static_assert(has_operator_method_with_integral_param<TASK_T>::value,
                "parallel_for() requires the implementation of "
                "'void TASK_T::operator()(P taskIndex)', where P is of "
                "type unsigned char, short, int, unsigned, long, or size_t.");

  parallel_for_impl(nTasks, std::forward<TASK_T>(fcn));
}

NOTE: I know that this interface assumes a starting index of 0. It is trivial to add other overloads that let you specify a ‘start_index’ and ‘end_index’, and even a ‘chunk_size’. If you need those, simply have the above version call into parallel_for(int start, int end, int chunk_size, TASK_T&&) with the appropriate default values.
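For illustration, that more general overload might look something like the following sketch (my own, untested, and not OSPRay’s actual code), with the two-argument version above forwarding to it as parallel_for(0, nTasks, 1, fcn):

#include <algorithm> // for std::min

template <typename TASK_T>
inline void parallel_for(int start, int end, int chunk_size, TASK_T&& fcn)
{
  // Schedule one task per chunk; each chunk runs its iterations serially.
  const int nChunks = (end - start + chunk_size - 1) / chunk_size;
  parallel_for_impl(nChunks, [&](int chunkIndex) {
    const int begin = start + chunkIndex * chunk_size;
    const int stop  = std::min(begin + chunk_size, end);
    for (int i = begin; i < stop; ++i)
      fcn(i);
  });
}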

Well then, it really is that simple! However, if I left it at just the above code, this discussion would be super boring. Let’s look at the two parts: 1) the parallel_for_impl() function and 2) the has_operator_method_with_integral_param type trait.

The parallel-for implementation

The reason to hide the implementation inside another function, parallel_for_impl(), is that it is visually busy and not something a caller needs to be concerned with. Here’s my version of it:

#include <utility>

#ifdef OSPRAY_TASKING_TBB
#  include <tbb/parallel_for.h>
#elif defined(OSPRAY_TASKING_CILK)
#  include <cilk/cilk.h>
#elif defined(OSPRAY_TASKING_INTERNAL)
#  include "ospcommon/tasking/TaskSys.h"
#elif defined(OSPRAY_TASKING_LIBDISPATCH)
#  include "dispatch/dispatch.h"
// libdispatch exposes a C API, so we need a free-function trampoline that
// casts the context pointer back to the task object and invokes it.
template <typename TASK_T>
inline void callFcn_T(void *_task, size_t taskIndex)
{
  auto &task = *((TASK_T*)_task);
  task(taskIndex);
}
#endif

template<typename TASK_T>
inline void parallel_for_impl(int nTasks, TASK_T&& fcn)
{
#ifdef OSPRAY_TASKING_TBB

  tbb::parallel_for(0, nTasks, 1, std::forward<TASK_T>(fcn));

#elif defined(OSPRAY_TASKING_CILK)

  cilk_for (int taskIndex = 0; taskIndex < nTasks; ++taskIndex) {
    fcn(taskIndex);
  }

#elif defined(OSPRAY_TASKING_OMP)

# pragma omp parallel for schedule(dynamic)
  for (int taskIndex = 0; taskIndex < nTasks; ++taskIndex) {
    fcn(taskIndex);
  }

#elif defined(OSPRAY_TASKING_INTERNAL)

  struct LocalTask : public Task {
    const TASK_T &t;
    LocalTask(const TASK_T& fcn) : Task("LocalTask"), t(fcn) {}
    void run(size_t taskIndex) override { t(taskIndex); }
  };

  Ref<LocalTask> task = new LocalTask(fcn);
  task->scheduleAndWait(nTasks);

#elif defined(OSPRAY_TASKING_LIBDISPATCH)

  dispatch_apply_f(nTasks,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT,
                                             0),
                   &fcn,
                   &callFcn_T<TASK_T>);

#else // Debug (no tasking system)

  for (int taskIndex = 0; taskIndex < nTasks; ++taskIndex) {
    fcn(taskIndex);
  }

#endif
}

NOTE: C++17’s new constexpr if could clean this up a lot by removing the need for the preprocessor here. However, OSPRay requires C++11 for now, so that cleanup will have to wait.
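Just to give a flavor of that cleanup, here is a rough sketch of what a C++17 version might look like (the Backend enum and kBackend constant are hypothetical names of mine, and the headers for the chosen backend would still need to be included):

enum class Backend { TBB, Serial };
constexpr Backend kBackend = Backend::TBB; // would be set by the build system

template <typename TASK_T>
inline void parallel_for_impl(int nTasks, TASK_T&& fcn)
{
  if constexpr (kBackend == Backend::TBB) {
    tbb::parallel_for(0, nTasks, 1, std::forward<TASK_T>(fcn));
  } else {
    for (int taskIndex = 0; taskIndex < nTasks; ++taskIndex)
      fcn(taskIndex);
  }
}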

While the code looks ugly, the general form of the implementation is fairly simple: the application defines the appropriate preprocessor symbol (assuming all include paths and link dependencies are met by the build system) to get the desired implementation. You’ll also notice that OSPRay kept its old tasking system around, now callable with the same syntax as all the other options. This means we always have the option of a fully controlled tasking system when we need it, though it hasn’t been used for quite a while now (TBB remains the best performer for us).

The other cool thing to notice is that there is a ‘debug’ option. This can be very valuable when debugging, because it is sometimes easier to examine a stack trace that is both guaranteed to be deterministic and does not pass through a tasking system’s API layer. This is almost, but not quite, the same as setting your tasking system’s thread pool size to 1, where the work is still run by a thread managed by another library’s API. Thus the wrapper also lets you debug your code with no tasking system at all, with no added headache for callers of parallel_for()…yay!

Checking the task’s operator() signature

When I wrote this code, I really wanted to heed Scott Meyers’ advice on interfaces: make them easy to use correctly and hard to use incorrectly. For parallel_for(), that means making sure that TASK_T::operator() has the correct signature at compile time. If it doesn’t, I want to provide a precise error message that guides users to the right fix.

As a side note, this brings up an interesting topic that may warrant a future post of its own: it annoys me when generic code that takes a closure does not give an error message that helps me figure out what to do when I get the closure’s signature wrong. In this case, parallel_for() expects TASK_T::operator() to take a single argument that satisfies the std::is_integral type trait. I am not a template meta-programming expert by any stretch of the imagination, but I did get something working. The type trait has_operator_method_with_integral_param builds on another trait I created (has_operator_method) and does the parameter checking.

Let’s start with the has_operator_method type trait:

#include <type_traits>

template <typename T>
struct has_operator_method
{
  template <class, class> class checker;

  // Chosen only when '&C::operator()' is a well-formed expression...
  template <typename C>
  static std::true_type test(checker<C, decltype(&C::operator())> *);

  // ...otherwise overload resolution falls back to this catch-all.
  template <typename C>
  static std::false_type test(...);

  using type = decltype(test<T>(nullptr));
  static const bool value = std::is_same<std::true_type, type>::value;
};

The gist is that we do SFINAE using an overloaded function test(), which returns either std::true_type or std::false_type depending on whether the passed-in type has an operator(). If that makes your head spin a bit, that’s ok…I stole/adapted the above from a blog post I found while searching the web for a solution. If you want more of the gory details, read that post.
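As a quick sanity check, here is how the trait behaves on a couple of hypothetical types:

struct Callable    { void operator()(int) const {} };
struct NotCallable { };

static_assert(has_operator_method<Callable>::value,
              "types with an operator() are detected");
static_assert(!has_operator_method<NotCallable>::value,
              "types without an operator() are rejected");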

Next comes figuring out a way to check if operator() takes a single, integral argument. In the same file as the first type trait, I added the following:

template <typename T>
struct has_operator_method_with_integral_param
{
  template <typename P>
  using t_param    = void(T::*)(P) const;
  using byte_t     = unsigned char;
  using operator_t = decltype(&T::operator());

  using param_is_byte     = std::is_same<t_param<byte_t>  , operator_t>;
  using param_is_short    = std::is_same<t_param<short>   , operator_t>;
  using param_is_int      = std::is_same<t_param<int>     , operator_t>;
  using param_is_unsigned = std::is_same<t_param<unsigned>, operator_t>;
  using param_is_long     = std::is_same<t_param<long>    , operator_t>;
  using param_is_size_t   = std::is_same<t_param<size_t>  , operator_t>;

  static const bool value = has_operator_method<T>::value &&
    (param_is_byte::value     ||
     param_is_short::value    ||
     param_is_int::value      ||
     param_is_unsigned::value ||
     param_is_long::value     ||
     param_is_size_t::value);
};

NOTE: I’ve realized that this only accepts callable objects for TASK_T and not function pointers. You could probably use std::enable_if to handle the case where TASK_T is a function pointer rather than a callable object, but I will leave that up to y’all to figure out. Alternatively, you could skip checking TASK_T’s signature entirely, but that would defeat the purpose of checking in the first place…
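If you want to chase that down, one simple direction (a sketch of mine, not OSPRay’s code) is to add a plain non-template overload for the common function-pointer case, which overload resolution prefers over the template on an exact match:

// Hypothetical overload: function pointers taking an int bypass the
// operator() signature check entirely.
inline void parallel_for(int nTasks, void (*fcn)(int))
{
  parallel_for_impl(nTasks, fcn);
}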

Unfortunately, I was unable to figure out how to extract the type of the first parameter of TASK_T’s operator(), which would have let me use std::is_integral to do the check directly. Instead I took the brute-force approach of checking against a set of exact types. There do exist Boost type traits for functions, but they do not work for methods on a class. If you know how to solve this problem, please share how! I may do a brief follow-up post with a revised version of the above trait class.
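That said, one possible approach (an untested sketch, not what OSPRay uses today) is to partially specialize on the pointer-to-member-function type and let template deduction pull the parameter type out:

// Sketch: deduce operator()'s parameter type, then apply std::is_integral.
template <typename T>
struct first_param_of; // primary template intentionally undefined

template <typename C, typename R, typename P>
struct first_param_of<R (C::*)(P) const> // const operator() (e.g. lambdas)
{
  using type = P;
};

template <typename C, typename R, typename P>
struct first_param_of<R (C::*)(P)> // non-const operator()
{
  using type = P;
};

template <typename T>
struct takes_integral_param
  : std::is_integral<typename first_param_of<decltype(&T::operator())>::type>
{
};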

The basic idea behind the brute-force trait above is that we can directly compare decltype(&T::operator()) against a few known pointer-to-member types. Thus the value of the trait ends up being the combination of has_operator_method and the method type comparisons. It is not super comprehensive, but it covers the needs of people working inside OSPRay right now.
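Put together, the check behaves like this on a couple of illustrative lambdas:

auto good = [](int taskIndex)     { (void)taskIndex; };
auto bad  = [](float notAnIndex)  { (void)notAnIndex; };

static_assert(has_operator_method_with_integral_param<decltype(good)>::value,
              "an integral parameter passes the check");
static_assert(!has_operator_method_with_integral_param<decltype(bad)>::value,
              "a floating-point parameter is rejected");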

An example of parallel_for() in practice

Now that we have the parallel_for() wrapper, let’s look at an example which uses it in OSPRay. Whenever a frame is rendered, we divide the frame into 64×64 pixel tiles, then divide each tile again into smaller units of work which call into the renderer itself to trace the rays that produce pixel colors. Using parallel_for() combined with C++11 lambda functions, the code which schedules all the frame rendering work looks like this:

float LocalTiledLoadBalancer::renderFrame(Renderer *renderer,
                                          FrameBuffer *fb,
                                          const uint32 channelFlags)
{
  void *perFrameData = renderer->beginFrame(fb);

  parallel_for(fb->getTotalTiles(), [&](int taskIndex) {
    const size_t numTiles_x = fb->getNumTiles().x;
    const size_t tile_y = taskIndex / numTiles_x;
    const size_t tile_x = taskIndex - tile_y*numTiles_x;
    const vec2i tileID(tile_x, tile_y);
    const int32 accumID = fb->accumID(tileID);

    if (fb->tileError(tileID) <= renderer->errorThreshold)
      return;

    Tile __aligned(64) tile(tileID, fb->size, accumID);

    parallel_for(numJobs(renderer->spp, accumID), [&](int tIdx) {
      renderer->renderTile(perFrameData, tile, tIdx);
    });

    fb->setTile(tile);
  });

  renderer->endFrame(perFrameData, channelFlags);

  return fb->endFrame(renderer->errorThreshold);
}

It’s nice that all the work scheduled per-frame in OSPRay fits in such a short function. The outer parallel_for() takes care of scheduling a task for each tile in the frame buffer, then the inner parallel_for() creates sub-tasks within individual tiles. While the surrounding code isn’t terribly important to understanding how parallel_for() is used here, the important takeaway is that the wrapper keeps this code from needing to care about how those tasks get scheduled: reasoning about all of the jobs running during a frame fits in one place, on one page.

Just for fun, have a look at LocalTiledLoadBalancer before we had this wrapper: it was pre-C++11 code, so it had to create explicit task types (no lambdas), and it only worked with the tasking system we had implemented in the project (header file and associated cpp file). C++11 and parallel_for() let us reason about tasking in one location instead of spreading it across single-purpose classes in multiple methods.

An important aspect of the above renderFrame() function is that we nest parallel_for() calls. This gives the tasking system more granular tasks to work with, enabling some implementations to do better load balancing. However, I’ll save that discussion for my next post, where I’ll look at performance measurements of the different tasking system backends in OSPRay.

Final thoughts

I hope that this little parallel_for() wrapper is both useful by itself and provides insight into how you may be able to make code both easier to understand and harder to misuse. The idea of wrapping tasking system interfaces doesn’t stop at parallel_for(): you can do single-shot task scheduling, implement an interface similar to std::async() using futures to get values back from tasks, and more.
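As a taste of that, here is a minimal sketch of what an async()-style wrapper might look like (the names are hypothetical, and I show only the serial fallback; a real version would hand the task to whichever backend is compiled in):

#include <future>
#include <utility>

template <typename TASK_T>
inline auto async(TASK_T&& fcn) -> std::future<decltype(fcn())>
{
  using result_t = decltype(fcn());
  std::packaged_task<result_t()> task(std::forward<TASK_T>(fcn));
  auto result = task.get_future();
  task(); // a real backend would schedule this instead of running it inline
  return result;
}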

Just for fun, I created a small project on GitHub which implements exactly this as a little header library. If you are interested in seeing it within OSPRay itself, you can find the files in this directory within the OSPRay repository. Also, if you end up writing some generic parallel algorithms using a parallel_for() wrapper, please share your code and your findings!

As I mentioned before, my next post will (very briefly) look at performance of parallel_for() inside of OSPRay…so stay tuned for part 3!

Until next time, happy coding!
