Building a C++ SIMD abstraction (1/N) – Motivation

Howdy everyone out there! After a bit of a hiatus from writing, I was inspired to get back to expository writing when CppCast recently had Jonathan Boccara on as a guest to talk about blogging.

Recent developments in OSPRay since the summer have brought several deep changes and additions to our infrastructure, including rewrites of our MPI messaging code and scene graph implementation, along with other cleanups of base abstractions to help reduce the growing complexity of our code base. Some of these changes were also timed to meet certain deliverables over the summer. So to say the least, it’s been a BUSY year.

That said, some of the abstractions we are using in OSPRay may be worth sharing more generally at some point, each of which would be worth a blog post of its own. Perhaps those will become future topics.

This post marks the beginning of a series on work I’m getting back to for OSPRay: exploring what is possible in modern C++ for writing efficient SIMD code without sacrificing either expressiveness or readability.

Why consider SIMD?

The ‘why’ is important to understand before getting lost in the ‘what’. So why consider programming with SIMD? Performance!

Now not all performance problems can be solved with SIMD. There are two core aspects of code performance: latency and throughput. The two are closely related, and which one matters depends on what part of your code you are looking at and measuring.

I’ll define latency as “the time it takes for operation [X] to complete”. For problems that are completely serial, this usually comes out in code as “how many instructions am I executing”, which I think is a reasonable proxy (not a law) for thinking about the latency of a function.

I’ll define throughput as “the number of [X] things that can be computed in Y amount of time”. Sometimes people use the term bandwidth to describe this; I (pedantically) stick with throughput, as it better fits the amount of things computed, whereas I associate bandwidth with transferring (not computing) things. However, both terms fundamentally describe “how much stuff” per “unit of time”.

The way these two are related is that some latency problems can be solved by increasing throughput. For example, if I am computing a blur filter on a digital picture, the latency of computing an output pixel from the input image is the time it takes to evaluate some mathematical expression. There’s an inherent latency for the instructions required to evaluate that expression. However, if there is a way to evaluate the expression for more than one pixel with the same latency as operating on a single pixel, then throughput increases roughly linearly with the number of pixels we can compute at one time (i.e. in parallel). This reduces the total number of times we have to evaluate the expression, which in turn reduces the overall latency of the image filter: for instance, evaluating 8 pixels at a time cuts the number of evaluations, and thus the filter’s total latency, by roughly 8x.

Thus problems which have lots of the same operation being computed on many items of data (in parallel) will see decreased latency through increased throughput. This assumes, however, that the working set of data is larger than the peak throughput of the implementation; otherwise the latency of the actual instructions (i.e. the number of instructions) is the bottleneck…with me so far?

Before moving on, I’ll mention that this trend of using data parallelism to put more of a compute device’s transistors to work has been going on for well over a decade now and is a primary driver of performance on modern hardware. While I see this trend primarily with CPUs (working at Intel), it is equally true of other hardware architectures such as GPUs and FPGAs. For instance, everything I’m going to talk about with CPUs is directly applicable to GPGPU programming as well: there’s a spectrum of throughput vs. latency optimizations in hardware between GPUs and CPUs, but they are all best fed by parallel code!

Don’t get too tripped up by the drastically different spellings of CPU and GPU code (which is VERY unfortunate): the concepts for getting throughput are exactly the same between CPUs and GPUs.

So what is SIMD?

I’m not going to give a comprehensive overview of SIMD here, so I highly recommend searching the web for more complete descriptions of what SIMD is and what it can do for you. Regardless, I’ll try to make sure you at least see my working view of SIMD.

In case this is new to you, SIMD stands for Single Instruction, Multiple Data. It is a label for hardware instructions that act on multiple data items at once, getting more done per instruction. In general, the adjective used to describe this is vector: SIMD code uses vector instructions operating on vector registers. The process of transforming non-SIMD (scalar) code into SIMD code is thus called vectorization.

A SIMD illustration: a single instruction stream operating on multiple input data items produces multiple results.

Scalar code is what we think of as “normal” C++: when I write ‘float a = b + c;’, I expect the variable ‘a’ to contain the single value resulting from adding the single values of ‘b’ and ‘c’. A vector version of the same statement would be something like ‘float8 a = b + c;’, where a single addition instruction is applied to 8 items at a time and stored into the vector variable ‘a’. The number 8 isn’t particularly important here, but note that SIMD code is most efficient when written in terms of native SIMD register sizes (e.g. 8-wide float registers on AVX/AVX2 CPUs). Addition, however, is trivial: the real design decisions in vectorizing code center around getting more complicated code (long expressions, control flow divergence, function calls, etc.) into that same form, in ways that are less error-prone while preserving good performance.
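To make that concrete, here is a minimal sketch of the scalar statement next to its closest real-world equivalent today: the 8-wide AVX intrinsics from <immintrin.h>. (The ‘float8’ spelling above is hypothetical; a type like it is where this series is headed.)

    #include <immintrin.h>

    // Scalar: one add instruction produces one result.
    float add_scalar(float b, float c)
    {
      return b + c;
    }

    // Vector: one AVX add instruction produces eight results.
    // __m256 is the native 8-wide float register type.
    __m256 add_vector(__m256 b, __m256 c)
    {
      return _mm256_add_ps(b, c);
    }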

Issues to consider with SIMD

In general, vectorizing code can be difficult as it makes you re-think various aspects of an algorithm:

  • Can my algorithm be parallelized with vector instructions?
  • Will using SIMD instructions change the calculated results?
  • How many input data items can be processed simultaneously?

These questions are very specific to the algorithm and/or domain of the problem, meaning that there is no obvious answer that can cover all cases. However, there are others that are common among many domains and problem categories:

  • When I write long or complicated expressions, do I get good performance?
  • What if I need to work with a subset of values in a SIMD register? (see the masking sketch after this list)
  • What if I need to do an operation on all the items in a SIMD register? (see the reduction sketch after this list)
  • How hard is it to cross function call barriers?
  • What confidence do I have in maintaining performance in the future?
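To ground those two middle questions, here is a rough sketch of what “a subset of values” (masking/blending) and “an operation on all the items” (a horizontal reduction) look like today with raw AVX intrinsics, which I’m using here only as a baseline, not as the abstraction this series will build:

    #include <immintrin.h>

    // "Subset of values": blend two registers under a per-lane mask.
    // This is the 8-wide equivalent of 'x = (x < 0) ? 0 : x'.
    __m256 clamp_negatives_to_zero(__m256 x)
    {
      __m256 zero = _mm256_setzero_ps();
      __m256 mask = _mm256_cmp_ps(x, zero, _CMP_LT_OQ); // lanes where x < 0
      return _mm256_blendv_ps(x, zero, mask);           // pick 0 in those lanes
    }

    // "Operation on all the items": sum every lane of one register,
    // done here the simple way by spilling to memory and looping.
    float horizontal_sum(__m256 v)
    {
      alignas(32) float tmp[8];
      _mm256_store_ps(tmp, v);
      float sum = 0.f;
      for (int i = 0; i < 8; ++i)
        sum += tmp[i];
      return sum;
    }

Neither is hard in isolation, but notice how quickly the intent (“clamp negatives”, “sum the lanes”) disappears behind the spelling.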

I think these questions are a great way to evaluate the efficacy of a SIMD solution. In other words, I take each of the above questions and ask “does [SIMD solution] make [aspect of programming] easier or more difficult?”.

Some aspects of programming used to judge SIMD abstractions will include:

  • Readability
  • Performance
  • Ease of writing correct code
  • Level of control (related to performance)
  • Ability to compose with existing C++
  • Portability

Current Options in C++

On an abstract level, SIMD programming is hard because you must express your code with data parallelism in mind. That is ultimately what we want our code to say, but there are a number of ways it can be spelled, some better than others according to the above list.

I like to think of SIMD C++ abstractions in 4 categories (in no particular order):

  1. Hand-coded intrinsics/assembly
  2. Annotated C++ directives
  3. Language extensions + non-standard tool chains
  4. Libraries providing SIMD enabled types
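To give a quick flavor of the first two categories before the deeper dives in later posts, here is the same scaling loop written both ways. The intrinsics version assumes AVX and, to keep the sketch short, that ‘n’ is a multiple of 8:

    #include <immintrin.h>

    // Category #2: an annotated loop. The '#pragma omp simd' directive
    // (OpenMP 4.0+) asks the compiler to vectorize the loop for us.
    void scale(float *out, const float *in, float s, int n)
    {
      #pragma omp simd
      for (int i = 0; i < n; ++i)
        out[i] = in[i] * s;
    }

    // Category #1: the same loop hand-coded with AVX intrinsics.
    void scale_avx(float *out, const float *in, float s, int n)
    {
      __m256 vs = _mm256_set1_ps(s);
      for (int i = 0; i < n; i += 8) {
        __m256 vin = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(vin, vs));
      }
    }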

Before getting to the SIMD abstraction I’m building (which is a library solution, #4), I want to walk through #1-3 and examine the pros and cons of each to better motivate why I’m working on yet another SIMD programming solution. I’ll make sure not to leave out the pros of each, because I’m not trying to say that what I’m doing is the perfect solution for everyone; rather, I think it is great where the other 3 are not ideal.
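Since #4 is where this series is headed, here is a rough, hypothetical sketch of the style of interface a library solution can offer. To be clear, ‘vfloat’ and its operators below are my own illustration for this post, not the actual OSPRay abstraction (which later posts will present):

    #include <immintrin.h>

    // A minimal SIMD-enabled type: an 8-wide float that overloads the
    // arithmetic operators so expressions read like scalar code.
    struct vfloat
    {
      __m256 v;
      vfloat(float x) : v(_mm256_set1_ps(x)) {}
      vfloat(__m256 x) : v(x) {}
    };

    inline vfloat operator+(vfloat a, vfloat b) { return _mm256_add_ps(a.v, b.v); }
    inline vfloat operator-(vfloat a, vfloat b) { return _mm256_sub_ps(a.v, b.v); }
    inline vfloat operator*(vfloat a, vfloat b) { return _mm256_mul_ps(a.v, b.v); }

    // Long expressions then look just like their scalar counterparts:
    vfloat lerp(vfloat t, vfloat a, vfloat b)
    {
      return a + t * (b - a);
    }

The readability win is immediate; the hard parts (masks, control flow divergence, crossing function calls) are exactly what the rest of this series will dig into.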

Conclusion

I’m excited to be sharing what I’m working on to make SIMD programming easier in OSPRay, which is hopefully also useful for folks outside the world of OSPRay. After taking my own advice originally given about thread tasking systems (see here), I did the research and found that I really want something different from what’s out there, both in the big picture and in the details of the various options I considered.

Hopefully my journey of writing SIMD code will be helpful to someone out there. However, I’m far from the biggest SIMD expert out there, so if you have thoughts or input on this, please comment or email me with feedback!

So until next time, happy coding!
