WinAPI _Interlocked* intrinsic functions for char, short

Question

I need to use the _Interlocked* functions on a char or short, but they take a long pointer as input. There seems to be a function _InterlockedExchange8, but I can't find any documentation for it, so it looks like an undocumented feature. The compiler also wasn't able to find an _InterlockedAdd8 function. I would appreciate any information on these functions, recommendations on whether or not to use them, and other solutions as well.

update 1

I'll try to simplify the question. How can I make this work?

struct X
{
    char data;
};

X atomic_exchange(X another)
{
    return _InterlockedExchange( ??? );
}

I see two possible solutions

Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X

The first one is obviously a bad solution. The second one looks better, but how do I implement it?

update 2

What do you think about something like this?

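#include <intrin.h>  // declares _InterlockedExchange (MSVC)

// Holds a T followed by enough padding to make the object as large as a U,
// so that the same storage can be viewed either as a T or as a U.
template <typename T, typename U>
class padded_variable
{
public:
    padded_variable(T v): var(v) {}
    // Takes the leading sizeof(T) bytes of v as the stored T.
    padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}
    // Views the whole padded object as a U.
    U& cast()
    {
        return *static_cast<U*>(static_cast<void*>(&var));
    }
    T& get()
    {
        return var;
    }
private:
    T var;
    char padding[sizeof(U) - sizeof(T)];
};

struct X
{
    char data;
};

// Primary template is only declared; sizes without a specialization are not supported.
template <typename T, int S = sizeof(T)> class var;
// Specialization for one-byte types: the value is padded out to a long.
template <typename T> class var<T, 1>
{
public:
    var(): data(T()) {}
    // Exchanges the padded long atomically and returns the previous T.
    T atomic_exchange(T another)
    {
        padded_variable<T, long> xch(another);
        padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
        return res.get();
    }
private:
    padded_variable<T, long> data;
};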

Thanks.

Could you tell us _why_ you need to use the Interlocked* functions? It's very hard to suggest solutions without knowing what the problem is. Even the mighty internet can't seem to find _InterlockedExchange8, except for one post about the Windows DDK. – molbdnilo, Feb 22 '11 at 06:11
What do you mean? I need to use the Interlocked functions for the purpose they were made for: atomic RMW operations on a variable. _InterlockedExchange8 cannot be found on the internet, as I said, but the compiler can still find it. – ledokol, Feb 22 '11 at 06:20
If there's no documentation for a function anywhere, it's a definite sign that you need a different solution. Nobody can suggest other solutions than Interlocked* unless you say what problem you're trying to solve by using them. And "I need to do atomic RMW" is not your problem, it's your solution. If you provide some background on _why_ you need to do that you're more likely to get suggestions. – molbdnilo, Feb 22 '11 at 07:13
@molbdnilo: That is my problem: I need to do an atomic RMW operation on a specified variable. I really don't understand what you want to hear. I have a variable which I want to change atomically from different threads. The function that changes this variable has to be thread-safe. – ledokol, Feb 22 '11 at 07:21
You wrote: `did your boss come to you and say "you need to do an atomic RMW operation on a variable"`. Yes, my task is to implement a class which behaves like that. This is the exact problem that needs to be solved; nothing can be changed here. Suppose I have a function swap, which needs to atomically swap an internal variable with another one and return the old one. This is a template class; the template argument might be int, short, long or some POD structure with a size less than sizeof(long), otherwise it is not implemented. – ledokol, Feb 22 '11 at 08:11
@ledokol: so if it isn't possible, you'll get fired? ;) The point here is that you need to do it *in order to make something else work*. So what's important isn't really the atomic swap itself, but the *other* problem that it solves. – jalf, Feb 22 '11 at 12:14

score 2 · Answer 1 · answered Apr 04 '11 at 06:19

It's pretty easy to write 8-bit and 16-bit interlocked functions yourself; they're not included in WinAPI because of IA64 portability. If you want to support Win64, the assembler cannot be inline, as MSVC does not support inline assembly for 64-bit targets. Implemented as external functions with MASM64, they will not be as fast as inline code or intrinsics, so you would be wiser to investigate promoting your algorithms to use 32-bit and 64-bit atomic operations instead.

Example interlocked API implementation: intrin.asm

score 1 · Answer 2 · answered Feb 22 '11 at 05:33

Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.

Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.
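As a rough illustration of that layout, here is a minimal C++11 sketch. The names `kCacheLine`, `PerThreadBlock` and `on_distinct_lines` are made up for this example, and 128 is simply the figure mentioned above, not a value queried from the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size: the 128 bytes mentioned above, not a measured value.
constexpr std::size_t kCacheLine = 128;

// Hypothetical per-thread block. alignas both aligns the struct and rounds
// its size up to a multiple of kCacheLine, so neighbouring array elements
// never share a cache line.
struct alignas(kCacheLine) PerThreadBlock {
    char data[100];  // small payload touched by one thread at a time
};

static_assert(sizeof(PerThreadBlock) == kCacheLine, "size is padded to one line");
static_assert(alignof(PerThreadBlock) == kCacheLine, "block starts on a line boundary");

// Returns true if the two blocks start on different cache lines.
inline bool on_distinct_lines(const PerThreadBlock& a, const PerThreadBlock& b) {
    const std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(&a);
    const std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(&b);
    return pa / kCacheLine != pb / kCacheLine;
}

// Usage: give each worker thread its own element of the array.
inline void demo_blocks() {
    PerThreadBlock blocks[2];
    assert(reinterpret_cast<std::uintptr_t>(&blocks[0]) % kCacheLine == 0);
    assert(on_distinct_lines(blocks[0], blocks[1]));
}
```

The cost is wasted space: a single char per thread would still occupy a full 128-byte block.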

answered Feb 22 '11 at 05:33 by Ben Voigt

There are user interface functions which return this data type. If I store a long internally, I would need a cast in every function that returns this variable. That is a solution too, of course, but then I need to think about how to do a portable cast from long to char (endianness may cause problems with a simple cast or memcpy). – ledokol Feb 22 '11 at 05:37
Yes, I forgot to mention: I need to use memcpy, as the class is a template class and the return type might not be an integral type. It can be any structure T with sizeof(T) == 2 or sizeof(T) == 1, so memcpy should be used instead of static_cast here :( – ledokol Feb 22 '11 at 06:10
That sounds like an excellent reason to *not* use memcpy. memcpy isn't safe to use on arbitrary types in C++. – jalf Feb 22 '11 at 07:48
Yes, that's what I'm saying :) I can't use static_cast, because it will not work for POD structures, and I can't use memcpy, because it isn't safe. That's why in my question I prefer a function that does the actual operation on a char. – ledokol Feb 22 '11 at 07:53
@Ben 128 bytes? Well, if you have a lot more items than threads, then you wouldn't need such big items for many algorithms. – David Heffernan Feb 22 '11 at 08:07
@David: If you have less than 128 bytes of data for a thread to work on, it would be faster (excluding really dense computations) to just process it sequentially and avoid the cost of starting a thread/borrowing one from a thread pool. And yes, that can be a single item, but most likely it's going to be an array of small ones. As long as the block size and alignment are multiples of 128 bytes. – Ben Voigt Feb 22 '11 at 14:29
@Ben Here's the example I'm thinking of. Consider a very long array of integers (doesn't matter how wide). Suppose you want to increment each element. You divide the array into equal sized blocks and let the threads operate on a block each. – David Heffernan Feb 22 '11 at 14:36
@David: You don't choose equal-sized blocks, you choose your block sizes to be a multiple of 128 bytes, and aligned. At least you do if you want to see any speedup from parallelization. The time necessary for one thread to process the extra "scraps" on the ends is much lower than the time that would be wasted under conditions of cache contention. – Ben Voigt Feb 22 '11 at 14:44
@Ben If the array is big enough to be worth getting threads involved, then the performance for the "scraps" on the end becomes irrelevant. What's more there isn't contention because, typically, you don't get one thread at an end whilst another is starting. – David Heffernan Feb 22 '11 at 14:45
@David: Because the scraps on the end are negligible, there's no reason NOT to divide the work at cache line boundaries. – Ben Voigt Feb 22 '11 at 14:47
@Ben Well I agree with that. Especially if you have a more complex example where contention is more likely to happen. – David Heffernan Feb 22 '11 at 14:53

score 1 · Answer 3 · answered Feb 22 '11 at 07:54

Well, you have to make do with the functions available. _InterlockedIncrement and _InterlockedCompareExchange are available in 16- and 32-bit variants (the latter in a 64-bit variant as well), and maybe a few other interlocked intrinsics are available in 16-bit versions too, but InterlockedAdd doesn't seem to be, and there seem to be no byte-sized Interlocked intrinsics/functions at all.

So... You need to take a step back and figure out how to solve your problem without an _InterlockedAdd8.

Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.
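One way to stay with a long-sized object while still handing out a smaller type might look like the sketch below. It is not taken from the thread: it uses `std::atomic<long>` from C++11 in place of the `_InterlockedExchange` intrinsic so that it is not tied to MSVC, the class name `widened_atomic` is invented, and it assumes T is trivially copyable and default-constructible. Because the bytes are copied in and out at the same offset with memcpy, byte order should not matter.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical wrapper: keeps a small T inside a long, so the exchange
// itself always operates on a whole, naturally aligned long.
template <typename T>
class widened_atomic {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    static_assert(sizeof(T) <= sizeof(long), "T must fit in a long");
public:
    explicit widened_atomic(T initial = T()) : storage_(pack(initial)) {}

    // Atomically stores `desired` and returns the previously stored value.
    T exchange(T desired) {
        return unpack(storage_.exchange(pack(desired)));
    }

private:
    // Copies the bytes of `value` into the first bytes of a zeroed long.
    static long pack(T value) {
        long raw = 0;
        std::memcpy(&raw, &value, sizeof(T));
        return raw;
    }
    // Copies the same leading bytes back out into a T.
    static T unpack(long raw) {
        T value;
        std::memcpy(&value, &raw, sizeof(T));
        return value;
    }

    std::atomic<long> storage_;
};

struct X { char data; };

// Usage: callers only ever see X; the long stays an implementation detail.
inline void demo_widened() {
    widened_atomic<X> v(X{'a'});
    X old = v.exchange(X{'b'});
    assert(old.data == 'a');
}
```

With MSVC intrinsics, the `storage_.exchange(...)` call would presumably become `_InterlockedExchange` on a long member instead.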

`Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.` I have a function which returns char (actually a template type T, and it needs to return whatever its template parameter is). I can't change that function, I just need to make it atomic. The cast is the problem, as described above. Is there any other way I can make a char from a long? – ledokol, Feb 22 '11 at 08:13

score 1 · Answer 4 · answered Feb 22 '11 at 09:48

Creating a new answer because your edit changed things a bit:

Use _InterlockedExchange8

Cast another to long, do exchange and cast result back to X

The first simply won't work. Even if the function existed, it would only let you atomically update one byte at a time, which means that an object larger than one byte would be updated in a series of steps that are not atomic as a whole.

The second doesn't work either, unless X is a POD type of the same size as a long and is aligned on a sizeof(long) boundary.

In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.

Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.

Does that cover all the cases you can encounter?

If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.
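A minimal sketch of that lock-based fallback, assuming C++11's `std::mutex` is available (on Windows at the time, a CRITICAL_SECTION would have played the same role). The names `locked_value` and `Wide` are invented for the example.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical fallback for types that no interlocked operation can cover:
// every access goes through one mutex, so only one thread touches the value.
template <typename T>
class locked_value {
public:
    explicit locked_value(T initial = T()) : value_(initial) {}

    // Swaps in `desired` under the lock and returns the previous value.
    T exchange(T desired) {
        std::lock_guard<std::mutex> guard(mutex_);
        std::swap(value_, desired);
        return desired;  // now holds the old contents
    }

private:
    std::mutex mutex_;
    T value_;
};

// Example type that is too wide for a single interlocked exchange.
struct Wide { long long a; long long b; long long c; };

inline void demo_locked() {
    locked_value<Wide> v(Wide{1, 2, 3});
    Wide old = v.exchange(Wide{4, 5, 6});
    assert(old.a == 1 && old.b == 2 && old.c == 3);
}
```

This gives up lock-freedom, but it places no restriction on the size or alignment of T.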

answered Feb 22 '11 at 09:48 by jalf

X is POD type. The size is restricted from 1 to 4 bytes (fits in long). – ledokol Feb 22 '11 at 09:51
And you need to do what with it, exactly? Swap it with another as an atomic operation? You mentioned `InterlockedAdd` in your question as well. – jalf Feb 22 '11 at 09:55
InterlockedExchange only (with this type of objects). – ledokol Feb 22 '11 at 10:14
The 1-byte case is the only really tricky one then. For the other cases, I'd write a template function which is specialized with SFINAE to call the 16- or 32-bit versions of `_InterlockedExchange`. The problem with using a wider version (say, `_InterlockedExchange16`) on a single byte, is that its parameter has to be aligned on a 16-bit boundary, which a byte might not be. But if you account for that and include either the following or the previous byte in the 16-bit chunk, then I suppose it would probably work, even if it's ugly. – jalf Feb 22 '11 at 10:53
The variable is aligned to a 4-byte boundary with __declspec (align(4)), but I didn't get what you suggested using for copying the `long` back to `X`. – ledokol Feb 22 '11 at 11:37
I wasn't suggesting you copy it, as such, but simply that you write a template function which, if the input type is 16 bits wide, creates a `short*` and calls `_InterlockedExchange16`, and if it is 32 bits wide, creates a `long*` and calls the 32-bit `_InterlockedExchange`. Something like `boost::enable_if` could be used to implement that. For actually converting the pointers, you can use `reinterpret_cast` (should work in practice, but goes a bit beyond what the standard actually guarantees), or `static_cast` to a void pointer, and then again from void to long/short pointer. – jalf Feb 22 '11 at 12:13
Of course, if you use 16-bit swap on `char` sized data, then you'll end up swapping another byte of memory as well. Is that guaranteed to be acceptable? – jalf Feb 22 '11 at 12:16
`Of course, if you use 16-bit swap on char sized data, then you'll end up swapping another byte of memory as well. Is that guaranteed to be acceptable?`: That is the problem... but I think I can make it work. I think just adding one more dummy variable to the class will solve that. Let me think a bit... – ledokol Feb 22 '11 at 12:18
@jalf: Added some code, I guess this should be portable enough. What do you think about it? – ledokol Feb 22 '11 at 13:19
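
The size-based dispatch suggested in the comments above could look roughly like the following sketch. It deviates from the suggestion in two labeled ways: it selects the storage type by size with `std::enable_if` instead of casting a `T*` to `short*` or `long*`, and it uses `std::atomic` from C++11 rather than `_InterlockedExchange16` and `_InterlockedExchange`. The names `exchange_word`, `sized_atomic`, `Pair16` and `Quad32` are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical trait: maps sizeof(T) to the integer word that gets exchanged.
// The primary template is left undefined, so unsupported sizes fail to compile.
template <typename T, typename Enable = void>
struct exchange_word;

template <typename T>
struct exchange_word<T, typename std::enable_if<sizeof(T) == 2>::type> {
    typedef std::int16_t type;  // the 16-bit case
};

template <typename T>
struct exchange_word<T, typename std::enable_if<sizeof(T) == 4>::type> {
    typedef std::int32_t type;  // the 32-bit case
};

template <typename T>
class sized_atomic {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    typedef typename exchange_word<T>::type word;
public:
    explicit sized_atomic(T initial = T()) : storage_(to_word(initial)) {}

    // Atomically stores `desired` and returns the previously stored value.
    T exchange(T desired) {
        return from_word(storage_.exchange(to_word(desired)));
    }

private:
    // T and word have the same size, so a plain byte copy converts both ways.
    static word to_word(T value) {
        word raw;
        std::memcpy(&raw, &value, sizeof(T));
        return raw;
    }
    static T from_word(word raw) {
        T value;
        std::memcpy(&value, &raw, sizeof(T));
        return value;
    }

    std::atomic<word> storage_;
};

struct Pair16 { char lo; char hi; };         // 2 bytes
struct Quad32 { short a; char b; char c; };  // 4 bytes

inline void demo_sized() {
    sized_atomic<Pair16> p(Pair16{'a', 'b'});
    Pair16 old = p.exchange(Pair16{'c', 'd'});
    assert(old.lo == 'a' && old.hi == 'b');
}
```

The 1-byte case is deliberately left out here, since it is the one that needs padding or a neighbouring byte.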
