atomic memcpy suggestion

Question

While testing a program for scalability, I ran into a situation where I need to make a memcpy operation atomic. I have to copy 64 bytes of data from one location to another.
One solution I came across spins on a flag variable:

struct record{
    volatile int startFlag;
    char data[64];
    volatile int doneFlag;
};

The pseudocode follows:

struct record *node;
if (node->startFlag == 0) {  // test the flag first
    if (CompareAndSwap(node->startFlag, 0, 1)) {  // every thread tries to set it; only one succeeds and performs the memcpy
        memcpy(destination, source, NoOfBytes);
        node->doneFlag = 1;  // threads that lost the CompareAndSwap spin on this flag
    }
    else {
        while (node->doneFlag == 0) {  // the other threads spin here
            ;  // spin and/or apply a back-off policy
        }
    }
}

Can this work as an atomic memcpy? One problem I see: if the thread performing the memcpy is preempted (before or after the memcpy, but before setting doneFlag), the others keep spinning. Otherwise, what can be done to make this atomic?
The other threads must wait until the data has been copied, because they have to compare the inserted data with their own.
I am using the test-and-test-and-set approach on startFlag to avoid some costly atomic operations. Spin-locks are also scalable, but I have measured that atomic calls perform better than a spin-lock; moreover, I am looking for the problems that can arise in this snippet. Since I am using my own memory manager, memory allocation and free calls are costly for me, so copying the content into another buffer and then setting a pointer (a pointer-sized store can be atomic) is expensive: it would require many mem-alloc and mem-free calls.

EDIT: I am not using a mutex because mutexes do not seem to be scalable. Moreover, this is just one part of the program, so the critical section is not this small (I understand that it is hard to use atomic operations for larger critical sections).

Why do you use HTML for formatting your sourcecode? Have you not seen the formatting related buttons directly above the editor? — Sebastian Mach, Jul 15 '11 at 08:21

score 7 · Accepted Answer · answered Jul 17 '11 at 06:46


Your code snippet is definitely broken. There's a race on node->startFlag.

Unfortunately, there's no atomic way to copy 64 bytes. I think you have a number of options here.

Access node->startFlag in atomic fashion. I've written a couple of posts on the subject: here and here.
Protect the entire thing with a user-mode spinlock. Here's a post on the subject.
Use an RCU-like approach. You can read about RCU here. In short, the idea is to reference the buffer you want to copy using a pointer. Then you do:
    1. Allocate a new buffer.
    2. Create its contents (memcpy from your source).
    3. Atomically substitute the buffer with the new one.
    4. Wait for all threads accessing the old buffer to finish, then free it.
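
To make the four steps concrete, here is a minimal sketch in C11, not a definitive implementation. The names `current_buf`, `publish_record` and `read_record` are hypothetical, and `malloc` stands in for whatever allocator is actually in use. Step 4 is deliberately left to the caller: a real RCU scheme needs some way to know that no reader still holds the old pointer, and this sketch does not provide one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE 64

/* Hypothetical shared slot: readers load the pointer, never copy in place. */
static _Atomic(char *) current_buf = NULL;

/* Steps 1-3: build a private copy, then publish it with one atomic exchange.
 * On success returns 0 and stores the previous buffer in *old_out.
 * The caller must not free *old_out until no reader can still be using it
 * (step 4, which this sketch does not implement). */
static int publish_record(const char *source, char **old_out)
{
    assert(source != NULL && old_out != NULL);

    char *fresh = malloc(RECORD_SIZE);        /* 1. allocate a new buffer   */
    if (fresh == NULL)
        return -1;
    memcpy(fresh, source, RECORD_SIZE);       /* 2. fill it while private   */

    /* 3. swap the pointer; the release half of acq_rel makes the copied
     *    bytes visible to any reader that acquires the new pointer. */
    *old_out = atomic_exchange_explicit(&current_buf, fresh,
                                        memory_order_acq_rel);
    return 0;
}

/* Reader side: sees either the old buffer or the new one, never a mix. */
static const char *read_record(void)
{
    return atomic_load_explicit(&current_buf, memory_order_acquire);
}
```

The copy itself is still not atomic; what readers observe atomically is the pointer switch, so they always see a complete 64-byte record.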

Hope it helps. Alex.


Alexander Sandler


Actually, I maintain my own memory manager, so memory allocation and free calls are costly for me, and I cannot use the third option you mentioned. Can you comment on my approach, shown in the code snippet? Yes, there is a race on startFlag, but this is a form of test-and-test-and-set: I check the variable's value first to avoid some costly atomic operations. It would be great if you could tell me how my solution is going to get into trouble. I have already read all your posts on atomics and spinlocks. – peeyush Jul 17 '11 at 16:49
And do spin-locks disable interrupts? I don't think so; in that case my design also looks okay. – peeyush Jul 17 '11 at 18:00
OK. Let's say your CompareAndSwap() is an atomic test-and-set function. Then what do you need the first if statement for? It breaks the entire code snippet. Remember, once you access a certain variable using atomic operations, you should always access it using atomic operations. Also, you don't really need this if statement. – Alexander Sandler Jul 18 '11 at 18:37
Yes, CompareAndSwap is an atomic operation. I used that if statement to avoid a few costly compare-and-set (or test-and-set) calls. The variable is volatile; moreover, if the if statement evaluates to false, it prevents the call to compare-and-set, as in test-and-test-and-set (an extension of test-and-set). See the wiki: http://en.wikipedia.org/wiki/Test_and_Test-and-set . What about the whole snippet, does it seem workable? – peeyush Jul 19 '11 at 05:29

score 4 · Answer 2 · answered Feb 02 '15 at 20:51

It's late, but for others arriving at this question: the following is simpler and faster, and puts less pressure on the cache.

Note: I changed CAS to the corresponding GCC atomic builtin. There is no need for "volatile"; the CAS introduces a memory barrier.

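// Simpler structure: one flag serves as both "start" and "done"
struct record {
    int spin = 0;   // 0 = free, 1 = held (default member initializer needs C++11)
    char data[64];
};

struct record *node;

// Test-and-test-and-set: read the flag first, attempt the CAS only when it looks free
while (node->spin || ! __sync_bool_compare_and_swap(&node->spin , 0 , 1)); // spin
memcpy(destination,source,NoOfBytes);
node->spin = 0;  // release the lock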

PS: I'm not sure if a CAS instead of node->spin = 0 could improve efficiency a little more.
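
For comparison, the same test-and-test-and-set lock can be sketched with C11 `<stdatomic.h>` instead of the GCC builtins. This is an illustration under assumptions, not the answer's code: the helper names `record_lock`, `record_unlock` and `record_copy_in` are made up here, and the plain `node->spin = 0` is swapped for a release store, because under the C11 memory model every access to the flag has to be atomic for the copied bytes to be guaranteed visible to the next lock holder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

struct record {
    atomic_int spin;      /* 0 = free, 1 = held */
    char data[64];
};

/* Test-and-test-and-set: spin on a cheap relaxed load and only attempt
 * the (more expensive) compare-exchange when the lock looks free. */
static void record_lock(struct record *node)
{
    for (;;) {
        while (atomic_load_explicit(&node->spin, memory_order_relaxed) != 0)
            ;  /* spin; a back-off policy could go here */

        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(&node->spin, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;  /* we own the lock */
    }
}

/* Release store: everything written while holding the lock becomes
 * visible to the thread that acquires it next. */
static void record_unlock(struct record *node)
{
    atomic_store_explicit(&node->spin, 0, memory_order_release);
}

/* Hypothetical helper: copy 64 bytes into the record under the lock. */
static void record_copy_in(struct record *node, const char *source)
{
    record_lock(node);
    memcpy(node->data, source, sizeof node->data);
    record_unlock(node);
}
```

Readers would take the same lock around their own memcpy or comparison. A plain store is enough to release the lock as long as it has release ordering; a CAS on unlock is not needed, since only the holder writes 0.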

score 3 · Answer 3 · answered Jul 22 '11 at 11:31

Don't use a lock, use a CriticalSection. Locks are heavyweight; CriticalSections are extremely fast (just a couple of instructions, depending on the platform). You did not specify an operating system, and what I post here comes from experience on Windows, though other OSes should be similar.

You were concerned that CriticalSections might not be scalable enough for your purpose if they contain a lot of code. The underlying reason (and probably the argument wherever you read that) is that threads cannot interleave as finely when each holds on to the CS for a long time. You can avoid that by wrapping the CS around only the part of your code that really needs to be atomic. On the other hand, if you use CSs at too fine a granularity, the percentage overhead will of course increase. You cannot avoid this trade-off with any kind of synchronization.

You say that the atomic operation you need is a copy of 64 bytes. In that case your synchronization overhead with CSs will be negligible. Just try it. By choosing the granularity at which you synchronize (around a single 64-byte copy or around, say, 4 of these copies), you can balance thread-interleaving granularity against percentage overhead with some experimentation. But in general, CSs are fast enough and scalable enough.

Not at all - a critical section means only one thread can run the copy code at a time. That absolutely **murders** scalability and its entire family in a bloody massacre. Nevermind the fact that a "critical section" under the hood is just some type of lock in the first place. — Andrew Henle, Dec 02 '22 at 12:53

score 2 · Answer 4 · Mihai Maruseac · 2011-07-15T08:30:44.393

Use a synchronization mechanism. A mutex seems reasonable.

If you are concerned about scalability, try using a monitor.
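
As a rough sketch of the mutex suggestion, assuming POSIX threads: the lock is held only around the 64-byte copy, so the critical section stays as small as possible. The function names `record_init`, `record_write` and `record_read` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

struct record {
    pthread_mutex_t lock;   /* protects data */
    char data[64];
};

/* Returns 0 on success, like pthread_mutex_init itself. */
static int record_init(struct record *r)
{
    memset(r->data, 0, sizeof r->data);
    return pthread_mutex_init(&r->lock, NULL);
}

/* Writers and readers both take the lock, so a reader never observes
 * a half-copied record. */
static void record_write(struct record *r, const char *source)
{
    pthread_mutex_lock(&r->lock);
    memcpy(r->data, source, sizeof r->data);
    pthread_mutex_unlock(&r->lock);
}

static void record_read(struct record *r, char *dest)
{
    pthread_mutex_lock(&r->lock);
    memcpy(dest, r->data, sizeof r->data);
    pthread_mutex_unlock(&r->lock);
}
```

Whether this scales well enough is something only measurement on the actual workload can answer; with one mutex per record, contention arises only when threads touch the same record.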

edited Jul 15 '11 at 08:30

answered Jul 15 '11 at 08:19


You cannot escape blocking if you want to serialize your threads. By scalability I was thinking of the number of lines needed to be added in each thread for a single mutex. – Mihai Maruseac Jul 15 '11 at 08:43
Atomic operations can only work with small amounts of data, usually one register. Beyond that, if you want an atomic operation on a larger size you would use a `LOCK` in assembly (not for every programmer). A normal userspace program will use a synchronization mechanism if the data is larger. – Mihai Maruseac Jul 15 '11 at 09:07
Agreed, atomic operations are provided for pointer-sized memory locations, but there are techniques such as double compare-and-swap (commonly known as DCAS) and its extensions, word-based software transactional memory and object-based software transactional memory (though not widely accepted). So I was looking for an atomic memcpy, since the data to be copied is small (64 bytes). – peeyush Jul 15 '11 at 09:16
Yes, STM methods avoid locking. But in this case it is a pain to employ them just for a single memcpy. – Mihai Maruseac Jul 15 '11 at 09:24
So what do you think? What can make this scalable? Can you comment on the code I have shown, which seems to be lock-free? Also, please delete your older comments to keep the discussion going, since SO does not allow one-to-one chat/discussion here. Do you have an answer? – peeyush Jul 16 '11 at 05:21
@Peeyush: Be aware that shared memory multiprocessing doesn't scale very well in the first place. Using separate memory arenas for each thread and minimizing sharing helps a lot, but going to full message passing is about the only thing that scales properly (i.e., to more processors than will fit in one backplane…) – Donal Fellows Jul 24 '11 at 17:05
