
The context is inter-process communication where one process ("Server") has to send fixed-size structs to many listening processes ("Clients") running on the same machine.

I am very comfortable doing this with socket programming. To make the communication between the Server and the Clients faster and to reduce the number of copies, I want to try out shared memory (shm) or mmap.

The OS is RHEL 64bit.

Since I am a newbie, please suggest which I should use. I'd appreciate it if someone could point me to a book or online resource to learn this.

Thanks for the answers. I wanted to add that the Server (a market data server) will typically be receiving multicast data, which will cause it to be "sending" about 200,000 structs per second to the Clients, where each struct is roughly 100 bytes. Does a shm_open/mmap implementation outperform sockets only for large blocks of data, or also for a large volume of small structs?

Humble Debugger

4 Answers


I'd use mmap together with shm_open to map shared memory into the virtual address space of the processes. This is relatively direct and clean (a minimal sketch follows the list):

  • you identify your shared memory segment with some kind of symbolic name, something like "/myRegion"
  • with shm_open you open a file descriptor on that region
  • with ftruncate you enlarge the segment to the size you need
  • with mmap you map it into your address space
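
A minimal sketch of these four steps, assuming a hypothetical segment name "/myRegion" and an illustrative 100-byte struct (on older glibc you may need to link with -lrt):

    #include <fcntl.h>      /* O_CREAT, O_RDWR */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>   /* shm_open, mmap */
    #include <sys/stat.h>
    #include <unistd.h>     /* ftruncate, close */

    /* Hypothetical payload; the real layout comes from your application. */
    struct msg { char payload[100]; };

    int main(void)
    {
        const char *name = "/myRegion";           /* symbolic name of the region */
        size_t size = sizeof(struct msg) * 4096;  /* illustrative capacity */

        /* open (or create) a file descriptor on that region */
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd == -1) { perror("shm_open"); return 1; }

        /* enlarge the segment to the size you need */
        if (ftruncate(fd, size) == -1) { perror("ftruncate"); return 1; }

        /* map it into the address space */
        void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* the mapping is now ordinary memory, visible to every process
           that maps the same name */
        struct msg *slots = base;
        memcpy(slots[0].payload, "hello", 6);

        munmap(base, size);
        close(fd);
        /* call shm_unlink(name) once nobody needs the segment any more */
        return 0;
    }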

The shmat and co. interfaces have (at least historically) the disadvantage that they may impose a restriction on the maximum amount of memory you can map.

Then, all the POSIX thread synchronization tools (pthread_mutex_t, pthread_cond_t, sem_t, pthread_rwlock_t, ...) have initialization interfaces that allow you to use them in a process shared context, too. All modern Linux distributions support this.
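
For instance, here is a minimal sketch of initializing a mutex and a condition variable that live inside the shared segment so they work across processes (the shared_hdr layout is a hypothetical assumption; compile with -pthread):

    #include <pthread.h>

    /* Hypothetical header placed at the start of the shared segment. */
    struct shared_hdr {
        pthread_mutex_t lock;
        pthread_cond_t  data_ready;
    };

    /* Run once by the process that creates the segment. */
    int init_shared_sync(struct shared_hdr *hdr)
    {
        pthread_mutexattr_t ma;
        pthread_condattr_t  ca;

        pthread_mutexattr_init(&ma);
        pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
        if (pthread_mutex_init(&hdr->lock, &ma) != 0)
            return -1;
        pthread_mutexattr_destroy(&ma);

        pthread_condattr_init(&ca);
        pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
        if (pthread_cond_init(&hdr->data_ready, &ca) != 0)
            return -1;
        pthread_condattr_destroy(&ca);

        return 0;
    }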

Is this preferable to sockets? Performance-wise it could make a bit of a difference, since you don't have to copy things around. But the main point, I guess, is that once you have initialized your segment, this is conceptually a bit simpler: to access an item you just take the shared lock, read the data, and unlock the lock again.

As @R suggests, if you have multiple readers, pthread_rwlock_t would probably be the best lock structure to use.
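
A sketch of what process-shared reader/writer locking could look like; the slot layout and function names are illustrative assumptions:

    #include <pthread.h>
    #include <string.h>

    /* Hypothetical layout: one rwlock guarding one 100-byte slot. */
    struct shared_slot {
        pthread_rwlock_t rwlock;
        char             data[100];
    };

    /* Creator process: make the lock usable across processes. */
    void slot_init(struct shared_slot *s)
    {
        pthread_rwlockattr_t attr;
        pthread_rwlockattr_init(&attr);
        pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_rwlock_init(&s->rwlock, &attr);
        pthread_rwlockattr_destroy(&attr);
    }

    /* Server: exclusive lock while publishing a new struct. */
    void slot_write(struct shared_slot *s, const void *src, size_t n)
    {
        pthread_rwlock_wrlock(&s->rwlock);
        memcpy(s->data, src, n);
        pthread_rwlock_unlock(&s->rwlock);
    }

    /* Client: many readers may hold the read lock concurrently. */
    void slot_read(struct shared_slot *s, void *dst, size_t n)
    {
        pthread_rwlock_rdlock(&s->rwlock);
        memcpy(dst, s->data, n);
        pthread_rwlock_unlock(&s->rwlock);
    }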

Jens Gustedt

I once implemented an IPC library using shared memory segments; this allowed me to avoid a copy (instead of copying data from sender memory to kernel space, and then from kernel space to receiver memory, I could copy directly from sender to receiver memory).

Anyway, the results weren't as good as I was expecting: sharing a memory segment was actually a really expensive operation, since remapping TLB entries and all the rest is quite costly. See this mail for more details (I'm not one of those guys, but I came across that mail while developing my library).

Results were good only for really big messages (say more than a few megabytes); if you're working with little buffers, Unix sockets are the most optimized thing you can find, unless you are willing to write a kernel module.
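
For small messages like the 100-byte structs in the question, a sender can look as simple as this sketch of an AF_UNIX datagram socket (the socket path is an illustrative assumption; a client would have bound to it beforehand):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Hypothetical 100-byte message, as in the question. */
    struct msg { char payload[100]; };

    int main(void)
    {
        const char *path = "/tmp/mdclient.sock";  /* illustrative path */

        int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
        if (fd == -1) { perror("socket"); return 1; }

        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        struct msg m;
        memset(&m, 0, sizeof(m));

        /* one sendto per struct; each datagram carries exactly one message */
        if (sendto(fd, &m, sizeof(m), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) == -1)
            perror("sendto");

        close(fd);
        return 0;
    }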

peoro

Apart from what's been suggested already, I'd like to offer another method: IPv6 Node/Interface Local Multicast, i.e. a multicast constrained to the loopback interface. http://www.iana.org/assignments/ipv6-multicast-addresses/ipv6-multicast-addresses.xml#ipv6-multicast-addresses-1

At first this might seem quite heavyweight, but most OSes implement loopback sockets with a zero-copy architecture. The page(s) mapped to the buf parameter passed to send are given an additional mapping and marked copy-on-write, so that if the sending program overwrites or deallocates the data therein, the contents are preserved.
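
A rough sketch of a sender constrained to the loopback interface via an interface-local IPv6 group; the group address, port, and payload size are illustrative assumptions (receivers would join the same group with IPV6_JOIN_GROUP):

    #include <arpa/inet.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *group = "ff01::1:2345";  /* illustrative interface-local group */
        const unsigned short port = 5000;

        int fd = socket(AF_INET6, SOCK_DGRAM, 0);
        if (fd == -1) { perror("socket"); return 1; }

        /* constrain outgoing multicast to the loopback interface */
        unsigned int ifindex = if_nametoindex("lo");
        if (setsockopt(fd, IPPROTO_IPV6, IPV6_MULTICAST_IF,
                       &ifindex, sizeof(ifindex)) == -1)
            perror("setsockopt(IPV6_MULTICAST_IF)");

        struct sockaddr_in6 dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin6_family = AF_INET6;
        dst.sin6_port = htons(port);
        inet_pton(AF_INET6, group, &dst.sin6_addr);
        dst.sin6_scope_id = ifindex;

        char payload[100] = {0};  /* one fixed-size struct per datagram */
        if (sendto(fd, payload, sizeof(payload), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) == -1)
            perror("sendto");

        close(fd);
        return 0;
    }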

Instead of passing raw structs you should use a robust data structure. Netstrings (http://cr.yp.to/proto/netstrings.txt) and BSON (http://bsonspec.org/) come to mind.

datenwolf
  • Thanks for the links. The zero-copy reference was indeed helpful. I was not able to find out how RHEL 6 treats local multicast (from the point of view of the zero-copy architecture). Would you have any reference on that issue? – Humble Debugger Jan 30 '11 at 03:05
  • @HumbleDebugger: RHEL is just another Linux distribution, and Linux is one of those kernels, that implements zero copy on the socket buffers. Sorry about answering this so late, but your original comment didn't pop up in my notifications, and I only got to see it today, when my answer was upvoted another time. – datenwolf Feb 12 '13 at 11:33
  • @HumbleDebugger: See http://vger.kernel.org/~davem/skb.html for an introduction into the details. – datenwolf Feb 12 '13 at 11:45
  • Having done it both ways more times than I care to count, using sockets on a new project for IPC would, for me, be like Gandalf's reservations about entering the mines of Moria. You just can't shake the feeling that you're going to run into a Balrog. COW is heavyweight if you frequently write to the pages, because then in addition to the copy you've got the TLB invalidate and, as Linus puts it, "you're squarely in the 'that sucks' category". structs + shmem = easy and top performance, sockets + serialization = complex and slower. I don't know why so many people choose the latter. – Eloff Sep 11 '13 at 14:30
  • @Eloff: Because robustness and integrity do matter in IPC, whereas easy performance usually implies fragility, which is what you want to avoid in IPC. Yes, there are applications for SHM and there are situations where you need raw performance. But if what you desire is two processes communicating without being able to step on each other's toes (think sandboxed workers), then a well channeled socket gives you a clear path of entry for new data to arrive. – datenwolf Sep 11 '13 at 15:33
  • Sure, but you'll end up with much more code. A simple shared memory solution with a simple locking scheme is easier to understand and less prone to bugs. But that's just my opinion and yours is obviously different. – Eloff Sep 19 '13 at 20:53
  • @Eloff: There is no "simple" locking scheme. Every locking scheme is a multiheaded beast. And without a well defined, context separating interface a serious race condition waits at every corner. I've been down that road too many times. Unless you're fighting for the very last clock cycle or are severely I/O limited shared memory should be avoided. Also the implementation of a properly designed zero-copy shared memory solution will take up much, much more code. – datenwolf Sep 19 '13 at 21:01

Choosing between the POSIX shm_open/mmap interface and the older System V shmop one won't make a big difference, because after the initialization system calls you end up with the same situation: a memory area that is shared between various processes. If your system supports it, I'd recommend going with shm_open/mmap, because it is a better-designed interface.

You then use the shared memory area as a common blackboard where all processes can scribble their data. The difficult part is synchronizing the processes that access this area. Here I recommend avoiding concocting your own synchronization scheme, which can be fiendishly difficult and error-prone. Instead, use the existing, working socket-based implementation for synchronizing access between processes, and use the shared memory only for transferring large amounts of data between processes (a small sketch of this pattern follows). Even with this scheme you'll need a central process to coordinate the allocation of buffers, so it is worth it only if you have very large volumes of data to transfer. Alternatively, use a synchronization library, like Boost.Interprocess.
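
A minimal sketch of that pattern, with all names hypothetical: the sender copies the payload into the shared mapping and passes only a small descriptor (offset and length) over an already connected socket, which doubles as the synchronization signal.

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Hypothetical control message sent over the socket: it tells the
       receiver where the real payload sits in the shared segment. */
    struct shm_descriptor {
        uint64_t offset;   /* byte offset into the shared mapping */
        uint64_t length;   /* number of valid bytes at that offset */
    };

    /* Sender side: copy the payload into shared memory, then notify the
       receiver over the control socket (sock_fd). */
    int publish(void *shm_base, uint64_t offset,
                const void *payload, uint64_t length, int sock_fd)
    {
        memcpy((char *)shm_base + offset, payload, length);

        struct shm_descriptor d = { offset, length };
        /* the send() is the synchronization point: the receiver reads the
           descriptor, then reads the payload from its own mapping */
        return send(sock_fd, &d, sizeof d, 0) == (ssize_t)sizeof d ? 0 : -1;
    }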

Diomidis Spinellis