How to efficiently do scattered summing with SSE/x86

Question

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. Each input record is a destination ID and an XYZ float vector, so something like:

[198, {0.4,0,1}],  [775, {0.25,0.8,0}],  [12, {0.5,0.5,0.02}]

and I need to sum them into memory like so:

memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12]  += {0.5,0.5,0.02}
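
For concreteness, a single-threaded baseline of this operation might look like the sketch below. The names `Vec4`, `Record` and `scatter_add` are made up for illustration, and the fourth float is padding that I'm assuming in order to get 16-byte elements. There is no bounds checking and no synchronization here.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical destination element: XYZ plus one padding float (16 bytes).
struct alignas(16) Vec4 { float v[4]; };

// Hypothetical input record: destination ID plus the vector to add.
struct Record { std::uint32_t id; float x, y, z; };

// Single-threaded baseline: memory[id] += {x, y, z} for every record.
void scatter_add(Vec4* memory, const Record* recs, std::size_t count) {
    // _mm_load_ps / _mm_store_ps require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(memory) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i) {
        const Record& r = recs[i];
        __m128 add = _mm_set_ps(0.0f, r.z, r.y, r.x);  // lanes: x, y, z, 0
        __m128 cur = _mm_load_ps(memory[r.id].v);      // aligned load
        _mm_store_ps(memory[r.id].v, _mm_add_ps(cur, add));
    }
}
```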

To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.

The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?

Thanks for any help. I can assume Nehalem and above, if that makes a difference.

I assume you access a tightly packed array in memory. As the width of each element is 3 floats, most of the elements will not be aligned to 16-byte boundaries, making the SSE moves extremely slow. Also, as your example suggests, there is no simple pattern behind the data access, making it hard to prefetch. In sum, this will make it hard to leverage the potential of SSE. Probably the best bet is to simply do this in C/C++ and let the compiler do the magic; I doubt that there are many possibilities to improve this with SIMD. — Nobody moving away from SE, Jan 28 '12 at 22:40
I can use 4-float vectors in memory to get 16-byte alignment without a problem. How would prefetching help here? It'd make the summing easier for sure, but prefetching+sum+write opens me up to all kinds of race conditions with other threads doing the same thing. — mistermost, Jan 28 '12 at 22:59
Alignment does not matter once the memory has been prefetched; otherwise you have to either use unaligned moves or ensure the data is aligned, which you cannot do when accessing a densely packed array whose elements are 3 floats wide. — Nobody moving away from SE, Jan 28 '12 at 23:05

score 0 · Answer 1 · answered Jan 29 '12 at 00:09

You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. the Streaming SIMD Extensions intrinsics and InterlockedExchange in Visual C++.
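
A minimal sketch of the per-ID spin lock idea follows. It uses `std::atomic_flag` as a portable stand-in for `InterlockedExchange`, and the names `Vec4`, `SpinLock` and `locked_add` are hypothetical. It also assumes each destination element is padded to four floats so that aligned SSE loads and stores can be used.

```cpp
#include <emmintrin.h>  // SSE/SSE2 intrinsics, including _mm_pause
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical destination element: XYZ plus one padding float (16 bytes).
struct alignas(16) Vec4 { float v[4]; };

// Hypothetical spin lock, one instance per destination ID.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock() {
        // Spin until the flag was previously clear; PAUSE eases the busy-wait.
        while (flag.test_and_set(std::memory_order_acquire)) _mm_pause();
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// memory[id] += add, guarded by locks[id].
void locked_add(Vec4* memory, SpinLock* locks, std::size_t id, __m128 add) {
    assert(reinterpret_cast<std::uintptr_t>(&memory[id]) % 16 == 0);
    locks[id].lock();
    __m128 cur = _mm_load_ps(memory[id].v);
    _mm_store_ps(memory[id].v, _mm_add_ps(cur, add));
    locks[id].unlock();
}
```

Each thread would call `locked_add` for its own input records. One lock per ID costs extra memory, so sharing a lock across a small group of neighbouring IDs is a possible variation.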

Dmitry Shkuropatsky

score 0 · Answer 2 · answered Jan 29 '12 at 11:12

Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.

Start one thread per CPU, statically distribute the destination data between these threads, and provide each thread with the same input data. This makes better use of the NUMA architecture and avoids extra memory traffic for thread synchronization.
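
A rough sketch of the per-thread work under this scheme is shown below. The names `Vec4`, `Record` and `apply_owned` are hypothetical, as is the choice to give each thread a contiguous range of IDs. Every thread scans the full input but only writes to the destination range it owns, so no locking is needed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical destination element: XYZ plus one padding float (16 bytes).
struct alignas(16) Vec4 { float v[4]; };

// Hypothetical input record: destination ID plus the vector to add.
struct Record { std::uint32_t id; float x, y, z; };

// Work for one thread: apply only the records whose destination ID lies in
// this thread's private range [first_id, last_id). Records for other ranges
// are skipped, because another thread owns them.
void apply_owned(Vec4* memory, const Record* recs, std::size_t count,
                 std::uint32_t first_id, std::uint32_t last_id) {
    assert(first_id <= last_id);
    for (std::size_t i = 0; i < count; ++i) {
        const Record& r = recs[i];
        if (r.id < first_id || r.id >= last_id) continue;  // not ours
        __m128 add = _mm_set_ps(0.0f, r.z, r.y, r.x);
        __m128 cur = _mm_load_ps(memory[r.id].v);
        _mm_store_ps(memory[r.id].v, _mm_add_ps(cur, add));
    }
}
```

In a real program each call would run in its own thread, pinned to one CPU, with that thread's range allocated in memory local to the CPU. Pinning and NUMA-aware allocation are platform-specific and not shown here.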

On a single-CPU system, use only one thread to access the destination data.

Probably the only practical use for the additional cores in each CPU is to load input data in extra threads.

One obvious optimization is to align the destination data to 16 bytes (to avoid touching two cache lines when accessing a single data element).
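
As an illustration of the alignment point, the sketch below pads each element to four floats and checks that it never straddles a cache line. The names `Vec4`, `in_one_cache_line` and `check_layout` are hypothetical, and the 64-byte line size is an assumption that holds for typical x86 CPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical destination element: XYZ plus one padding float.
struct alignas(16) Vec4 { float x, y, z, pad; };
static_assert(sizeof(Vec4) == 16, "element should be exactly 16 bytes");
static_assert(alignof(Vec4) == 16, "element should be 16-byte aligned");

// Assumed cache line size in bytes.
constexpr std::uintptr_t kCacheLine = 64;

// True if the element's first and last byte fall in the same cache line.
bool in_one_cache_line(const Vec4* p) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    return a / kCacheLine == (a + sizeof(Vec4) - 1) / kCacheLine;
}

// Debug-only check over a whole destination array.
void check_layout(const Vec4* memory, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        assert(in_one_cache_line(&memory[i]));
    }
}
```

Since 16 divides 64, a 16-byte element at a 16-byte aligned address always sits inside one line. A tightly packed 12-byte element would cross a line boundary at some indices.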

You can use SIMD to perform the addition, let the compiler vectorize your code automatically, or just leave this operation completely unoptimized. It doesn't matter; it's nothing compared to the memory bandwidth problems.

As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve or degrade performance, I don't know. It avoids cache thrashing, but leaves most of the cache unused.
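
A sketch of the PREFETCHNTA idea, using the `_mm_prefetch` intrinsic with `_MM_HINT_NTA`, might look like this. The names `Vec4`, `Record`, `scatter_add_prefetch` and the prefetch distance `kAhead` are hypothetical; the distance of 8 is an untested guess that would need tuning on real hardware.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA, SSE arithmetic
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical destination element: XYZ plus one padding float (16 bytes).
struct alignas(16) Vec4 { float v[4]; };

// Hypothetical input record: destination ID plus the vector to add.
struct Record { std::uint32_t id; float x, y, z; };

// Assumed prefetch distance in records; tune by measurement.
constexpr std::size_t kAhead = 8;

void scatter_add_prefetch(Vec4* memory, const Record* recs, std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(memory) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i) {
        // Request the destination needed kAhead records from now, with the
        // non-temporal hint so it displaces as little cached data as possible.
        if (i + kAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(&memory[recs[i + kAhead].id]),
                         _MM_HINT_NTA);
        }
        const Record& r = recs[i];
        __m128 add = _mm_set_ps(0.0f, r.z, r.y, r.x);
        __m128 cur = _mm_load_ps(memory[r.id].v);
        _mm_store_ps(memory[r.id].v, _mm_add_ps(cur, add));
    }
}
```

The prefetch is only a hint, so the result is the same as without it. Only the timing can change, which is why it has to be measured.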
