Multithreaded inline assembly

Question

I'm trying to compute a large number of SHA-256 hashes quickly on a SPARC T4 machine. The T4 has a `sha256` instruction that lets me process a chunk with a single opcode. I created an inline assembly template to call the sha256 opcode:

In my C++ code:

extern "C"
{
   void ProcessChunk(const char* buf, uint32_t* state);
}

pchunk.il:

.inline ProcessChunk,8  
.volatile
  /* copy state */
  ldd [%o1],%f0 /* load 8 bytes */ 
  ldd [%o1 + 8],%f2 /* load 8 bytes */ 
  ldd [%o1 +16],%f4 /* load 8 bytes */ 
  ldd [%o1 +24],%f6 /* load 8 bytes */ 

  /* copy data */
  ldd [%o0],%f8 /* load 8 bytes */ 
  ldd [%o0+8],%f10 /* load 8 bytes */ 
  ldd [%o0+16],%f12 /* load 8 bytes */ 
  ldd [%o0+24],%f14 /* load 8 bytes */ 
  ldd [%o0+32],%f16 /* load 8 bytes */ 
  ldd [%o0+40],%f18 /* load 8 bytes */ 
  ldd [%o0+48],%f20 /* load 8 bytes */ 
  ldd [%o0+56],%f22 /* load 8 bytes */ 

  sha256
  nop

  std %f0, [%o1]
  std %f2, [%o1+8]
  std %f4, [%o1+16]
  std %f6, [%o1+24]
.end

Things work great in a single-threaded environment, but it is not fast enough. I used OpenMP to parallelize the application so that I can call ProcessChunk simultaneously from several threads. The multithreaded version works correctly with a few threads, but when I increase the number of threads (16, for example) I begin to get bogus results.

The inputs to ProcessChunk are both stack variables local to each thread, and I've confirmed that the inputs are generated correctly regardless of the number of threads. If I put ProcessChunk into a critical section, I get correct results, but the performance degrades significantly (a single thread performs better). I'm stumped as to what the problem might be. Is it possible for Solaris threads to step on the floating-point registers of another thread?

Any ideas how I can debug this?
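One way to narrow this down is to compare every result from the parallel run against a single-threaded baseline, so that a corrupted state can be tied to a specific input. The following is only a minimal sketch: `ProcessChunkStandIn`, `Job`, `MakeJobs` and `CountMismatches` are hypothetical names, and the stand-in function is not SHA-256 (it only mixes bytes deterministically so the harness runs on any machine). On the T4 the real ProcessChunk would be called instead. It also assumes the code is built with OpenMP enabled; without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for ProcessChunk so the harness runs on any machine.
// It is NOT SHA-256: it only folds the 64-byte chunk into the 8-word state
// in a deterministic way. On the T4, call the real ProcessChunk instead.
void ProcessChunkStandIn(const char* buf, uint32_t* state) {
    for (int i = 0; i < 64; ++i) {
        uint32_t& word = state[i % 8];
        word = (word ^ static_cast<unsigned char>(buf[i])) * 16777619u;
    }
}

struct Job {
    char chunk[64];     // one 64-byte input block
    uint32_t state[8];  // 256-bit running state
};

// Builds `count` distinct, reproducible inputs.
std::vector<Job> MakeJobs(std::size_t count) {
    std::vector<Job> jobs(count);
    for (std::size_t j = 0; j < count; ++j) {
        for (std::size_t i = 0; i < 64; ++i)
            jobs[j].chunk[i] = static_cast<char>((j * 131 + i * 7) & 0xff);
        for (std::size_t i = 0; i < 8; ++i)
            jobs[j].state[i] = static_cast<uint32_t>(j + i);
    }
    return jobs;
}

// Returns how many jobs end with a different state in the parallel run than
// in the single-threaded baseline. Zero means the two runs agree.
std::size_t CountMismatches(std::size_t count) {
    const std::vector<Job> inputs = MakeJobs(count);

    std::vector<Job> expected = inputs;  // serial baseline
    for (std::size_t j = 0; j < count; ++j)
        ProcessChunkStandIn(expected[j].chunk, expected[j].state);
    assert(expected.size() == inputs.size());

    std::vector<unsigned char> differs(count, 0);
    const long n = static_cast<long>(count);
    #pragma omp parallel for
    for (long j = 0; j < n; ++j) {
        // Thread-local stack copies, as described in the question.
        char chunk[64];
        uint32_t state[8];
        std::memcpy(chunk, inputs[j].chunk, sizeof chunk);
        std::memcpy(state, inputs[j].state, sizeof state);
        ProcessChunkStandIn(chunk, state);
        differs[j] = std::memcmp(state, expected[j].state, sizeof state) != 0;
    }

    std::size_t mismatches = 0;
    for (unsigned char d : differs) mismatches += d;
    return mismatches;
}
```

Calling `CountMismatches(100000)` at several thread counts (for example by setting `OMP_NUM_THREADS`) shows whether failures start at a particular count. Recording the failing indices rather than only the total would be a natural next step.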

Regards

Update:

I changed the code to use quad-sized (16-byte) loads and stores:

.inline ProcessChunk,8
.volatile
  /* copy state */
  ldq [%o1],    %f0
  ldq [%o1 +16],%f4

  /* copy data */
  ldq [%o0],   %f8
  ldq [%o0+16],%f12
  ldq [%o0+32],%f16
  ldq [%o0+48],%f20

  lzd %o0,%o0
  nop

  stq %f0, [%o1]
  stq %f4, [%o1+16]
.end

At first glance the issue seems to have gone away. Performance degrades significantly beyond 32 threads, so that is the number I'm sticking with (for the moment at least), and with the current code I seem to be getting correct results. I have probably just masked the issue, so I'm going to run further tests.

Update 2:

I found some time to go back to this and was able to get decent results from the T4 (tens of millions of hashes in a minute).

The changes I made were:

Used assembly instead of inline assembly
As the functions were leaf functions, I didn't touch the register window

I packed everything up in a library and made the code available here

Yes. I turned off openmp optimizations using `-xopenmp=noopt`. — Mustafa Ozturk, Jan 31 '14 at 22:28
I assume you are using C and used assembly for the hash. When you say you "Used assembly instead of inline assembly", did you rewrite everything in assembly, or do you call a binary written in assembly? Any advice on how to implement this on x86 machines? — winux, Jan 28 '19 at 09:00

score 1 · Answer 1 · edited Feb 03 '14 at 14:30

I'm not a SPARC architecture expert (I might be wrong), but here's my guess:

Your inline assembly code loads the stack variables into a specific set of floating-point registers so that it can invoke the sha256 assembly operation.

How does this work for two threads? Both calls to ProcessChunk will try to copy different input values into the very same CPU registers.

The way I normally see it, CPU registers in asm code are like "global" variables in a high-level programming language.

How many cores does your system have? Maybe you are fine as long as you have one thread per core (that is, per set of hardware registers). But that would also imply that the behavior of the code depends on how the threads are scheduled on the different cores of your system.

Do you know how the system behaves when it schedules threads from the same process on a CPU core? What I mean is: does the system store the registers of the unscheduled thread, like in a context switch?

A test I would run is to spawn a number of threads equal to the number N of CPU cores and then run the same test with N+1 threads (my assumption here is that there is one floating-point register set per CPU core).
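That experiment could be sketched roughly as follows, with an explicit thread count instead of OpenMP so that the number of workers is exactly what was requested. All names here (`MixChunk`, `HardwareThreads`, `FailingWorkers`) are hypothetical, and `MixChunk` is a portable stand-in rather than the real sha256 routine, so this only illustrates the shape of the test. Note that `std::thread::hardware_concurrency()` reports hardware threads, which may be more than the number of physical cores.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for ProcessChunk (NOT SHA-256); it only mixes the
// 64-byte chunk into the 8-word state deterministically.
void MixChunk(const char* buf, uint32_t* state) {
    for (int i = 0; i < 64; ++i) {
        uint32_t& word = state[i % 8];
        word = (word ^ static_cast<unsigned char>(buf[i])) * 16777619u;
    }
}

// Hardware threads reported by the runtime. This may exceed the number of
// physical cores; 0 means "unknown", so fall back to 1.
unsigned HardwareThreads() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Starts `threads` workers. Each repeats the same computation `rounds` times
// on its own stack buffers and compares it with a reference computed on the
// calling thread. Returns the number of workers that saw a wrong result.
unsigned FailingWorkers(unsigned threads, unsigned rounds) {
    assert(threads > 0);

    char input[64];
    for (int i = 0; i < 64; ++i) input[i] = static_cast<char>(i * 3 + 1);
    uint32_t reference[8] = {0};
    MixChunk(input, reference);

    std::vector<unsigned char> failed(threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&failed, &input, &reference, t, rounds] {
            for (unsigned r = 0; r < rounds; ++r) {
                char chunk[64];           // thread-local input copy
                uint32_t state[8] = {0};  // thread-local state
                std::memcpy(chunk, input, sizeof chunk);
                MixChunk(chunk, state);
                if (std::memcmp(state, reference, sizeof state) != 0)
                    failed[t] = 1;  // each worker writes only its own slot
            }
        });
    }
    for (std::thread& w : workers) w.join();

    unsigned count = 0;
    for (unsigned char f : failed) count += f;
    return count;
}
```

Comparing `FailingWorkers(n, rounds)` with `FailingWorkers(n + 1, rounds)`, where `n` is the core count (or `HardwareThreads()`), would show whether failures appear only once threads have to share a register set.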

I'm making an assumption that is not correct. My gut feeling is that each thread should have its own 'virtual' registers. Consider the PC register. If a thread could overwrite the PC of another thread, bad things would happen. The operating system should be responsible for context switching (*I think*). The machine has two processors and 8 cores per processor. I get bad results at 4 threads. — Mustafa Ozturk, Feb 01 '14 at 14:52
You got my point, and what you said would make absolute sense; I'm simply not sure how Solaris/SPARC works in that respect. — sergico, Feb 01 '14 at 14:57
