I'm trying to compute a large number of SHA-256 hashes quickly on a T4 machine. The T4 has a 'sha256' instruction that lets me process a 64-byte chunk with a single opcode. I created an inline assembly template to call the sha256 opcode.
In my C++ code:
extern "C"
{
void ProcessChunk(const char* buf, uint32_t* state);
}
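For checking the hardware results, a software reference with the same signature can be useful. The following is only a sketch, assuming the opcode performs the standard SHA-256 compression step on one 64-byte chunk; `ProcessChunkRef` and `HashShortMessage` are hypothetical names that are not part of my code, and the state words are treated as native `uint32_t` values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// SHA-256 round constants (FIPS 180-4).
static const uint32_t kK[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

static inline uint32_t Rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

// Hypothetical software stand-in with the same signature as ProcessChunk:
// folds one 64-byte chunk into the eight 32-bit state words.
void ProcessChunkRef(const char* buf, uint32_t* state) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(buf);
    uint32_t w[64];
    for (int i = 0; i < 16; ++i)  // message words are read big-endian
        w[i] = (uint32_t(p[4 * i]) << 24) | (uint32_t(p[4 * i + 1]) << 16) |
               (uint32_t(p[4 * i + 2]) << 8) | uint32_t(p[4 * i + 3]);
    for (int i = 16; i < 64; ++i) {  // message schedule expansion
        uint32_t s0 = Rotr(w[i - 15], 7) ^ Rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
        uint32_t s1 = Rotr(w[i - 2], 17) ^ Rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    for (int i = 0; i < 64; ++i) {  // 64 compression rounds
        uint32_t t1 = h + (Rotr(e, 6) ^ Rotr(e, 11) ^ Rotr(e, 25)) + ((e & f) ^ (~e & g)) + kK[i] + w[i];
        uint32_t t2 = (Rotr(a, 2) ^ Rotr(a, 13) ^ Rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}

// Hashes a message shorter than 56 bytes, which fits in one padded chunk.
void HashShortMessage(const char* msg, size_t len, uint32_t out[8]) {
    assert(len < 56);
    static const uint32_t kInit[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                                      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
    alignas(16) char block[64] = {0};
    std::memcpy(block, msg, len);
    block[len] = static_cast<char>(0x80);  // padding marker
    uint64_t bits = static_cast<uint64_t>(len) * 8;
    for (int i = 0; i < 8; ++i)  // 64-bit big-endian bit length at the end
        block[63 - i] = static_cast<char>((bits >> (8 * i)) & 0xff);
    std::memcpy(out, kInit, sizeof kInit);
    ProcessChunkRef(block, out);
}
```

Calling `HashShortMessage("abc", 3, digest)` should leave the well-known SHA-256 test vector in `digest`, so swapping `ProcessChunkRef` for `ProcessChunk` in that helper gives a quick way to check the opcode's output against a known value.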
pchunk.il:
.inline ProcessChunk,8
.volatile
/* copy state */
ldd [%o1],%f0 /* load 8 bytes */
ldd [%o1 + 8],%f2 /* load 8 bytes */
ldd [%o1 +16],%f4 /* load 8 bytes */
ldd [%o1 +24],%f6 /* load 8 bytes */
/* copy data */
ldd [%o0],%f8 /* load 8 bytes */
ldd [%o0+8],%f10 /* load 8 bytes */
ldd [%o0+16],%f12 /* load 8 bytes */
ldd [%o0+24],%f14 /* load 8 bytes */
ldd [%o0+32],%f16 /* load 8 bytes */
ldd [%o0+40],%f18 /* load 8 bytes */
ldd [%o0+48],%f20 /* load 8 bytes */
ldd [%o0+56],%f22 /* load 8 bytes */
sha256
nop
std %f0, [%o1]
std %f2, [%o1+8]
std %f4, [%o1+16]
std %f6, [%o1+24]
.end
Things work great in a single-threaded environment, but it is not fast enough. I used OpenMP to parallelize the application so that several threads can call ProcessChunk simultaneously. The multithreaded version works OK with a few threads, but when I increase the number of threads (to 16, for example) I begin to get bogus results. Both inputs to ProcessChunk are stack variables local to each thread, and I've confirmed that the inputs are generated correctly no matter how many threads run. If I put ProcessChunk into a critical section I get correct results, but performance degrades significantly (a single thread performs better). I'm stumped on what the problem might be. Is it possible for Solaris threads to step on the floating-point registers of another thread?
Any ideas how I can debug this?
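One approach I can think of is a harness that runs every work item serially, runs it again under OpenMP, and counts the items whose results differ. The sketch below is only an illustration: `ChunkFn`, `MixChunk`, `MakeInput` and `CountMismatches` are hypothetical names, `MixChunk` is a trivial stand-in so the code runs without the sha256 opcode, and the `alignas(16)` on the buffers is my assumption to satisfy the alignment that doubleword and quad loads expect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Signature shared by ProcessChunk and any stand-in used for testing.
typedef void (*ChunkFn)(const char* buf, uint32_t* state);

// Hypothetical stand-in: mixes the 64-byte chunk into the state deterministically.
void MixChunk(const char* buf, uint32_t* state) {
    for (int i = 0; i < 64; ++i)
        state[i % 8] = state[i % 8] * 31u + static_cast<unsigned char>(buf[i]);
}

// Fills a chunk and a starting state that depend only on the work-item index.
static void MakeInput(int item, char* buf, uint32_t* state) {
    for (int i = 0; i < 64; ++i) buf[i] = static_cast<char>((item * 7 + i) & 0x7f);
    for (int i = 0; i < 8; ++i) state[i] = static_cast<uint32_t>(item + i);
}

// Runs every item serially, then again in parallel, and counts differing results.
int CountMismatches(ChunkFn fn, int items) {
    assert(items >= 0);
    std::vector<uint32_t> expected(static_cast<size_t>(items) * 8);
    std::vector<uint32_t> actual(static_cast<size_t>(items) * 8);

    for (int n = 0; n < items; ++n) {  // serial reference pass
        alignas(16) char buf[64];
        alignas(16) uint32_t state[8];
        MakeInput(n, buf, state);
        fn(buf, state);
        std::memcpy(&expected[static_cast<size_t>(n) * 8], state, sizeof state);
    }

    // Parallel pass; the pragma is ignored if OpenMP is not enabled.
    #pragma omp parallel for
    for (int n = 0; n < items; ++n) {
        alignas(16) char buf[64];        // stack variables local to each thread
        alignas(16) uint32_t state[8];
        MakeInput(n, buf, state);
        fn(buf, state);
        std::memcpy(&actual[static_cast<size_t>(n) * 8], state, sizeof state);
    }

    int mismatches = 0;
    for (int n = 0; n < items; ++n) {
        size_t off = static_cast<size_t>(n) * 8;
        if (std::memcmp(&expected[off], &actual[off], 8 * sizeof(uint32_t)) != 0)
            ++mismatches;
    }
    return mismatches;
}
```

Passing `ProcessChunk` instead of `MixChunk` and raising the thread count should show whether the mismatches scale with the number of threads, and logging the failing item indices would show whether they cluster on particular threads.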
Regards
Update:
I changed the code to use quad-sized (16-byte) loads and stores:
.inline ProcessChunk,8
.volatile
/* copy state */
ldq [%o1], %f0
ldq [%o1 +16],%f4
/* copy data */
ldq [%o0], %f8
ldq [%o0+16],%f12
ldq [%o0+32],%f16
ldq [%o0+48],%f20
sha256
nop
stq %f0, [%o1]
stq %f4, [%o1+16]
.end
At first glance the issue seems to have gone away. Performance degrades significantly beyond 32 threads, so that is the number I'm sticking with (for the moment at least), and with the current code I seem to be getting correct results. I may have only masked the issue, so I'm going to run further tests.
Update 2:
I found some time to go back to this and was able to get decent results from the T4 (tens of millions of hashes per minute).
The changes I made were:
- Used standalone assembly functions instead of inline assembly templates
- Because the functions are leaf functions, I left the register window untouched
I packed everything up in a library and made the code available here