
I have coded a JPG decoder as such

for each dataunit {
  decode
  transform
  write to rgb buffer
}

Then I coded it with boost threads as such

for each dataunit {
  decode
}
for each dataunit {
  transform
}
for each dataunit {
  write to rgb buffer
}

...running each of these loops on its own thread, with 2 threads running in parallel on a 3-core CPU. But I can't seem to beat the performance of the non-threaded program. Am I missing something?
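For reference, the stage-split above can be contrasted with a data-parallel split, where each thread runs all three stages on its own slice of the data units. This is only a sketch with `std::thread` and placeholder types (`DataUnit`, `decode`, etc. are made-up stand-ins, not the real decoder's names):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Placeholder stand-ins for the real decoder's stages.
struct DataUnit { int coeff = 0; int rgb = 0; };

void decode(DataUnit& u)    { u.coeff += 1; }    // e.g. Huffman decode
void transform(DataUnit& u) { u.coeff *= 2; }    // e.g. inverse DCT
void write_rgb(DataUnit& u) { u.rgb = u.coeff; } // e.g. color conversion

// Each thread runs all three stages on its own disjoint slice of the data
// units, so a unit's data stays in one core's cache from decode to output.
void process_range(std::vector<DataUnit>& units, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        decode(units[i]);
        transform(units[i]);
        write_rgb(units[i]);
    }
}

void process_parallel(std::vector<DataUnit>& units, unsigned nthreads) {
    std::vector<std::thread> pool;
    std::size_t chunk = (units.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t b = t * chunk;
        std::size_t e = std::min(units.size(), b + chunk);
        if (b < e)
            pool.emplace_back(process_range, std::ref(units), b, e);
    }
    for (auto& th : pool)
        th.join();
}
```

The point of this shape is that no data unit ever has to move from one core's cache to another's between stages.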

Do threads hamper the compiler's ability to optimize the program?

Will a non-threaded program still use the 3 cores of my CPU?

Thanks so much for clearing anything up.

Edit: apparently my threads were all accessing the same buffer (though not the same locations in the buffer), and that can cause significant CPU cache-coherency overhead. Each CPU core has its own cache that needs to sync with the other caches whenever the shared buffer changes. I retooled my code to split my buffer into 3 and have each thread work on its own buffer. I was hoping this would solve any cache-coherency problems, but it hasn't seemed to speed up my program. I still cannot beat the serial program with my parallel one.
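One subtlety worth noting: even with separate buffers, threads can still contend if their regions meet on the same cache line ("false sharing"). A common fix is to align each thread's region to the cache-line size. A minimal sketch, assuming 64-byte lines and a made-up workload (this is not the decoder's code):

```cpp
#include <thread>
#include <vector>

// Without the alignas(64), adjacent counters could land on one 64-byte cache
// line; each core's write would then invalidate the others' cached copy
// (false sharing) even though no two threads touch the same address.
// Requires C++17 for over-aligned allocation in std::vector.
struct alignas(64) PaddedAccumulator { long value = 0; };

long accumulate_in_parallel(int nthreads, long iters_per_thread) {
    std::vector<PaddedAccumulator> acc(nthreads);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&acc, t, iters_per_thread] {
            for (long i = 0; i < iters_per_thread; ++i)
                acc[t].value += 1;  // each thread writes only its own cache line
        });
    for (auto& th : pool)
        th.join();
    long total = 0;
    for (const auto& a : acc)
        total += a.value;
    return total;
}
```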

Edit: I'm embarrassed to say that I was measuring the CPU time of my program and not the wall time. Wall time clearly shows my program is ~50% faster when it is threaded. The CPU time of the threaded program is actually higher by ~7%, because it sums the work done by the 3 cores in the CPU (I presume), plus extra overhead from managing the threads.
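The pitfall becomes obvious if both clocks are read around the same work: `std::clock()` accumulates CPU time across all threads of the process, while `std::chrono::steady_clock` measures elapsed wall time. A sketch (the helper name is made up):

```cpp
#include <chrono>
#include <ctime>

struct Timing { double wall_s; double cpu_s; };

// Measures fn() with both clocks. For a program keeping 3 cores busy,
// cpu_s can be roughly 3x wall_s, so comparing CPU times makes the
// threaded version look slower even when it finishes sooner.
template <typename F>
Timing time_both(F&& fn) {
    auto w0 = std::chrono::steady_clock::now();
    std::clock_t c0 = std::clock();
    fn();
    Timing t;
    t.wall_s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - w0).count();
    t.cpu_s = static_cast<double>(std::clock() - c0) / CLOCKS_PER_SEC;
    return t;
}
```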

deanresin
    Are these three steps dependent on each other? Does the first step need to be completed to make input for the second step? Or are they independent? – bosnjak Mar 24 '14 at 02:07
    If you are memory-access bound there is no reason to expect threading to speed things up (indeed it could make things worse due to cache conflicts). These details matter. – dmckee --- ex-moderator kitten Mar 24 '14 at 02:09
  • @Lawrence I have structured it so that threads running concurrently are independent. I group.join_all() the threads once before "writing to rgb buffer" but then I can continue with running concurrent threads. – deanresin Mar 24 '14 at 02:14
  • @dmckee The running threads are independent of each other and will never access the same memory at the same time (no mutual exclusions defined) – deanresin Mar 24 '14 at 02:16
    That is not what I was talking about. I was talking about the raw amount of time it takes to access the memory and about the possibility of cache contention (several contexts fighting to fill the cache and getting in each other's way). – dmckee --- ex-moderator kitten Mar 24 '14 at 02:39
  • Wow, that is a little over my head. Each thread is accessing the same buffer over and over. I guess that must be my problem, to which there is no solution in boost threads. – deanresin Mar 24 '14 at 02:41
  • I think I will split my buffer into 3 and make sure no thread is operating on the same buffer to avoid the cache coherence problem. – deanresin Mar 24 '14 at 05:20
    If you are willing to consider different threading solutions than Boost, I recommend you take a look at Intel's TBB. Your code seems a good match for its `parallel_pipeline` pattern. – Alexey Kukanov Mar 25 '14 at 10:00

2 Answers


Your design is probably inefficient. First, you keep having to pass the data from thread to thread. Second, if one of these three steps takes significantly more time than the other two, the potential maximum benefit is small.
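To put an illustrative number on the second point (made-up stage costs, not measurements of the asker's decoder): a stage pipeline's steady-state throughput is bounded by its slowest stage, so the best-case speedup is the total serial time divided by the bottleneck stage's time.

```cpp
#include <algorithm>

// Best-case speedup of a 3-stage pipeline: in steady state one unit
// completes every max(stage times), versus sum(stage times) serially.
double pipeline_speedup(double s1, double s2, double s3) {
    return (s1 + s2 + s3) / std::max({s1, s2, s3});
}
```

With hypothetical stage costs of 60/25/15 time units, the ceiling is 100/60 ≈ 1.67x no matter how many cores are available.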

David Schwartz
    I'm not passing any data. These are member functions accessing member buffers. The threads access the same buffer but at different locations from the other threads. – deanresin Mar 24 '14 at 02:23
    Umm, that's how you pass data from one thread to another. When a different thread starts to work on the data, it's most likely running on a different core, and then the CPU's cache coherency protocol has to move all the data to the caches associated with the core the new thread is running on. That not only slows that thread down, but it consumes precious inter-core bandwidth. – David Schwartz Mar 24 '14 at 02:32
  • All threads are invoked from within the object and have access to the same member buffer. So you're telling me each CPU core is moving my 3MB buffer around? Why would it do that? Why can't it just access it at its current memory location? – deanresin Mar 24 '14 at 02:38
  • @deanresin It can't because memory is very, very slow compared to the cores. That's why CPUs have caches. Computers would be dozens of times slower if data could only move from one thread or core to another by going through main memory. It would be lunacy to force one core to write data all the way back to main memory just so another core could go all the way to main memory to read it back. Instead, inter-core cache coherency protocols (punch `MESI` into your favorite search engine) are used. – David Schwartz Mar 24 '14 at 02:41
    It's not a bad answer, and it seems Intel has somehow given up on L2 for this reason. If you check the newest Core i7/Xeon parts, they have 256 kB L2 caches (crazy, that's the same size as a Celeron's from 10 years ago). But they have a large L3 cache, which is shared by all cores, so the inter-core bus may be spared. Also, since each of his threads is accessing different addresses anyway, the data may be pulled from memory in order without having to migrate. That's why I don't upvote. – v.oddou Mar 24 '14 at 03:24
  • @v.oddou The second issue is likely the bigger one. – David Schwartz Mar 24 '14 at 03:49
  • I split my buffer into 3 separate buffers and made sure no threads were working on the same buffer at one time to avoid any cache coherency overheads. It didn't make any difference. – deanresin Mar 24 '14 at 06:47

I'm embarrassed to say that I was measuring the CPU time of my program and not the wall time. Wall time clearly shows my program is ~50% faster when it is threaded. The CPU time of the threaded program is actually higher by ~7%, because it sums the work done by the 3 cores in the CPU (I presume), plus extra overhead from managing the threads.

deanresin