A fundamental problem with parallelism is synchronization. It is a performance killer and should be avoided wherever possible.
The OP's architecture
Let's look at your architecture:
+--------------+--------------+
|   Input 1    |   Input 2    |
+--------------+--------------+
|           QUEUE A           |
+--------------+--------------+
|   Scrub 1    |   Scrub 2    |
+--------------+--------------+
|           QUEUE B           |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+
Discussion
Queue A has to be synchronized across four threads, Queue B across five to six. Only one thread can access a queue at any given time, so most of the time your threads will be waiting, not working!
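For reference, a minimal sketch of this layout, assuming Perl ithreads with Thread::Queue, undef sentinels for shutdown, and stand-in bodies for the input, scrub, and compare work:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue_a = Thread::Queue->new();   # contended by 2 input and 2 scrub threads
    my $queue_b = Thread::Queue->new();   # contended by 2 scrub and 3 compare threads

    my @inputs = map {
        my $source = $_;
        threads->create(sub {
            $queue_a->enqueue("$source record $_") for 1 .. 10;   # stand-in for real input
        });
    } qw(input1 input2);

    my @scrubbers = map {
        threads->create(sub {
            while (defined(my $item = $queue_a->dequeue())) {
                (my $clean = $item) =~ s/\s+/ /g;                 # stand-in for real scrubbing
                $queue_b->enqueue($clean);
            }
        });
    } 1 .. 2;

    my @comparers = map {
        threads->create(sub {
            while (defined(my $item = $queue_b->dequeue())) {
                print "compared: $item\n";                        # stand-in for real comparison
            }
        });
    } 1 .. 3;

    $_->join() for @inputs;
    $queue_a->enqueue(undef) for @scrubbers;   # one shutdown sentinel per scrub thread
    $_->join() for @scrubbers;
    $queue_b->enqueue(undef) for @comparers;   # one shutdown sentinel per compare thread
    $_->join() for @comparers;

Every enqueue and dequeue has to take the queue's internal lock, and with four (or five to six) threads hammering the same queue, that lock is exactly where the waiting happens.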
Parallel Pipeline Architecture
A somewhat different architecture could look like this:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
| QUEUE 1B  | | QUEUE 2B  |
+-----+-----+ +-----+-----+
| Cmp | Cmp | | Cmp | Cmp |
+-----+-----+ +-----+-----+
Discussion
Here, each A queue is shared by only two threads (-> less waiting), each B queue by only three. This architecture should perform faster for inputs of similar size and complexity. If Input 2 were considerably shorter, the whole Pipeline 2 would finish before Pipeline 1 is even half done, leaving its threads idle while Pipeline 1 still has work. It is, however, far better than using a single process for each pipeline.
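A sketch of this variant under the same assumptions (Thread::Queue, undef sentinels, stand-in workloads); the sub name build_pipeline() is purely illustrative. Each call wires up its own pair of queues, so no queue is ever touched by another pipeline:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    sub build_pipeline {
        my ($source) = @_;
        my $queue_a = Thread::Queue->new();   # Input N -> Scrub N only
        my $queue_b = Thread::Queue->new();   # Scrub N -> its two comparers only

        my $input = threads->create(sub {
            $queue_a->enqueue("$source record $_") for 1 .. 10;   # stand-in for real input
            $queue_a->enqueue(undef);                             # sentinel for the scrubber
        });

        my $scrub = threads->create(sub {
            while (defined(my $item = $queue_a->dequeue())) {
                (my $clean = $item) =~ s/\s+/ /g;                 # stand-in for real scrubbing
                $queue_b->enqueue($clean);
            }
            $queue_b->enqueue(undef) for 1 .. 2;                  # one sentinel per comparer
        });

        my @compare = map {
            threads->create(sub {
                while (defined(my $item = $queue_b->dequeue())) {
                    print "compared: $item\n";                    # stand-in for real comparison
                }
            });
        } 1 .. 2;

        return ($input, $scrub, @compare);
    }

    my @workers = (build_pipeline('input1'), build_pipeline('input2'));
    $_->join() for @workers;

Because the queues are created inside the sub, sharing between pipelines is impossible by construction.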
The Lawn Sprinkler Architecture
Concept
An even better architecture would distribute the output of each stage across multiple queues. (The reverse, having threads fetch their input from multiple queues, is bad: a thread can block on an empty queue while the other queues still hold work.)
Each successive write should go to a different queue:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
      |   \     /   |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
   /  |  \  \  /  / |   \
+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
|  Cmp  |  Cmp  |  Cmp  |  Cmp  |
+-------+-------+-------+-------+
This makes sure each thread gets roughly the same workload, but it cannot guarantee that all threads finish at the same time.
Discussion
All queues are shared among three threads each. The problem is that two threads will block each other when they write to the same queue at the same time. If the time between write accesses is significantly larger than the write itself, this should be no problem; otherwise, the second architecture can be mixed in.
So whether this architecture makes sense depends on your exact requirements: it is slower for evenly sized inputs, but performs better on irregular input.
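A sketch of the sprinkler idea for the scrub -> compare stage only, again assuming Thread::Queue, undef sentinels, and stand-in workloads; each scrubber simply cycles through the array of B queues:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my @queues_b = map { Thread::Queue->new() } 1 .. 4;   # one B queue per compare thread

    my @scrubbers = map {
        my $id = $_;
        threads->create(sub {
            my $next = 0;
            for my $n (1 .. 20) {                          # stand-in for scrubbed records
                $queues_b[$next]->enqueue("scrub$id record $n");
                $next = ($next + 1) % @queues_b;           # each write goes to the next queue
            }
        });
    } 1 .. 2;

    my @comparers = map {
        my $queue = $_;
        threads->create(sub {
            while (defined(my $item = $queue->dequeue())) {
                print "compared: $item\n";                 # stand-in for real comparison
            }
        });
    } @queues_b;

    $_->join() for @scrubbers;
    $_->enqueue(undef) for @queues_b;                      # one sentinel per compare thread
    $_->join() for @comparers;

The same round-robin trick applies between the inputs and the A queues.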
Appendices
On implementing:
Which framework you use is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to embrace the additional overhead of a real queue (such as Thread::Queue): when adding an unshared variable to a queue, a deep copy has to be made (see Leon Timmermans' answer) in addition to all the synchronization overhead.
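For the text-only case, a pipe sketch could look like this; it uses Perl's implicit fork via open '-|' (see perlipc), with made-up records and a stand-in scrubbing rule. A real setup would read the two pipes concurrently (one reader thread per pipe, or IO::Select) rather than draining them one after the other:

    use strict;
    use warnings;

    # One scrub child per input, each connected to the parent by its own pipe.
    my @handles;
    for my $source (qw(input1 input2)) {
        my $pid = open(my $fh, '-|');             # implicit fork; child's STDOUT -> $fh
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                          # child: input + scrub stage
            for my $n (1 .. 10) {
                my $record = "$source   record   $n";   # stand-in for real input
                $record =~ s/\s+/ /g;                   # stand-in for real scrubbing
                print "$record\n";                      # plain text over the pipe
            }
            exit 0;
        }
        push @handles, $fh;                       # parent keeps the read end
    }

    # Parent: compare stage, draining each pipe in turn.
    for my $fh (@handles) {
        while (my $line = <$fh>) {
            chomp $line;
            print "compared: $line\n";            # stand-in for real comparison
        }
        close $fh;
    }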
On scalability:
Architectures 1 and 3 are not fixed in their number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is to use n to 2n threads, where n is the number of processors (or hardware threads); this can be seen as a sensible upper limit for the threads of one stage. Above that you only pay a memory penalty without any speedup. The saturation point may be reached even earlier, when a stage can process its input faster than it is supplied.
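A crude way to find that saturation point, assuming Thread::Queue, Time::HiRes, and a synthetic fake_compare() workload (both names are illustrative); set $cores to your actual processor count and watch where the run time stops improving:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Synthetic busy-work standing in for the real comparison routine.
    sub fake_compare { my $x = 0; $x += sqrt($_) for 1 .. 5_000; return $x }

    my $cores = 4;                                    # set to your CPU/hardware thread count
    for my $workers (1 .. 2 * $cores) {
        my $queue = Thread::Queue->new(1 .. 2_000);   # pre-filled work items
        $queue->enqueue(undef) for 1 .. $workers;     # one shutdown sentinel per worker

        my $t0 = [gettimeofday];
        my @threads = map {
            threads->create(sub {
                while (defined(my $item = $queue->dequeue())) { fake_compare($item) }
            });
        } 1 .. $workers;
        $_->join() for @threads;

        printf "%2d compare threads: %.3f s\n", $workers, tv_interval($t0);
    }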