A fundamental problem with parallelism is synchronization. It is a performance killer and should be avoided wherever possible.
The OP's architecture
Let's look at your architecture:
+--------------+--------------+
|   Input 1    |   Input 2    |
+--------------+--------------+
|           QUEUE A           |
+--------------+--------------+
|   Scrub 1    |   Scrub 2    |
+--------------+--------------+
|           QUEUE B           |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+
Discussion
Queue A has to be synchronized across four threads, Queue B across five to six. Only one thread can access a queue at any given time, so most of the time your threads will be waiting, not working!
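For reference, a minimal sketch of this layout, assuming Perl ithreads with Thread::Queue, undef sentinels for shutdown, and stand-in bodies for the input, scrub, and compare work:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue_a = Thread::Queue->new();   # contended by 2 input and 2 scrub threads
    my $queue_b = Thread::Queue->new();   # contended by 2 scrub and 3 compare threads

    my @inputs = map {
        my $source = $_;
        threads->create(sub {
            $queue_a->enqueue("$source record $_") for 1 .. 10;   # stand-in for real input
        });
    } qw(input1 input2);

    my @scrubbers = map {
        threads->create(sub {
            while (defined(my $item = $queue_a->dequeue())) {
                (my $clean = $item) =~ s/\s+/ /g;                 # stand-in for real scrubbing
                $queue_b->enqueue($clean);
            }
        });
    } 1 .. 2;

    my @comparers = map {
        threads->create(sub {
            while (defined(my $item = $queue_b->dequeue())) {
                print "compared: $item\n";                        # stand-in for real comparison
            }
        });
    } 1 .. 3;

    $_->join() for @inputs;
    $queue_a->enqueue(undef) for @scrubbers;   # one shutdown sentinel per scrub thread
    $_->join() for @scrubbers;
    $queue_b->enqueue(undef) for @comparers;   # one shutdown sentinel per compare thread
    $_->join() for @comparers;

Every enqueue and dequeue has to take the queue's internal lock, and with four (or five to six) threads hammering the same queue, that lock is exactly where the waiting happens.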
Parallel Pipeline Architecture
A somewhat different architecture could look like this:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
| QUEUE 1B  | | QUEUE 2B  |
+-----+-----+ +-----+-----+
| Cmp | Cmp | | Cmp | Cmp |
+-----+-----+ +-----+-----+
Discussion
Here, each A queue is shared by only two threads (-> less waiting), each B queue by only three. This architecture should perform faster for inputs of similar size and complexity. If Input 2 were considerably shorter, the whole Pipeline 2 would finish before Pipeline 1 is even half done, leaving its threads idle while Pipeline 1 still has work. It is, however, far better than using a single process for each pipeline.
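A sketch of this variant under the same assumptions (Thread::Queue, undef sentinels, stand-in workloads); the sub name build_pipeline() is purely illustrative. Each call wires up its own pair of queues, so no queue is ever touched by another pipeline:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    sub build_pipeline {
        my ($source) = @_;
        my $queue_a = Thread::Queue->new();   # Input N -> Scrub N only
        my $queue_b = Thread::Queue->new();   # Scrub N -> its two comparers only

        my $input = threads->create(sub {
            $queue_a->enqueue("$source record $_") for 1 .. 10;   # stand-in for real input
            $queue_a->enqueue(undef);                             # sentinel for the scrubber
        });

        my $scrub = threads->create(sub {
            while (defined(my $item = $queue_a->dequeue())) {
                (my $clean = $item) =~ s/\s+/ /g;                 # stand-in for real scrubbing
                $queue_b->enqueue($clean);
            }
            $queue_b->enqueue(undef) for 1 .. 2;                  # one sentinel per comparer
        });

        my @compare = map {
            threads->create(sub {
                while (defined(my $item = $queue_b->dequeue())) {
                    print "compared: $item\n";                    # stand-in for real comparison
                }
            });
        } 1 .. 2;

        return ($input, $scrub, @compare);
    }

    my @workers = (build_pipeline('input1'), build_pipeline('input2'));
    $_->join() for @workers;

Because the queues are created inside the sub, sharing between pipelines is impossible by construction.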
The Lawn Sprinkler Architecture
Concept
An even better architecture would distribute the output of each stage across multiple queues. (The reverse, having threads fetch their input from multiple queues, is bad: a thread can block on an empty queue while the other queues still hold work.)
Each successive write should go to a different queue:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
      |   \     /   |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
   /  |  \  \  /  / |   \
+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
|  Cmp  |  Cmp  |  Cmp  |  Cmp  |
+-------+-------+-------+-------+
This makes sure each thread gets roughly the same workload, but it cannot guarantee that all threads finish at the same time.
Discussion
All queues are shared among three threads each. The problem is that two threads will block each other when they write to the same queue at the same time. If the time between write accesses is significantly larger than the write itself, this should be no problem; otherwise, the second architecture can be mixed in.
So whether this architecture makes sense depends on your exact requirements: it is slower for evenly sized inputs, but performs better on irregular input.
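A sketch of the sprinkler idea for the scrub -> compare stage only, again assuming Thread::Queue, undef sentinels, and stand-in workloads; each scrubber simply cycles through the array of B queues:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my @queues_b = map { Thread::Queue->new() } 1 .. 4;   # one B queue per compare thread

    my @scrubbers = map {
        my $id = $_;
        threads->create(sub {
            my $next = 0;
            for my $n (1 .. 20) {                          # stand-in for scrubbed records
                $queues_b[$next]->enqueue("scrub$id record $n");
                $next = ($next + 1) % @queues_b;           # each write goes to the next queue
            }
        });
    } 1 .. 2;

    my @comparers = map {
        my $queue = $_;
        threads->create(sub {
            while (defined(my $item = $queue->dequeue())) {
                print "compared: $item\n";                 # stand-in for real comparison
            }
        });
    } @queues_b;

    $_->join() for @scrubbers;
    $_->enqueue(undef) for @queues_b;                      # one sentinel per compare thread
    $_->join() for @comparers;

The same round-robin trick applies between the inputs and the A queues.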
Appendices
On implementing:
Which framework you use is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to embrace the additional overhead of a real queue (such as Thread::Queue): when adding an unshared variable to a queue, a deep copy has to be made (see Leon Timmermans' answer) in addition to all the synchronization overhead.
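For the text-only case, a pipe sketch could look like this; it uses Perl's implicit fork via open '-|' (see perlipc), with made-up records and a stand-in scrubbing rule. A real setup would read the two pipes concurrently (one reader thread per pipe, or IO::Select) rather than draining them one after the other:

    use strict;
    use warnings;

    # One scrub child per input, each connected to the parent by its own pipe.
    my @handles;
    for my $source (qw(input1 input2)) {
        my $pid = open(my $fh, '-|');             # implicit fork; child's STDOUT -> $fh
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                          # child: input + scrub stage
            for my $n (1 .. 10) {
                my $record = "$source   record   $n";   # stand-in for real input
                $record =~ s/\s+/ /g;                   # stand-in for real scrubbing
                print "$record\n";                      # plain text over the pipe
            }
            exit 0;
        }
        push @handles, $fh;                       # parent keeps the read end
    }

    # Parent: compare stage, draining each pipe in turn.
    for my $fh (@handles) {
        while (my $line = <$fh>) {
            chomp $line;
            print "compared: $line\n";            # stand-in for real comparison
        }
        close $fh;
    }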
On scalability:
Architectures 1 and 3 are not fixed in their number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is to use n to 2n threads, where n is the number of processors (or hardware threads); this can be seen as a sensible upper limit for the threads of one stage. Above that you only pay a memory penalty without any speedup. The saturation point may be reached even earlier, when a stage can process its input faster than it is supplied.
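A crude way to find that saturation point, assuming Thread::Queue, Time::HiRes, and a synthetic fake_compare() workload (both names are illustrative); set $cores to your actual processor count and watch where the run time stops improving:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Synthetic busy-work standing in for the real comparison routine.
    sub fake_compare { my $x = 0; $x += sqrt($_) for 1 .. 5_000; return $x }

    my $cores = 4;                                    # set to your CPU/hardware thread count
    for my $workers (1 .. 2 * $cores) {
        my $queue = Thread::Queue->new(1 .. 2_000);   # pre-filled work items
        $queue->enqueue(undef) for 1 .. $workers;     # one shutdown sentinel per worker

        my $t0 = [gettimeofday];
        my @threads = map {
            threads->create(sub {
                while (defined(my $item = $queue->dequeue())) { fake_compare($item) }
            });
        } 1 .. $workers;
        $_->join() for @threads;

        printf "%2d compare threads: %.3f s\n", $workers, tv_interval($t0);
    }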