Is it possible to distribute STDIN over parallel processes?

Question

Given the following example input on STDIN:

foo
bar bar
baz
===
qux
bla
===
def
zzz yyy

Is it possible to split it on the delimiter (in this case '===') and feed it over stdin to a command running in parallel?

So the example input above would start three parallel processes (for example, three instances of a command called do.sh), each receiving one part of the data on STDIN, like this:

do.sh (instance 1) receives this over STDIN:

foo
bar bar
baz

do.sh (instance 2) receives this over STDIN:

qux
bla

do.sh (instance 3) receives this over STDIN:

def
zzz yyy

I suppose something like this is possible using xargs or GNU parallel, but I do not know how.

Ole Tange · Accepted Answer · 2011-02-17T14:48:08.037

13

GNU Parallel can do that from version 20110205.

cat | parallel --pipe --recend '===\n' --rrs do_stuff
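To get exactly one process per section, as in the question, the command above can be wrapped as in the sketch below. It assumes GNU Parallel 20110205 or later is installed. The wrapper name `per_section` is made up for illustration, and `-N1` and `--keep-order` are additions to the original command. As far as I know, `--pipe` packs as many whole records as fit into each block (`--block`, 1 MB by default), so a small input may otherwise end up in a single job; `-N1` is meant to force one record per job.

```shell
# Sketch: run one instance of a command per "==="-delimited section of stdin.
# Assumes GNU Parallel >= 20110205. "per_section" is a hypothetical wrapper
# name; -N1 and --keep-order are additions to the command in the answer.
per_section() {
    # --pipe        feed input to the command's stdin, not as arguments
    # --recend      string that ends a record (the delimiter line)
    # --rrs         remove the record separator before passing data on
    # -N1           one record per job (assumption: needed for small inputs)
    # --keep-order  print job output in input order
    parallel --pipe --recend '===\n' --rrs -N1 --keep-order "$@"
}
```

Usage would look like `per_section ./do.sh < input.txt`, where `./do.sh` and `input.txt` stand in for your own command and data.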

edited Feb 17 '11 at 14:48

answered Jan 11 '11 at 14:26


This answer could use more explanation: `--pipe` makes parallel write the input to the command's stdin instead of passing it as arguments, `--recend` is the record end, i.e. the string to split on, and `--rrs` (short for `--remove-rec-sep`) removes the record separator, here the `'===\n'`, before the data reaches the command. Beyond these, `--keep-order` is useful. – Caesar Nov 15 '18 at 05:52

score 3 · Answer 2 · edited Apr 13 '17 at 12:13

In general, no. One of the reasons for this assessment is that standard I/O reading from files, rather than the terminal, reads blocks of data - BUFSIZ bytes at a time, where BUFSIZ is usually a power of 2 such as 512 or larger. If the data is in a file, one process would read the whole file shown - the others would see nothing if they shared the same open file description (similar to a file descriptor, but several file descriptors can share the same open file description, and could be in different processes), or would read the whole same file if they did not share the same open file description.

So, you need a process to read the file that knows it needs to parcel the information out to the three processes - and it needs to know how it is to connect to the three processes. It might be that your distributor program runs the three processes and writes to their separate pipe inputs. Or it could be that the distributor connects to three sockets and writes to the different sockets.

Your example doesn't show/describe what would happen if there were 37 sections separated by the marker.

I have a home-brew program called tpipe that is like the Unix tee command, but it writes a copy of (all of) its standard input to each of the processes, and to standard output too by default. This might be a suitable basis for what you need (it at least covers the process management part of it). Contact me if you want a copy - see my profile.

If you are using Bash, you can use regular tee with process substitution to simulate tpipe. See this article for an illustration of how.

See also SF 96245 for another version of the same information - plus a link to a program called pee that is quite similar to tpipe (same basic idea, slightly different implementation in various respects).

I wrote tpipe and hadn't heard of pee before. But it doesn't surprise me that someone else had the same basic requirement and implemented it. I'm not sure if you can guess how hard it is to search for 'pee' via Google (even 'site:gnu.org pee' turns up spam)! So, without a URL to the software, I cannot compare and contrast for you. — Jonathan Leffler, Jan 11 '11 at 22:13
http://serverfault.com/questions/96245/linux-debian-what-does-pee-in-moreutils-do shows `pee` in use and shows how you do not need `pee` in `bash`: `cat file | tee >(command1 >out1) >(command2 >out2)` — Ole Tange, Jan 16 '11 at 22:12
@Ole: thanks for the URL. I note that `pee` has different semantics from `tpipe` on several counts: most notably, `tpipe` keeps writing to available pipes until they are all closed, rather than stopping on the first error as `pee` does. The `bash` facilities are good if `bash` is reliably available on all platforms of interest, saving me the job. (The URL in the tail of my answer points to the same notation, but not the same article as yours.) — Jonathan Leffler, Jan 16 '11 at 23:17

score 1 · Answer 3 · answered Jan 11 '11 at 16:34

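You can do this using named pipes (FIFOs). A named pipe is a pipe that has a name in the filesystem, so a program can open it like a regular file. You can create several named pipes, write each part of the input to its own pipe, and have a separate program read from each one.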

I'm not all that familiar with named pipes, but I've used them from time to time in situations like this.
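A rough sketch of the named-pipe idea, in plain sh, might look like this. The helper name `distribute` and the FIFO file names are invented for the example, it treats a line that is exactly `===` as the delimiter, and it does no error handling if a worker exits early. Each section gets its own FIFO and its own background worker; the loop writes one section at a time, so a worker keeps running while later sections are being handed out.

```shell
# Sketch only: "distribute" is a hypothetical helper, not an existing tool.
# It splits stdin on lines that are exactly "===" and feeds each section to
# its own background instance of the given command through a named pipe.
distribute() (
    dir=$(mktemp -d) || exit 1
    n=0

    next_section() {
        n=$((n + 1))
        mkfifo "$dir/$n" || exit 1
        "$@" < "$dir/$n" &      # the worker reads its section from the FIFO
        exec 3> "$dir/$n"       # blocks until the worker has opened the FIFO
    }

    next_section "$@"
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$line" = "===" ]; then
            exec 3>&-           # closing the write end gives the worker EOF
            next_section "$@"
        else
            printf '%s\n' "$line" >&3
        fi
    done
    exec 3>&-
    wait                        # let every worker finish
    rm -rf "$dir"
)
```

It would be called as `distribute ./do.sh < input.txt`. If the workers write to the same stdout, their output can interleave, so in practice each worker probably wants its own output file.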
