xargs may or may not be able to do this, but it seems possible. The solution does not need to use xargs at all. I would prefer plain bash commands and no Python. It has to work on a massive number of input files (only a toy-size example is shown here), so it must not try to load all the files' contents into memory up front.
The starting input is a text file, 'docs.txt', listing 5 filenames in a single column:
[ga@sam ~]$ cat docs.txt
a.1.txt
a.2.txt
b.1.txt
c.1.txt
c.2.txt
The required output is exactly 3 files: a.doc contains the contents of a.1.txt and a.2.txt, in that order; b.doc contains the contents of b.1.txt; c.doc contains the contents of c.1.txt and c.2.txt, in that order.
Currently, xargs receives 3 lines of input, and GNU paste concatenates the contents of the files listed on each line. I would like xargs to write exactly 3 text files, one per xargs input line, each named after its group-by value as shown above, but I haven't found the trick.
Here's the code thus far:
[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' | xargs -L 1 paste -s
my cat
has fleas
my dog is clean
the bat
ate a rat
[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' # | xargs -L 1 paste -s
a.1.txt a.2.txt
b.1.txt
c.1.txt c.2.txt [ga@sam ~]$
[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' | xargs -L 1 -P 0 --process-slot-var=f paste -s > "$f".doc
xargs: unrecognized option '--process-slot-var=f'
The purpose of awk here is simply to group the filenames by their first dot-separated field (like a SQL GROUP BY), so that each group gets exactly one output file.
The purpose of paste here is the same as cat: I just want to concatenate the files sequentially. (That equivalence holds for single-line files like these; paste -s would join the lines of a multi-line file with tabs.) cat would probably work just as well, if perhaps a bit slower than paste, and the equivalent cat commands across 3 invocations would look like this:
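As a sketch of that group-by step on its own: the awk program below prints one line per group and ends every line, including the last, with a newline and no trailing blank (xargs -L treats a trailing blank as a continuation onto the next line). It assumes the list is already sorted so that files sharing a prefix are adjacent, and that filenames contain no whitespace. The function name group_by_prefix is made up for this illustration, and the input is fed inline with printf only to keep the example self-contained.

```shell
# group_by_prefix is a hypothetical helper name, not part of any tool.
# Assumes input is sorted by prefix and filenames contain no whitespace.
group_by_prefix() {
  awk -F. '
    NR == 1 || $1 != prev {         # first file of a new group
      if (NR > 1) printf "\n"       # close the previous group line
      prev = $1
      printf "%s", $0
      next
    }
    { printf " %s", $0 }            # same group: append to the current line
    END { if (NR > 0) printf "\n" } # terminate the last line
  '
}

printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt | group_by_prefix
```

awk processes the list one line at a time here, so memory use should not grow with the number of filenames.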
cat a.1.txt a.2.txt > a.doc
cat b.1.txt > b.doc
cat c.1.txt c.2.txt > c.doc
As explained above, I don't want to hard-code 3 cat lines in advance, because the number of output files is determined dynamically, based entirely on the groups found in the input file.
Even if I upgrade xargs to the latest version, I still expect my code as written above to be unable to produce exactly 3 output files. xargs --process-slot-var seems to produce a number of files based on system characteristics rather than 3 in this example; more importantly, in the actual application the number of output files must match the number of groups found in the input.
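One detail about the attempt above: the redirection > "$f".doc is processed by the calling shell before xargs even starts, so it cannot vary per input line. A possible workaround, shown here only as a sketch, is to let xargs start a small sh for each line and do the redirection inside it, deriving the output name from the first filename of the group. The setup assumes one line of text per toy file (matching the output shown earlier), filenames without whitespace, and a pre-grouped file called groups.txt, which is a name invented for this example.

```shell
# Toy setup in a scratch directory; file contents are assumed from the
# sample output above (one line per file).
cd "$(mktemp -d)" || exit 1
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt

# groups.txt (hypothetical name): one group per line, no trailing blanks.
printf '%s\n' 'a.1.txt a.2.txt' 'b.1.txt' 'c.1.txt c.2.txt' > groups.txt

# One sh per input line. Inside it, "$@" is the whole group and $1 its
# first file; ${1%%.*} strips everything from the first dot onward.
xargs -L 1 sh -c 'cat -- "$@" > "${1%%.*}.doc"' sh < groups.txt
```

Because each group writes only its own file, adding -P with a process count should let the groups run concurrently on xargs implementations that support that option. A group with a very large number of files could still exceed the command-line length limit for a single line.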
If a one-liner won't work, I could perhaps fall back on some kind of loop (in awk?) that does the variable substitutions and eventually emits one line of bash command per output file. I don't know awk well enough to emit commands. If done this way, I'd prefer to run the emitted lines in parallel (e.g. with GNU parallel), as there are going to be many millions of output files in the actual application.
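A rough sketch of that fallback, under the same assumptions as before (docs.txt sorted by prefix, one line of text per toy file) plus one more: filenames contain no whitespace or shell metacharacters, since they are written into the commands unquoted. awk emits one cat command per group into a file, called commands.sh here purely for illustration, and the file is then executed.

```shell
# Toy setup in a scratch directory (contents assumed, one line per file).
cd "$(mktemp -d)" || exit 1
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt
printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt > docs.txt

# Emit one "cat -- files... > prefix.doc" line per group.
# Filenames are NOT quoted: assumes no whitespace or shell metacharacters.
awk -F. '
  NR == 1 || $1 != prev {
    if (NR > 1) printf " > %s.doc\n", prev   # finish the previous command
    prev = $1
    printf "cat --"                          # start a new command
  }
  { printf " %s", $0 }                       # add this file to the command
  END { if (NR > 0) printf " > %s.doc\n", prev }
' docs.txt > commands.sh

cat commands.sh   # inspect the generated commands
sh commands.sh    # run them sequentially
```

If GNU parallel is installed, replacing the last line with parallel < commands.sh should run the generated lines concurrently, since parallel treats each input line as a command when it is given no command of its own.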
Thanks for ideas.