-2

I'm not sure whether xargs can do this, though it seems possible; the solution does not need to use xargs at all. I'd prefer plain bash commands and no Python. It has to work on a massive number of input files (only a toy-size example is shown here), so it must not try to load all of the files' contents into memory up front.

The starting input is 5 filenames in a text file 'docs.txt' all in one column:

[ga@sam ~]$ cat docs.txt
a.1.txt
a.2.txt
b.1.txt
c.1.txt
c.2.txt

The required output is exactly 3 files: Output file a.doc will contain the contents of a.1.txt and a.2.txt in this order. Output file b.doc: b.1.txt's contents. Output file c.doc: Contents of files c.1.txt and c.2.txt in this order.

Currently, xargs receives 3 lines of input, and GNU paste concatenates the contents of the files listed on each line. I'd like xargs to produce exactly 3 text files, one per xargs input line, each named after its group-by value as shown above, but I haven't found the trick.

Here's the code thus far:

[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' | xargs -L 1 paste -s
my cat
has fleas
my dog is clean
the bat
ate a rat
[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' # | xargs -L 1 paste -s
a.1.txt a.2.txt
 b.1.txt
 c.1.txt c.2.txt [ga@sam ~]$
[ga@sam ~]$ cat docs.txt | awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1' | xargs -L 1 -P 0 --process-slot-var=f paste -s > "$f".doc
xargs: unrecognized option '--process-slot-var=f'

The awk step simply performs a group-by (like SQL's GROUP BY) on the first field of each filename, so that each group ends up with exactly one output file.

paste is used here just like cat: all I do is concatenate files sequentially. (Strictly speaking, paste -s joins the lines of each file into a single tab-separated line, so the two match only for one-line files like these.) cat would probably work just as well, if perhaps a bit slower than paste, and the cat commands would look like this across 3 invocations:

cat a.1.txt a.2.txt > a.doc
cat b.1.txt > b.doc
cat c.1.txt c.2.txt > c.doc

But as I tried to explain, I don't want to hard-code 3 cat lines in advance, because the number of output files is determined dynamically, based entirely on the groups found in the input file.

Even if I upgrade xargs to the latest version, I still expect my code as written above to be unable to produce exactly 3 output files. xargs --process-slot-var seems to produce a number of files based on system characteristics rather than 3 in this example; more importantly, the number of output files has to track the number of groups found in the actual application.

If a one-liner won't work, I could perhaps fall back to some kind of looping structure (in awk?) that does variable substitutions and eventually emits one line of bash command per output file. I don't know awk well enough to emit commands. If done this way, I'd prefer to run the lines in parallel with GNU parallel, as there are going to be many millions of output files in this application.
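A minimal sketch of that fallback idea, assuming the members of each group sit on adjacent lines of docs.txt and that the filenames contain no spaces or shell metacharacters. The file name cmds.sh and the scratch directory are made up for illustration; the toy file contents are taken from the sample output above.

```shell
#!/bin/bash
set -eu

# Toy setup in a scratch directory (contents taken from the sample output).
cd "$(mktemp -d)"
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt
printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt > docs.txt

# Emit one "cat <members> > <group>.doc" line per group.
# Assumption: group members are adjacent and names need no shell quoting.
awk -F. '
    $1 != prev {                                  # first line of a new group
        if (NR > 1) print cmd " > " prev ".doc"   # flush the previous group
        cmd = "cat"; prev = $1
    }
    { cmd = cmd " " $0 }                          # append this filename
    END { if (NR > 0) print cmd " > " prev ".doc" }
' docs.txt > cmds.sh

cat cmds.sh      # inspect the generated commands
sh cmds.sh       # run them one after another
```

awk only ever holds the filenames of the current group, so memory use does not grow with the number of input files. As far as I know, the same generated lines could also be fed to GNU parallel (parallel < cmds.sh), which runs each input line as a command when none is given on its command line.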

Thanks for ideas.

Geoffrey Anderson

4 Answers

5

You can use cut and sort to extract the groups, then a while read loop to cat the group files together:

cut -d. -f1 docs.txt |
  sort -u |
  while read -r group; do cat "$group".*.txt > "$group".doc; done

Also, plain bash

while IFS=. read -r group rest; do
    cat "$group.$rest" >> "$group.doc"
done < docs.txt
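One thing to keep in mind with the loop above: >> appends, so running it a second time doubles every .doc file. A possible variant, sketched under the assumption that the members of each group are adjacent in docs.txt, truncates the output whenever the group changes. The function name build_docs and the scratch directory are hypothetical.

```shell
#!/bin/bash
set -eu

# Toy setup in a scratch directory (contents taken from the question).
cd "$(mktemp -d)"
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt
printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt > docs.txt

# build_docs LISTFILE: hypothetical helper, one pass over the list.
build_docs() {
    local prev='' group rest
    while IFS=. read -r group rest; do
        if [ "$group" != "$prev" ]; then
            : > "$group.doc"          # start of a group: create or truncate
            prev=$group
        fi
        cat -- "$group.$rest" >> "$group.doc"
    done < "$1"
}

build_docs docs.txt
build_docs docs.txt    # a second run rebuilds the files instead of doubling them
cat a.doc
```

If a group's lines are scattered through docs.txt, this variant would truncate the group's output midway, so the list would need to be grouped first.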

or plain awk

awk -F. '{
    f = $1 ".doc"
    while (( getline line < $0 ) > 0)
        print line > f
    close($0)
}' docs.txt
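The awk version above never closes its output files, and with millions of groups some awk implementations may run out of file descriptors. A hedged variant that closes the previous output whenever the group changes could look like the sketch below. It assumes group members are adjacent in docs.txt (a reopened output would otherwise be truncated); the scratch directory and toy contents are only for illustration.

```shell
#!/bin/bash
set -eu

# Toy setup in a scratch directory (contents taken from the question).
cd "$(mktemp -d)"
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt
printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt > docs.txt

awk -F. '
{
    f = $1 ".doc"
    if (f != prev) {                  # group changed
        if (prev != "") close(prev)   # release the previous output file
        printf "" > f                 # create or truncate the new output
        prev = f
    }
    # Copy the input file line by line; > keeps appending while f stays open.
    while ((getline line < $0) > 0)
        print line > f
    close($0)                         # release the input file as well
}' docs.txt

cat a.doc b.doc c.doc
```

At any moment only one input file and one output file are open, regardless of how many groups the list contains.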
glenn jackman
1

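Could you please try the following solution?

sort -t'.' -k1 docs.txt | awk -F'.' 'prev!=$1{close(file);file=$1".doc"} {while((getline line < $0)>0) print line > file; close($0); prev=$1}'

Adding a non-one-liner form of the solution too.

sort -t'.' -k1 docs.txt |
awk -F'.' '
  prev!=$1{
    close(file)
    file=$1".doc"
  }
{
  # copy the contents of the file named on this line, not the name itself
  while ((getline line < $0) > 0)
    print line > file
  close($0)
  prev=$1
}'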
RavinderSingh13
1

Untested but should be close:

awk '
    NR==FNR { ARGV[ARGC++]=$0; next }
    FNR==1 { close(out); out=FILENAME; sub(/\..*/,".doc",out) }
    { print >> out }
' docs.txt
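Since the snippet above is marked untested, here is a quick check of it against the toy files from the question. The awk program is copied unchanged; the scratch directory and the printing loop are only scaffolding for the check.

```shell
#!/bin/bash
set -eu

# Scratch directory with the toy files (contents taken from the question).
cd "$(mktemp -d)"
printf 'my cat\n'          > a.1.txt
printf 'has fleas\n'       > a.2.txt
printf 'my dog is clean\n' > b.1.txt
printf 'the bat\n'         > c.1.txt
printf 'ate a rat\n'       > c.2.txt
printf '%s\n' a.1.txt a.2.txt b.1.txt c.1.txt c.2.txt > docs.txt

# The awk program from the answer, unchanged: the first pass over docs.txt
# appends each listed name to ARGV, so awk then reads those files in order.
awk '
    NR==FNR { ARGV[ARGC++]=$0; next }
    FNR==1 { close(out); out=FILENAME; sub(/\..*/,".doc",out) }
    { print >> out }
' docs.txt

# Show each output file under a header line.
for f in a.doc b.doc c.doc; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

Because the program appends with >>, it should be run where no .doc files are left over from an earlier run.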
Ed Morton
-2

The following code is the solution I went with. I developed it myself and did not use anyone else's submissions, partly because they had not yet become available while I was developing it. Thanks for all your responses, answers, and comments in any case. The code below runs fast and does everything that is required. It also has no explicit loops, which is interesting. You might enjoy seeing my final code, since many of you took an interest. Best regards. As policy I withhold upvotes as long as nobody contributes upvotes to me (still zero) for my original question, despite the attention it has attracted; but I happily give back.

#!/bin/bash
# Inputs from tmp subdir
# Outputs to consolidated subdir
# Please run in dir above tmp
# Pipes apparently cannot be array elements, but PASTING works, perhaps because its pipe sits inside a quoted string.
# The head (below) after INFILESSORT is only for dev speed.
# For dev and debugging, remove --max-procs=0, which is there for parallelism.

INFILESFIND=(find tmp -name "*.doc" -type f)
INFILESSORT=(sort -k1 -k2 -t'.')
GROUPING=(awk -F. '{ORS=" "}NR==1 {prev=$1; print; next} prev!=$1{print "\n";}{prev=$1}1')
PASTING=(xargs --max-procs=0 -L 1 -I filenames sh -c 'echo "filenames" | xargs -L 1 paste -s > consolidated/$(echo $(basename "filenames") | cut -f1 -d.).doc')
# The following line executes the script's arrays that were defined above.
"${INFILESFIND[@]}" | "${INFILESSORT[@]}" | "${GROUPING[@]}" | "${PASTING[@]}"
Geoffrey Anderson
  • It won't let me accept this answer for 14 hours. "That's a bold strategy, SO, let's see how it works out for him." We'll see if I find it again and accept the answer (which I accepted in real life) before I move on and forget. – Geoffrey Anderson Nov 21 '18 at 00:05
  • 2
    wrt `As policy I withhold upvotes as long as nobody contributes upvotes to me` - you might want to keep that to yourself or even better reconsider it as you're basically saying "I'll only thank people who try to help me if they first thank me for asking them for help" which is an extremely odd philosophy and not likely to be taken favorably. Also - you got several good answers, consider using one of them instead of the extremely convoluted, buggy, non-portable, and inefficient approach in your own answer. – Ed Morton Nov 23 '18 at 12:53