
I have a pipe that gives me lines of two quoted, space-separated strings. Using echo to give you an example of the pipe content:

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""

"filename1" "some text 1"
"filename2" "some text 2"

The first string is a filename and the second is the text I want to append to that file. Getting hold of $filename and $text with "read" is easy:

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""|
while read filename text; do echo $text $filename; done

"some text 1" "filename1"
"some text 2" "filename2"

but "parallel" doesn't want to treat the two strings on the line as two parameters. It seems to treat them as one.

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""|
parallel echo {2} {1}

"filename1" "some text 1"
"filename2" "some text 2"

So just having {1} on the line gives the same result

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""|
parallel echo {1}

"filename1" "some text 1"
"filename2" "some text 2"

Adding --colsep ' ' makes it break the strings on every space

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""|
parallel --colsep ' ' echo {2} {1}

"some "filename1"
"some "filename2"

I just could not find an explanation of how to handle this case when piping to parallel in its documentation: https://www.gnu.org/software/parallel/man.html

Adding a --delimiter ' ' option gives this

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""| 
parallel --delimiter ' ' echo {2} {1}

"filename1"
"some
text
1"
"filename2"
"some
text
2"

This is the closest I have found

seq 10 | parallel -N2 echo seq:\$PARALLEL_SEQ arg1:{1} arg2:{2}

seq:1 arg1:1 arg2:2
seq:2 arg1:3 arg2:4
seq:3 arg1:5 arg2:6
seq:4 arg1:7 arg2:8
seq:5 arg1:9 arg2:10

but it doesn't really reflect my data, as seq 10 puts each value on its own line while I have two strings on one line.

1
2
3
4
5
6
7
8
9
10

My current workaround is just to change the pipe to have a comma instead of a space to separate the quoted strings on a line:

echo -e "\"filename1\",\"some text 1\"\n\"filename2\",\"some text 2\""|
parallel --colsep ',' echo {2} {1}

"some text 1" "filename1"
"some text 2" "filename2"

But how to handle this with parallel?

Diego
  • Do you have to use `parallel`? GNU Awk can read the quoted strings properly, though they have spaces inside and can easily parse multi line content – Inian Jan 24 '19 at 06:09
  • Just use `\t` as your field delimiter. It will make many things easier and your bash code shorter;) – liborm Jan 24 '19 at 09:39
  • @Inian actually this is what I am using right now: awk -F, '{gsub("\"","", $0); print($2)>$1".txt"}' but it took some time to research. Parallel was looking so promising also because the task seemed CPU bound. – Diego Jan 24 '19 at 10:34

4 Answers


If you're fine with the quotes being stripped, then the --csv option paired with --colsep will split where you want it to (and still retain all whitespace properly):

echo -e "\"filename1\" \"some text 1\"\n\"filename2 withspaces\" \"some text   2\""|
parallel --csv --colsep=' ' echo arg1:{1} arg2:{2}

outputs:

arg1:filename1 arg2:some text 1
arg1:filename2 withspaces arg2:some text   2

Note: --csv requires installing the Perl Text::CSV module (sudo cpan Text::CSV).

And if you want to keep the quotes, a mix of -q and some extra quotes will add them back:

echo -e "\"filename1\" \"some text 1\"\n\"filename2 withspaces\" \"some text   2\""|
parallel -q --csv --colsep=' ' echo 'arg1:"{1}" arg2:"{2}"'

outputs:

arg1:"filename1" arg2:"some text 1"
arg1:"filename2 withspaces" arg2:"some text   2"

--csv is only in recent versions of parallel (it was added 2018-04-22). If you're on an older parallel, you'd be better off first transforming the input into a format parallel can handle as a preprocessing step (a sketch of that is at the end of this answer). The only way I could see to do it with pure parallel is a really hacky exploitation of shell quoting and mucking with parallel internals:

echo -e "\"filename1\" \"some text 1\"\n\"filename2 with spaces\" \"some text    2\""|
parallel sh -c "'echo arg1:\"\$1\" arg2:\"\$2\"'" echo '{= $Global::noquote = 1 =}'

outputs:

arg1:filename1 arg2:some text 1
arg1:filename2 with spaces arg2:some text    2

How this works I'll leave as an exercise... running with parallel --shellquote will show the command it is constructing under the hood.
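
As for the preprocessing route mentioned above, a minimal sketch (assuming every line really is exactly two double-quoted fields separated by a single space, and that the fields contain no embedded quotes): let sed rewrite the separator to a comma, then split on it with --colsep, just like the comma workaround in the question:

echo -e "\"filename1\" \"some text 1\"\n\"filename2\" \"some text 2\""|
sed 's/^\("[^"]*"\) \("[^"]*"\)$/\1,\2/' |
parallel --colsep ',' echo {2} {1}

which should print the fields with their quotes intact:

"some text 1" "filename1"
"some text 2" "filename2"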

George
  • echo -e "\"filename1\" \"some text 1\"\n\"filename2 withspaces\" \"some text 2\""| parallel --colsep=' ' --csv echo arg1:{1} arg2:{2} **Unknown option: csv** sudo cpan Text::CSV Loading internal null logger. Install Log::Log4perl for logging messages Reading '/home/user/.cpan/Metadata' Database was generated on Thu, 24 Jan 2019 09:41:02 GMT Text::CSV is up to date (1.99). parallel --version GNU parallel 20161222 – Diego Jan 24 '19 at 10:24
  • Oh, looks like the `--csv` was released on 2018-04-22, so you would need a newer version of `parallel`. I added a _really_ hacky potential workaround to my answer, but I don't recommend it... – George Jan 24 '19 at 18:49

When running jobs in parallel you risk race conditions: If two jobs append to the same file at exactly the same time, the content of the file may be garbled.

There are several ways of avoiding that:

Separate workdirs

With separate workdirs, each process will only append to files in its own workdir. When the work is done, the workdirs then have to be merged.

If the inputfile is 1 TB this means you need 2 TB free to run.
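
A minimal sketch of this idea (assuming 4 job slots, a parallel new enough to have the {%} job-slot replacement string, and a hypothetical comma-separated input.txt with an unquoted filename in column 1). Each slot runs only one job at a time, so appends under workdir-{%} can never collide:

mkdir workdir-{1..4}
cat input.txt |
parallel -j4 --colsep ',' 'echo {2} >> workdir-{%}/{1}'
# when all jobs are done, merge the per-workdir files
for d in workdir-*; do
  for f in "$d"/*; do cat "$f" >> "$(basename "$f")"; done
done
rm -r workdir-*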

Put the filenames into bins

If all lines for a given filename are handled by a single process, then no other process will append to that file at the same time. One way to do this is to compute a hash of the filename and distribute lines to workers based on the hash value.

Something similar to:

#!/usr/bin/perl

use B;
use POSIX qw(mkfifo);

# Set the number of bins to use (typically number of cores)
$bins = 9;

# Create the fifos
for(1..$bins) {
    mkfifo("fifo-$_", 0700);
}

if(not fork) {
    # Child: start one processor per fifo
    `parallel -j0 'cat {} | myprocess' ::: fifo-*`;
    exit;
}

# Parent: open a write filehandle to each fifo
# (this blocks until the readers started above have opened them)
for(1..$bins) {
    open $fh{$_}, ">", "fifo-$_" or die "fifo-$_: $!";
}

my @cols;
while(<>) {
    # Get the column with the filename
    # Here we assume the columns are , separated
    # and that the value we need to group on is column 1
    @cols = split(/,/,$_);
    # Compute a hash value of the filename, take it modulo
    # the number of bins, and print the line to that fifo
    print { $fh{ hex(B::hash($cols[0])) % $bins + 1 } } $_;
}

# Cleanup
for(1..$bins) {
    close $fh{$_};
    unlink "fifo-$_";
}
wait;

If the inputfile is 1 TB this means you need 1 TB free to run.

Group the filenames

This is similar to the previous idea, but instead of hashing each line you sort the input file, insert a marker where the filename changes, and let GNU Parallel use the marker as the end of a record. For this to work you need quite a few output files, so that all the records for several files fit in memory at the same time.

If the inputfile is 1 TB this means you need 2 TB free to run.
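
A minimal sketch of this idea, reusing the hypothetical myprocess from the previous section and assuming comma-separated input with the filename in column 1 and a marker string ### that never occurs in the data: sort -s groups the lines by filename, awk prints the marker whenever the filename changes, and --recend makes GNU Parallel cut the stream only at the markers, so the lines for one filename never get split over two jobs:

sort -s -t, -k1,1 input.txt |
awk -F, 'NR > 1 && $1 != prev { print "###" } { print; prev = $1 }' |
parallel -j4 --pipe --regexp --recend '###\n' --rrs myprocess

--rrs removes the matched record separators, so myprocess only sees the data lines.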

Ole Tange

parallel handles quotes/escapes quite properly, so feel free to simplify the input first: just lay it out as interleaved lines and let parallel -n2 further digest it:

$ echo -e '"file 1" "text 1"\n"file 2" "text 2"'
"file 1" "text 1"
"file 2" "text 2"
$ echo -e '"file 1" "text 1"\n"file 2" "text 2"'|sed 's/^"\(.*\)" "\(.*\)"/\1\n\2/'
file 1
text 1
file 2
text 2
$ echo -e "file 1\ntext 1\nfile 2\ntext 2"
file 1
text 1
file 2
text 2

run 1:

$ echo -e "file 1\ntext 1\nfile 2\ntext 2"|parallel -n2 'echo {2} >> {1}'
$ grep . file*
file 1:text 1
file 2:text 2

run 2 (with some quotes):

$ echo -e "file 1\ntext 1 with double-quotes \"\nfile 2\ntext 2 with single-quote '"|parallel -n2 'echo {2} >> {1}'
$ grep . file*
file 1:text 1
file 1:text 1 with double-quotes "
file 2:text 2
file 2:text 2 with single-quote '
Vlad
  • I have thought about this solution but actually I need to write each line of text to the corresponding file and I am afraid of having parallel -n2 echo {2} >> {1}.txt type of approach writing the files all over my filesystem (using text in the "text" part of the line) if the interleaved flow gets out of order. – Diego Jan 24 '19 at 10:17
  • @Diego, if the interleaved flow can "get out of order" (e.g. if you have multi-line texts, etc.) then you're in trouble with any solution, but you can safeguard yourself by prepending a 3rd "marker" line every time (the same way you tried with `arg1`, just the syntax is cleaner here), like this: `parallel -n3 'if [ "{1}" == "marker" ]; then echo {3} >> {2};fi'` – Vlad Jan 28 '19 at 19:30

This is what I ended up doing: awk takes over the field splitting, and the separator character in the preceding pipe output is ",". (By the way, parallel brings a 30x speed-up compared to a naked awk.)

parallel -j4 --pipe -q awk -F, '{ gsub("\\\\\"",""); gsub("\"",""); print($2)>>$1".txt"}'

But the proper answer to my original question about parallel is probably the --csv --colsep ' ' flag combination from @George-P https://stackoverflow.com/a/54340352/4634344. I could not test it yet as my parallel version doesn't yet support the --csv flag.

Diego
  • Is the reason why your version does not support `--csv` covered on https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/ ? – Ole Tange Jan 25 '19 at 12:42
  • My parallel binary came with the Ubuntu. I have successfully installed parallel from the sources before on other machines so that won't be a problem. It just seems that the awk is a more performant way of doing what I wanted to do (echo "$text" >> "$filename") – Diego Jan 27 '19 at 09:01
  • How do you make sure you do not append to the same file in parallel (similar to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P)? If you sort the input by filename then that ought to limit the risk of that happening (so simply `sort input.txt | parallel --pipe ...`) – Ole Tange Jan 28 '19 at 09:30
  • I'm calling "parallel (echo $text >> $filename)" not "(parallel echo $text) >> $filename", does your concern still apply? Sort is not practical because I do not have disk space to cache the intermediate data. – Diego Jan 28 '19 at 09:49
  • Then the concern applies: You risk running two appends in parallel. One way to guard against it is to have each parallel job run in its own dir and then merge the files in the dirs when done. But if you do not have disk space for that then that might be problematic to do. Maybe you can sort blocks and run each block, so that you only have a risk at block border? – Ole Tange Jan 28 '19 at 11:53
  • how about using sem --id atomicwrite echo hi >> file from this answer https://stackoverflow.com/a/32903686 ? – Diego Jan 31 '19 at 10:14
  • More like sem --id $file echo $text >> $file with the semaphore name being the filename. One caveat being there are tens of thousands of those files. And besides having some more logic to debug there will be some IOs for creating and checking those semaphores, right? – Diego Jan 31 '19 at 10:46
  • Sorry, no. You already have opened the file, so you may still lose data. And `sem` is a pretty heavy command, taking 100 ms. – Ole Tange Jan 31 '19 at 16:59
  • OK thanks for that detail. Sorting within blocks may be a way forward then, especially if it will speed up cutting on the number of atomic IOs due to grouping and accumulation of records. – Diego Feb 02 '19 at 05:36