
I cannot fix my spaces-in-filenames problem in this scenario with the usually recommended switches: -print0 for GNU find and -0 for GNU parallel and GNU xargs.

I succeeded in combining find, parallel in pipe mode, and xargs to run commands in parallel on "blocks" of file names, for 100k+ files. I use echo and ls in the examples below, but I plan to use my own Python command. Note that I want each command instance to handle more than one file because starting my program has significant overhead, hence the use of parallel with --pipe, --block, etc. The command

find ./dirNames/ -type f | parallel --pipe --block 100 -j4 --round-robin "echo \"Start *****\"; cat ; echo \"Done *****\""

results in

Start *****
./dirNames/bbbbbbbbbbbbbbbb
./dirNames/dddddddddddddddddddd
./dirNames/aaaaaaaaaaaaaaaa
Done *****
Start *****
./dirNames/cccccccc cccccccc
./dirNames/eeeeeeeeeeeeeeeeeeee
Done *****

as desired. The command is run twice: once with 3 files and once with 2 files. If I try this with xargs and ls, I run into the classic space-in-filename problem:

find dirNames/ -type f | parallel --pipe --block 40 -j4 --round-robin "echo \"Start *****\"; xargs ls -l ; echo \"Done *****\""

Resulting in this

Start *****
-rw-rw-r-- 1 robert robert 0 Jun 24 10:10 dirNames/bbbbbbbbbbbbbbbb
-rw-rw-r-- 1 robert robert 0 Jun 25 16:11 dirNames/eeeeeeeeeeeeeeeeeeee
Done *****
Start *****
-rw-rw-r-- 1 robert robert 0 Jun 24 10:10 dirNames/aaaaaaaaaaaaaaaa
Done *****
Start *****
-rw-rw-r-- 1 robert robert 0 Jun 25 16:11 dirNames/dddddddddddddddddddd
Done *****
Start *****
Done *****
ls: cannot access 'dirNames/cccccccc': No such file or directory
ls: cannot access 'cccccccc': No such file or directory

which, in this scenario, I seem unable to fix using switches like -print0 for find and -0 for parallel and xargs, as is usually recommended for this problem. parallel seems to be confused by the output of find with -print0. Please advise, as I have truly run out of ideas :(

1 Answer


This is the answer I posted to the GNU Parallel mailing list.

I think you need to use --recstart '\0' instead of --null on parallel. And I think you'll run into problems when you run your python script with the file names on the command line - I've used ls here to demonstrate a possible solution to that also.

$ find -type f -print0 | parallel --keep-order --no-run-if-empty --pipe --blocksize 15 --recstart '\0' --roundrobin \
  "echo start {#}; xargs -0r ls -Q 2>&- | xargs -rt ls --fu; echo end {#}"
start 1
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:39:02.916427000 -0700 ./a
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:40:33.076957000 -0700 ./g
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:40:33.096995000 -0700 ./i
end 1
ls --fu ./a ./g ./i 
start 2
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:39:02.916552000 -0700 ./b c
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:40:33.076553000 -0700 ./f
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:40:33.077123000 -0700 ./h
end 2
ls --fu './b c' ./f ./h 
start 3
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:39:02.916633000 -0700 ./d
-rw-r--r-- 1 larry wheel 0 2021-06-27 12:40:33.076273000 -0700 ./e
end 3
ls --fu ./d ./e 

Note the suppression of stderr on the first ls - without it there are error messages from ls about not being able to list file attributes on a null file name.
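
The empty file name mentioned above can be reproduced without parallel. This is a minimal sketch, assuming GNU xargs; the helper name `count_args` and the sample names are made up for illustration. A block that begins with a NUL, which is what splitting at `--recstart '\0'` tends to produce for every block after the first, makes `xargs -0` pass an empty first argument. A block in which every name is followed by a NUL does not.

```shell
#!/bin/sh
# count_args (hypothetical helper): read NUL-delimited names on stdin and
# print how many arguments xargs -0 hands to the command it runs.
count_args() {
    xargs -0 sh -c 'echo "$#"' sh
}

# Block that begins with the separator: "", "b c" and "f" are passed.
printf '\0b c\0f' | count_args

# Block where every name is terminated by the separator: "b c" and "f".
printf 'b c\0f\0' | count_args
```

With GNU xargs this should print 3 and then 2. The extra argument in the first case is the empty name that ls complains about.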

Larry
  • Thank you again Larry and also for circling back to the question in its original form here. It solves my problem. – Normand Robert Jun 29 '21 at 14:19
  • Most likely you really want `--recend` instead of `--recstart`. – Ole Tange Jun 30 '21 at 18:45
  • Yes, `--recend` gets rid of the need to suppress stderr from the first `ls`, at the very least, and is more aligned with what `-print0` does on find. – Larry Jul 01 '21 at 20:28
  • What does that modification end up looking like? I did not understand. – Normand Robert Jul 12 '21 at 18:38
  • For example ```find -type f -print0 | parallel --pipe --blocksize 15 --recend '\0' --roundrobin ...``` – Larry Jul 14 '21 at 00:56
  • Example of use based on all the helpful advice:

        find ./dirNames -type f -print0 | parallel --pipe --blocksize 145 --recend '\0' --roundrobin "xargs -0 ls -Q 2>&- | xargs ./multiple_files_possibly_with_spaces_as_args.sh \"output dir\""

    where ./multiple_files_possibly_with_spaces_as_args.sh is the shell script:

        #!/usr/bin/env bash
        echo "First argument=$1"
        shift
        echo "Start chunk"
        echo "Number of file(s) in this chunk \$#=$#"
        j=0
        # Critical to use "$@"
        for file in "$@"; do
            echo "ls -l \$file($j) = ls -l "$file" = $(ls -l "$file")"
            ((j+=1))
        done
        echo "End chunk"
        echo ""

    – Normand Robert Nov 16 '21 at 19:35
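
With `--recend '\0'` each block is a clean NUL-terminated list, so it may be possible to skip the `ls -Q` re-quoting step and hand the names to the script directly with `xargs -0`. Below is a minimal sketch under that assumption, with the parallel stage replaced by `printf` so that it runs on its own. `show_args` is a hypothetical stand-in for the real script, and GNU xargs is assumed.

```shell
#!/bin/sh
# show_args (hypothetical stand-in for the real script): print the fixed
# first argument, then each file name in brackets, one per line.
show_args() {
    xargs -0 sh -c '
        echo "First argument=$1"
        shift
        for file in "$@"; do      # "$@" keeps names with spaces intact
            printf "[%s]\n" "$file"
        done
    ' sh "output dir"
}

# Simulated block as find -print0 would emit it: two names, NUL-terminated.
printf './b c\0./f\0' | show_args
```

The name `./b c` arrives as a single argument. In the real pipeline the quoted parallel command would presumably become `xargs -0 ./multiple_files_possibly_with_spaces_as_args.sh \"output dir\"`, but that has not been tested here.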