I'm very pleased with how quickly GNU parallel splits multi-GB CSV database export files into manageable chunks. However, I'd like my output file names to be in the format some_table.csv.part_0000.csv, with the numbering starting at zero (the import tool requires this). Getting the zero padding ("0001") was a challenge, but I managed it with printf. I can't get the decrement to work, though.

My Command:

FILE=some_table; parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_$(printf "%04d"{#}).csv" :::: $FILE.csv

Arithmetic expansion such as $FILE.csv.part_$(({#}-1)).csv doesn't work because {#} confuses the inner subshell. Neither does PART=$(({#}-1)); cat > $FILE.csv.part_$PART.csv.

Any suggestions?

Excalibur
  • What do you want to do with those CSV files? In other words, why do you want to split at all? – Michael-O May 18 '16 at 22:43
  • These are flat file exports, which will be loaded into AWS RDS, using mysqlimport. By splitting them into chunks, the transaction sizes are much more reasonable, as well as easier to resume on error. This is the general idea: (http://nerds.airbnb.com/mysql-in-the-cloud-at-airbnb/) – Excalibur May 18 '16 at 22:59
  • Isn't this a job for `split`? – Michael Vehrs May 19 '16 at 05:37
  • agreed, use `split` (or `awk`) instead of `parallel`. – webb May 19 '16 at 06:04

1 Answer

Use the {= =} construct:

FILE=some_table;  parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_"'{=$_=sprintf("%04d",$job->seq()-1)=}'".csv" :::: $FILE.csv

If you are going to use it a lot then define your own replacement string by putting this into ~/.parallel/config:

--rpl '{0000#} $_=sprintf("%04d",$job->seq()-1)'

Then use {0000#}:

seq 11 | parallel echo {0000#}
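To make the arithmetic behind `sprintf("%04d",$job->seq()-1)` concrete, here is a minimal plain-shell sketch of the file name each chunk would get. It does not call parallel at all; `part_name` is a hypothetical helper introduced only for illustration, and the job numbers 1, 2 and 11 are arbitrary samples.

```shell
#!/bin/sh
# Plain-shell sketch of the chunk naming; it does not run GNU parallel.
# part_name is a hypothetical helper:
#   $1 = table name, $2 = 1-based job sequence number (what {#} would hold).
part_name() {
    # Subtract 1 so numbering starts at zero, then pad to four digits.
    printf '%s.csv.part_%04d.csv\n' "$1" "$(( $2 - 1 ))"
}

# Show the names for a few sample job numbers.
for seq in 1 2 11; do
    part_name some_table "$seq"
done
```

The first job maps to some_table.csv.part_0000.csv, which is the zero-based, four-digit form the import tool in the question expects.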

If you just want the numbers to be fixed width (and not necessarily 4 digits):

--rpl '{0#} $f="%0".int(1+log(total_jobs()-1)/log(10))."d";$_=sprintf($f,$job->seq()-1)'

Then use {0#}:

seq 11 | parallel echo {0#}
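The width calculation can be illustrated outside parallel as well. The sketch below is an assumption-laden stand-in, not the replacement string itself: it derives the width by counting the digits of the largest zero-based index (a string length) rather than using the logarithm, and `zero_based_names` is a hypothetical helper name.

```shell
#!/bin/sh
# Plain-shell sketch of fixed-width, zero-based numbering; no GNU parallel involved.
# zero_based_names is a hypothetical helper: $1 = total number of jobs.
zero_based_names() {
    max=$(( $1 - 1 ))      # largest zero-based index
    width=${#max}          # digits needed for it (string length, not a logarithm)
    seq=1
    while [ "$seq" -le "$1" ]; do
        # Pad every index to the same width.
        printf "%0${width}d\n" "$(( seq - 1 ))"
        seq=$(( seq + 1 ))
    done
}

# Eleven jobs: the indices run from 00 to 10.
zero_based_names 11
```

With 11 jobs the largest index is 10, so every number is padded to two digits; with 10 or fewer jobs a single digit suffices.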

On a different note: Why save it to files at all? Why not pass it directly to the database importer and use --retries/--retry-failed to retry failed chunks?

If you want the same for the job slot number:

parallel --rpl '{0000%} $_=sprintf("%04d",$job->slot())' echo {0000%} ::: {1..100}

You can also use a dynamic replacement string:

--rpl '{(0+?)%} $l=length $$1; $_=sprintf("%0${l}d",$job->slot())'
--rpl '{(0+?)#} $l=length $$1; $_=sprintf("%0${l}d",$job->seq())'

parallel echo {0%} ::: {1..100}
parallel echo {0#} ::: {1..100}
parallel echo {00%} ::: {1..100}
parallel echo {00#} ::: {1..100}
parallel echo {000%} ::: {1..100}
parallel echo {000#} ::: {1..100}

Since version 20210222 you can do:

parallel --plus echo {0%} ::: {1..100}
parallel --plus echo {0#} ::: {1..100}

which will automatically detect the needed leading zeros.

Ole Tange
  • Thank you so much @Ole Tange, this is brilliant. I suspected that {= =} was a possible solution, but I found the documentation on it difficult to use (and completely missed the section about `--rpl`). – Excalibur May 19 '16 at 20:30
  • @Excalibur Return the favour by rewriting the documentation so you would have found it easier to use. – Ole Tange May 19 '16 at 21:39
  • Thanks! It'd be nice if a future version of Parallel had the {0#}, {00#}, {000#}, … syntax built-in. – Geremia Aug 31 '20 at 18:51
  • @OleTange How can we accomplish the same thing for the job slot number? – Rafael Jan 21 '21 at 21:57
  • The `-n` option with a number > 1 produces strange results: `seq 1024 | parallel -n 64 echo {0000#}` produces: `0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065 0065`… – Geremia Mar 16 '21 at 21:34
  • `seq 1024 | parallel -n 64 echo {#}` correctly produces: `1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16` – Geremia Mar 16 '21 at 21:36
  • @Geremia No-repro on 20210322. If you can repro with 20210222 ask a new question. – Ole Tange Mar 16 '21 at 22:06
  • @OleTange I opened the question: "[GNU Parallel with sequence number `{#}` and `-n` option](https://stackoverflow.com/q/66665339/1429450)." – Geremia Mar 17 '21 at 00:19