
I want to encrypt and decrypt big files (think 20 million lines) of text. The encryption service I am using can only encrypt a maximum of 64 KB at a time. For the purposes of this question, assume we are stuck with this service.

My solution is to split the huge file into 64 KB chunks, encrypt all of them in parallel, and put the encrypted parts in a tar.gz. Each part is numbered part-xxx so that I can restore the original file. At decryption time I unzip, decrypt each part in parallel, and concatenate the results in order.
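Roughly, the encryption side of that pipeline looks like the sketch below. Note this is an illustration, not my real script: `enc` is a plain-copy stand-in for the real `gcloud kms encrypt` call, and the file names and sizes are made up for the demo.

```shell
# Sketch of the split -> parallel-encrypt -> tar.gz pipeline.
# Requires bash (export -f) and GNU split/find/xargs/tar.
set -euo pipefail

workdir=$(mktemp -d)
head -c 200000 /dev/zero >"$workdir/input"   # demo input; the real file is ~1-2GB

# Split into 64 KB chunks named part-000, part-001, ...
split -b 64k -d -a 3 "$workdir/input" "$workdir/part-"

enc() { cp "$1" "$1.enc"; }                  # stand-in for the gcloud kms call
export -f enc

# Encrypt up to 32 chunks in parallel; -print0/-0 keeps odd filenames safe.
find "$workdir" -name 'part-*' ! -name '*.enc' -print0 |
    xargs -0 -P 32 -I{} bash -c 'enc "$1"' _ {}

# Bundle the encrypted parts; tar -T - reads the file list from stdin,
# so no glob is ever expanded on the tar command line.
(cd "$workdir" && find . -name '*.enc' | tar -czf encrypted.tar.gz -T -)
```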

The fun part: when I run that last step on a big enough file, one of the following happens:

  1. The tmux session dies and I get logged out. No logs, no nothing.

  2. I get this:

/home/estergiadis/kms/decrypt.sh: line 45: /usr/bin/find: Argument list too long
/home/estergiadis/kms/decrypt.sh: line 46: /bin/rm: Argument list too long

I tried several solutions based on xargs with no luck. Here is the interesting code:

echo "Decrypting chunks in parallel."
# -1 -f in ls helped me go from scenario 1 to scenario 2 above. 
# Makes sense since I don't need sorting at this stage.
ls -1 -f part-* | xargs -I % -P 32 bash -c "gcloud kms decrypt --ciphertext-file % --plaintext-file ${OUTPUT}.%"

# Best case scenario, we die here
find $OUTPUT.part-* | xargs cat > $OUTPUT
rm $OUTPUT.part-*

Even more interesting: when find and rm report a problem, I can go to the temp folder containing all the parts, run the exact same commands myself, and everything works.

In case it matters, all of this takes place on a RAM-mounted filesystem. However, RAM cannot possibly be the issue: I am on a machine with 256 GB of RAM, the files involved take up 1-2 GB, and htop never shows more than 10% usage.

  • You have a machine with 256GB of RAM installed but send your unencrypted data out across the Internet in 64kB chunks to get it encrypted? – Mark Setchell Feb 20 '20 at 16:19
  • both encryption and decryption happen on the same machine with 256GB RAM. Nothing is sent across the internet. – pilu Feb 20 '20 at 16:21
  • Not sure I get the point of this, but your `find` command is wrong. You want `find . -name "$OUTPUT.part-*" ...` – Mark Setchell Feb 20 '20 at 16:37
  • ... and rather than `find ... | xargs...` you probably want `find ... -exec cat {} \; > somewhere` – Mark Setchell Feb 20 '20 at 16:41
  • The variants with and without -name return the same results for me. The xargs thing exists there precisely to deal with too many files, since cat by itself dies. – pilu Feb 20 '20 at 16:48
  • This is the time you switch to C and stop messing about with `bash` or `sh`. By the way, always include the language and / or runtime environment in your tags. – Maarten Bodewes Feb 20 '20 at 17:03
  • `xargs` has the same limits as `cat`, see `sysctl -a | grep -i argmax` – Mark Setchell Feb 20 '20 at 17:37
  • @MarkSetchell "xargs has the same limits as cat" is true, but why is that relevant here? Did you mean "find has the same limits [...]"? – jhnc Feb 21 '20 at 04:29
  • @jhnc You are right, it is not relevant as `xargs` is reading filenames via `stdin` not via parameters. I normally prefer and use the `-exec` option but OP appeared reluctant to take that route. – Mark Setchell Feb 21 '20 at 07:44

1 Answer


Your problem is with these:

ls -1 -f part-* | ...
find $OUTPUT.part-* | ...
rm $OUTPUT.part-*

If there are too many parts, the shell's filename expansion of the glob (part-*, etc.) produces a command line with too many arguments, exceeding the kernel's limit on combined argument-list and environment size (ARG_MAX).
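You can inspect that limit on most systems (an illustrative check, not part of your script):

```shell
# ARG_MAX bounds the combined byte size of argv + environment that
# execve() will accept for a single command.
getconf ARG_MAX
# When a glob expands past that bound, the shell reports something like:
#   bash: /bin/rm: Argument list too long
```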

find + xargs allows you to overcome this. You can replace any command that uses a glob to list files in the current directory with, for example:

find . -name GLOB -print -o ! -path . -prune | xargs CMD

The -o ! -path . -prune tells find to not descend into subdirectories. xargs ensures the generated commandlines do not exceed the maximum argument or line limits.
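To see that batching in action, here is a tiny illustration (not from your script) of xargs splitting one input stream across multiple invocations:

```shell
# -n 2 caps each invocation at two arguments, so three inputs
# become two separate `echo` runs.
printf 'a\nb\nc\n' | xargs -n 2 echo
# prints:
#   a b
#   c
```

With real file lists, xargs does the same thing automatically whenever the accumulated arguments would exceed the system limit.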

So for the three lines you could do:

globwrap(){
    glob="$1"
    shift

    find . -name "$glob" -print -o ! -path . -prune |\
    sed 's/^..//' |\
    xargs "$@" # defaults to echo if no command given
}

globwrap 'part-*' | ...
globwrap "$OUTPUT"'.part-*' | ...
globwrap "$OUTPUT"'.part-*' rm

Single-quotes prevent the shell expanding the glob we are passing to find.

sed strips out the ./ that would otherwise be prepended to each filename.

Note that the original ls and find are no longer needed in the first two cases.
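As a quick sanity check (illustrative, against a throwaway demo directory), globwrap lists only the matching files in the current directory and does not recurse:

```shell
# Define globwrap as above, then exercise it on a demo directory.
globwrap(){
    glob="$1"
    shift
    find . -name "$glob" -print -o ! -path . -prune |
    sed 's/^..//' |
    xargs "$@"   # defaults to echo if no command given
}

tmp=$(mktemp -d) && cd "$tmp"
touch part-001 part-002 part-003
mkdir sub && touch sub/part-999   # must NOT be picked up (no recursion)

globwrap 'part-*'   # echoes the three top-level part files; sub/part-999 is excluded
```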

jhnc