5

I have an md5sum checksum file containing many lines, and I want to use GNU parallel to speed up the verification. When md5sum -c is given no file argument, it reads the checksum lines from stdin. I tried this:

cat checksums.md5 | parallel md5sum -c {}

But I get this error:

md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory

How can I parallelize the md5sum checking?

Ken

2 Answers

10

Assuming checksums.md5 has the format:

d41d8cd98f00b204e9800998ecf8427e  My file name

Run:

cat checksums.md5 | parallel --pipe -N1 md5sum -c

If your files are small, use -N100 instead, which passes 100 lines to each md5sum job.

If that does not speed up your processing, make sure your disks are fast enough: md5sum can process 500 MB/s. iostat -dkx 1 can tell you whether your disks are the bottleneck.
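As a rough end-to-end sketch of the command above (the file names a.txt, b.txt and c.txt and the temporary directory are made up for illustration, and the parallel step is skipped if GNU parallel is not installed), something like this could be used to try it out safely. Redirecting with < is assumed to be equivalent to the cat pipeline here.

```shell
# Throwaway directory with three small hypothetical files.
workdir=$(mktemp -d)
cd "$workdir"
for name in a b c; do
    printf 'contents of %s\n' "$name" > "$name.txt"
done

# Build the checksum list in the usual "hash  filename" format.
md5sum a.txt b.txt c.txt > checksums.md5
line_count=$(wc -l < checksums.md5)

# Only run the parallel step if GNU parallel is actually installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -N1: one checksum line per md5sum job (use -N100 to batch 100 lines per job).
    parallel --pipe -N1 md5sum -c < checksums.md5
else
    echo "GNU parallel not found; skipping the parallel check" >&2
fi
```

With only three lines in the list, -N1 starts three md5sum jobs, while a larger -N value would put them all in one job.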

Ole Tange
  • Thanks, guys. I tried both --block and -N and used top to check CPU usage. --block uses only 1 CPU regardless of the value I give it (1M, 10M, 100M). -N1 uses many CPUs, -N10 uses only a few, and -N0 and -N100 use only 1. Not sure why, but I will use -N1 in the future. – Ken Dec 07 '15 at 02:42
  • The reason is that you have only a few files (i.e. checksums.md5 is far smaller than 1 MB). – Ole Tange Dec 07 '15 at 16:08
1

You need the --pipe option. In this mode parallel splits stdin into blocks and supplies each block to the command via stdin. See man parallel for details:

cat checksums.md5 | parallel --pipe md5sum -c -

By default the block size is 1 MB; it can be changed with the --block option.
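To see why the checksum lines have to arrive on stdin rather than as an argument, here is a minimal sketch using md5sum alone (testing.txt and the temporary directory are hypothetical stand-ins). Without --pipe, parallel substitutes each input line into the command line, which corresponds to the first call below.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file and its "hash  filename" checksum line.
printf 'hello\n' > testing.txt
line=$(md5sum testing.txt)

# Line passed as an argument: md5sum -c looks for a checksum FILE with that name and fails.
md5sum -c "$line" 2>/dev/null
arg_status=$?

# Same line fed on stdin: md5sum -c parses it as a checksum entry.
stdin_result=$(printf '%s\n' "$line" | md5sum -c -)
stdin_status=$?

echo "as argument: exit status $arg_status"
echo "via stdin:   exit status $stdin_status ($stdin_result)"
```

The first call fails with a "No such file or directory" error like the one in the question, while the second reports the file as OK.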

Andrey