parallel check md5 file

Question

I have an md5sum file containing many lines. I want to use GNU parallel to speed up the md5sum check. When md5sum is given no file argument, it reads the checksum lines from stdin. I tried this:

cat checksums.md5 | parallel md5sum -c {}

But getting this error:

md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory

How can I parallelize the md5sum check?

score 10 · Accepted Answer · answered Dec 05 '15 at 01:13

Assuming checksums.md5 has the format:

d41d8cd98f00b204e9800998ecf8427e  My file name

Run:

cat checksums.md5 | parallel --pipe -N1 md5sum -c

If your files are small, use -N100 so that each md5sum job checks 100 lines.

If that does not speed up your processing, make sure your disks are fast enough: md5sum can process 500 MB/s. iostat -dkx 1 can tell you whether your disks are the bottleneck.
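A minimal end-to-end sketch of the command above, assuming GNU parallel and coreutils md5sum are installed. The temporary directory, the sample file names, and the `verify_checksums` helper are made up for illustration; if GNU parallel is not found, the helper falls back to a plain serial `md5sum -c`.

```shell
#!/bin/sh
# Hypothetical demo: build a small checksum file, then verify it in parallel.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create three sample files (illustrative names; one contains a space).
printf 'alpha\n' > a.txt
printf 'beta\n'  > b.txt
printf 'gamma\n' > 'my file.txt'
md5sum a.txt b.txt 'my file.txt' > checksums.md5

# verify_checksums FILE [LINES_PER_JOB]
verify_checksums() {
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # --pipe feeds checksum lines to md5sum on stdin; -N sets lines per job.
        parallel --pipe -N"${2:-1}" md5sum -c < "$1"
    else
        # Fallback when GNU parallel is missing: ordinary serial check.
        md5sum -c "$1"
    fi
}

# Jobs may finish in any order, so sort the report before printing it.
result=$(verify_checksums checksums.md5 1 | sort)
printf '%s\n' "$result"

# Clean up the temporary directory.
cd / && rm -rf "$workdir"
```

Each sample file should be reported as OK. With many small files, passing a larger second argument (for example 100) gives each md5sum job more lines to check, which corresponds to -N100.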

Ole Tange

Thanks guys. I tried both --block and -N and used top to check CPU usage. --block uses only 1 CPU regardless of the value I give it (1M, 10M, 100M). -N1 used a lot of CPUs, -N10 used only a few, and -N0 and -N100 used only 1. Not sure why, but I will use -N1 in the future. – Ken Dec 07 '15 at 02:42

The reason is that you have only a few files (i.e. the size of checksums.md5 is far less than 1 MB). – Ole Tange Dec 07 '15 at 16:08

Andrey · Answer 2 · 2015-12-04T19:46:47.990

1

You need the --pipe option. In this mode, parallel splits stdin into blocks and feeds each block to the command on its stdin; see man parallel for details:

cat checksums.md5 | parallel --pipe md5sum -c -

The default block size is 1 MB; it can be changed with the --block option.
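The effect of the block size can be estimated with a little arithmetic: with --pipe, the number of jobs is roughly the input size divided by the block size, rounded up, so an input smaller than one block yields a single job. The sketch below assumes the 1 MB default means 1048576 bytes, and it ignores the fact that parallel splits on line boundaries, so treat the result as an estimate. `estimate_blocks` is a hypothetical helper, not part of parallel.

```shell
#!/bin/sh
# estimate_blocks SIZE_BYTES [BLOCK_BYTES]
# Rough number of --pipe blocks: ceil(size / block), never less than 1.
estimate_blocks() {
    size=$1
    block=${2:-1048576}   # assumed default block size: 1 MB as 1024*1024 bytes
    n=$(( (size + block - 1) / block ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

estimate_blocks 70000          # a 70000-byte checksum file with the default block
estimate_blocks 70000 10000    # the same file with a 10000-byte block
```

For a real file, the size can be taken from `wc -c < checksums.md5`. A result of 1 suggests that only one md5sum job would run, whatever the number of CPUs.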

edited Dec 04 '15 at 19:46

answered Dec 04 '15 at 06:56

Andrey

