
I'm currently using the Linux md5sum command in a bash script on a very lightweight (low-processor, low-memory) Linux device to compute and record the checksums of thousands of similarly-named 32MB files in a single directory.

    md5sum ./file* > fingerprint.txt

The next day, I repeat the process on the same set of files and programmatically compare the results from the prior day's hashes. When I find that the fingerprint of a file has changed between day1 and day2 I take action on that specific file. If the file remained unchanged I take no action and continue my comparison.
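
For reference, here is a minimal sketch of what that daily comparison might look like in the same bash script, assuming yesterday's hashes are kept in fingerprint.txt, the filenames contain no spaces, and take_action is a placeholder for whatever you do to a changed file:

    # Hash today's files, then use comm to list entries that are new or changed.
    md5sum ./file* > fingerprint.new

    sort fingerprint.txt > old.sorted      # yesterday's hashes
    sort fingerprint.new > new.sorted      # today's hashes

    # Lines present only in today's list are files whose checksum changed
    # (or files that did not exist yesterday).
    comm -13 old.sorted new.sorted | awk '{print $2}' |
    while IFS= read -r name; do
        take_action "$name"                # placeholder for the per-file handling
    done

    mv fingerprint.new fingerprint.txt     # roll today's hashes over for tomorrow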

The problem that I'm running into is that the md5 method takes a LONG time to process each file. The comparison needs to be completed within a certain time-frame and I'm starting to bump into incidents where the entire process simply takes too long.

Is there some other method/tool I could be using to reliably perform this kind of comparison? (Note: a date comparison on the files isn't adequate, and the file sizes remain a constant 32MB.)

Joe
  • How much of the files change? Will it be one byte in the middle of 32 megs, or is it likely that if any part of the file changes, most of the file will change? (e.g., could you check ten pages out of the file, and if all ten pages are the same, _skip the file_?) – sarnold Apr 02 '11 at 08:39
  • Could you modify your application to keep track of _which_ files are updated as they are updated, and only run your check on those files? or will all N-thousand files be updated daily? – sarnold Apr 02 '11 at 08:40
  • Do you really need md5sum? Can you just check the modification date of the file? – drysdam Apr 02 '11 at 09:03
  • Another thought: when do you get the files? All in one dump at a particular time of day? If they trickle in over time and you're currently waiting to start making MD5s until everything is there then you could pick up time by having your process wake up periodically and process the files that have arrived. – coffeetocode Apr 02 '11 at 10:16
  • @drysdam -- using cksum is actually faster by a factor of about 2x. I just wasn't certain whether I could trust the results produced by cksum. I need faultless comparisons. @sarnold and @coffeetocode-- The 32MB files are produced as a batch. It's actually a single 500GB file "split" into smaller 32MB parts. The source file only changes slightly from one day to another, but it's impossible to know what data will change or how much data will change. I break the huge file into smaller parts and take action on those parts that have changed. – Joe Apr 03 '11 at 00:27
  • Copy them via rsync, log which files are copied (or would be copied - use the dry run option) –  Feb 04 '12 at 02:59

2 Answers


MD5 is supposed to be fast among cryptographic hash functions. But any given implementation may make choices which, on a specific machine, result in suboptimal performance. What kind of hardware do you use? Processor type and L1 cache size are quite important.

You may want to have a look at sphlib: this is a library implementing many cryptographic hash functions, in C (optimized, but portable) and Java. The C code can be compiled with an additional "small footprint" flag which helps on small embedded platforms (mainly due to L1 cache size issues). Also, the code comes with an md5sum-like command-line utility and a speed benchmark tool.

Among the hash functions, MD4 is usually the fastest, but on some platforms Panama, Radiogatun[32] and Radiogatun[64] can achieve similar or better performance. You may also want to have a look at some of the SHA-3 candidates, in particular Shabal, which is quite fast on small 32-bit systems.
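
Before switching functions, it may also be worth simply timing whatever hash commands are already installed on your device against one of the 32MB parts. A rough sketch (file0001 is a placeholder name, and the MD4 digest is only available if your OpenSSL build includes it):

    # Run each command twice so the file sits in the page cache and you measure
    # the hash itself rather than the disk read.
    time md5sum ./file0001
    time sha1sum ./file0001
    time cksum ./file0001
    time openssl dgst -md4 ./file0001   # MD4, if this OpenSSL build provides it

    # OpenSSL's built-in benchmark gives in-memory throughput, independent of disk:
    openssl speed md5 sha1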

Important note: some hash functions are "broken", in that it is possible to create collisions: two distinct input files which hash to the same value (exactly what you want to avoid). MD4 and MD5 are thus "broken". However, a collision must be constructed on purpose; you will not hit one out of (bad) luck (the probability is smaller than that of a "collision" caused by a hardware error during the computation). If you are in a security-related situation (someone may want to actively provoke a collision) then things are more difficult. Among those I cite, the Radiogatun and Shabal functions are currently unbroken.

Thomas Pornin
  • Thanks. The device that's doing all this "heavy lifting" uses an ARM926EJ-S processor. If I'm reading /proc/cpuinfo correctly, the device has 16KB Instruction Cache and 16KB Data Cache. The system has 256MB of embedded RAM. I also learned that the standard HDD installed in the device has a rotational speed of 5400RPM and has other specs that make it a medium performer. I think what I'm learning from this is that if I want this process to move along faster, I'm going to need to spec faster hardware. Thanks also for your comments on other hash functions. I'll be looking into these. – Joe Apr 04 '11 at 11:44
  • With `sphlib` implementation of MD4, on a 75 MHz ARM9 core, one can hash about 11 MB/s worth of data (16 kB of instruction cache is ample enough, a MD4 core implementation fits in about 2 kB). This assumes that the data is in level-1 data cache, and that the architecture runs in little-endian mode (overhead for endian-swap would be about +30% running time, with MD4). Comparatively, MD5 is around 7.8 MB/s on the same system. This is using ARM instructions; with Thumb instructions, bandwidth falls to 7.3 and 5.3 MB/s, respectively. Also, cost of loading data in RAM then L1 cache may be high. – Thomas Pornin Apr 04 '11 at 12:24

Ways to speed it up:

  • If you have multiple cores you could run more than one md5sum process at a time (see the second sketch below). But I suspect that your problem is disk access, in which case this may not help.
  • Do you really need to do an MD5 hash? Check the modification date/time, size and inode instead of the hash for a quick check (see the first sketch below)
  • Consider performing the quick check daily, and the slow MD5 check weekly
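
A hedged sketch of the quick metadata check from the second bullet (stat -c is the GNU coreutils/busybox syntax, and quickprint.old is assumed to hold yesterday's snapshot):

    # Record name, mtime, size and inode for every part.
    stat -c '%n %Y %s %i' ./file* > quickprint.new

    # Any file whose mtime, size or inode changed shows up in the diff; those
    # are the candidates for action (or for a full md5sum pass).
    diff quickprint.old quickprint.new | awk '/^>/ {print $2}'

    mv quickprint.new quickprint.old   # roll today's snapshot over for tomorrow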

I suspect you don't really need to do an MD5 hash of every file every time, and you might be better off carefully considering your actual requirements, and what is the minimal solution which will meet them.
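
And in case the CPU rather than the disk does turn out to be the limit and the box has more than one core (first bullet), a sketch of hashing in parallel, assuming GNU xargs with its -P option; the output order is not stable across runs, so it is sorted before being written:

    # Run 4 md5sum processes at a time, 32 filenames per invocation; adjust -P
    # to the number of cores. Sorting by filename keeps day-to-day diffs simple.
    find . -maxdepth 1 -name 'file*' -print0 |
        xargs -0 -n 32 -P 4 md5sum |
        sort -k 2 > fingerprint.txt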

Ben
  • Unfortunately I really need to do a hash comparison of the files. Can you think of any pitfalls to using cksum for comparisons instead? – Joe Apr 02 '11 at 22:35
  • cksum will be no quicker since your problem is certainly disk access. The only way to be faster is buy faster disks or find a way to avoid doing hashes of every single file. – Ben Apr 02 '11 at 22:43
  • Well, actually cksum is faster by a factor of about 2x. I just wasn't sure if I could trust the results for this type of comparison. –  Apr 02 '11 at 23:30