
I'm building a node-webkit app that keeps a local directory in sync with a remote FTP server. To build the initial index when the app is run for the first time, I download an index file from the remote server containing a hash for every file, along with its folder path. I then run through this list and find matches in the user's local folder.

The total size of the remote/local folder can be over 10GB. As you can imagine, scanning 10GB worth of individual files can be pretty slow, especially on a normal HDD (not SSD).

Is there a way in Node to efficiently get a hash of a folder without looping through and hashing every individual file inside? That way, if the folder hash differs, I can choose whether or not to do the expensive individual file checking (which is what I do once I have a local index to compare against the remote one).
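For context, the expensive per-file check looks roughly like this. This is an illustrative sketch only: it assumes the remote index maps relative paths to content hashes, the hash algorithm (md5 here) is just an example, and hashFile/checkEntry are hypothetical names.

'use strict';

var crypto = require('crypto');
var fs = require('fs');
var path = require('path');

// Hash one local file's contents - this is the slow part over 10GB of files.
function hashFile(filePath, callback) {
    var hash = crypto.createHash('md5');
    var stream = fs.createReadStream(filePath);
    stream.on('data', function (chunk) { hash.update(chunk); });
    stream.on('error', callback);
    stream.on('end', function () { callback(null, hash.digest('hex')); });
}

// Compare one entry of the downloaded index against the local copy.
function checkEntry(localRoot, relativePath, remoteHash, callback) {
    hashFile(path.join(localRoot, relativePath), function (err, localHash) {
        if (err) return callback(null, {path: relativePath, status: 'missing'});
        callback(null, {path: relativePath, status: localHash === remoteHash ? 'same' : 'changed'});
    });
}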

Titan

1 Answer


You could iteratively walk the directories, stat each directory and each file it contains (without following links), and produce a hash from those stats. Here's an example:

'use strict';

// npm install siphash
var siphash = require('siphash');
// npm install walk
var walk = require('walk');

var key = siphash.string16_to_key('0123456789ABCDEF');
var walker  = walk.walk('/tmp', {followLinks: false});

walker.on('directories', directoryHandler);
walker.on('file', fileHandler);
walker.on('errors', errorsHandler); // plural
walker.on('end', endHandler);

var directories = {};
var directoryHashes = [];

function addRootDirectory(name, stats) {
    directories[name] = directories[name] || {
        fileStats: []
    };

    if(stats.file) directories[name].fileStats.push(stats.file);
    else if(stats.dir) directories[name].dirStats = stats.dir;
}

function directoryHandler(root, dirStatsArray, next) {
    addRootDirectory(root, {dir:dirStatsArray});
    next();
}

function fileHandler(root, fileStat, next) {
    addRootDirectory(root, {file:fileStat});
    next();
}

function errorsHandler(root, nodeStatsArray, next) {
    nodeStatsArray.forEach(function (n) {
        console.error('[ERROR] ' + n.name);
        console.error(n.error.message || (n.error.code + ': ' + n.error.path));
    });
    next();
}

function endHandler() {
    Object.keys(directories).forEach(function (dir) {
        // Hash the collected stats for this directory, not just its name.
        var hash = siphash.hash_hex(key, JSON.stringify(directories[dir]));
        directoryHashes.push({
            dir: dir,
            hash: hash
        });
    });

    console.log(directoryHashes);
}

You would of course want to turn this into some kind of command-line app that takes arguments, and to double-check that the files are returned in the same order every time (maybe sort the file stats by file name prior to hashing!) so that siphash returns the same hash for the same tree every time.

This is not tested code, just to provide an example of where I'd likely start with that sort of thing.
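To illustrate the sorting suggestion above, the hashing in endHandler could be swapped for something like the following (an untested sketch; it reuses the key and siphash from the example and assumes the walk module's file stat objects expose name, size and mtime):

// Build a deterministic hash for one directory entry from the example above.
function hashDirectory(key, name, entry) {
    var sortedStats = entry.fileStats
        .slice() // don't mutate the original array
        .sort(function (a, b) { return a.name < b.name ? -1 : a.name > b.name ? 1 : 0; })
        .map(function (s) { return {name: s.name, size: s.size, mtime: s.mtime}; });

    return siphash.hash_hex(key, JSON.stringify({dir: name, files: sortedStats}));
}

// In endHandler: var hash = hashDirectory(key, dir, directories[dir]);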

Edit: To reduce dependencies, you could use Node's built-in crypto lib (require('crypto')) instead of siphash, and walk/stat the directories and files yourself if you'd like, of course.
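That dependency-free variant might look something like this (again untested and synchronous for brevity; like the example above it hashes metadata only, not file contents):

'use strict';

var crypto = require('crypto');
var fs = require('fs');
var path = require('path');

function hashDirStats(root) {
    var hash = crypto.createHash('sha1');

    (function walkDir(current) {
        // Sort entries so the hash is stable across runs.
        fs.readdirSync(current).sort().forEach(function (entry) {
            var fullPath = path.join(current, entry);
            var stats = fs.lstatSync(fullPath); // lstat: don't follow links

            // Hash metadata only: relative path, size and modified time.
            hash.update(path.relative(root, fullPath));
            hash.update(String(stats.size));
            hash.update(String(stats.mtime.getTime()));

            if (stats.isDirectory()) walkDir(fullPath);
        });
    })(root);

    return hash.digest('hex');
}

console.log(hashDirStats('/tmp'));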

Matt Mullens
  • Thanks, I'm actually already using crypto to get the hash! This method doesn't actually check the content though, does it? So if one character in a file was different to the remote copy, this method wouldn't detect that? – Titan Jun 25 '15 at 18:28
  • Right it would not check content, but does stat the file which should include modified timestamp, so it'd be a quick check you may be able to do perhaps much more frequently. Hashing the file content is probably what's slowing things down significantly which is why I was thinking that if I were doing this I'd probably start off with a quick check like this, and then perhaps less regularly do a more full check. – Matt Mullens Jun 25 '15 at 18:58
  • Actually, I'd probably want to maintain remote and local last hash results in a persistent store for these files, so that if the file were updated on either client or server, it could compare the last hash in the persistent store with the current one - maybe a good use case for redis. I wouldn't compare the stats on the server and the client directly because of course many things will be different between these file stats - nodes and whatnot. – Matt Mullens Jun 25 '15 at 19:06
  • However, comparing against last known hash in an independent persistent store could also offer some speed advantages and also not require frequent connection between the client and server to be established. The server could independently check its directory and file contents while the client did the same, and if a difference in either is detected a flag could be set on the persistent store for this file such that the client could then make that more expensive connection to the server to obtain the latest file (or vice versa). – Matt Mullens Jun 25 '15 at 19:08
  • Comparing 2 hash indexes is exactly what I do once I've done the first full scan, it's very fast naturally. It's just the first full scan I need to speed up before I have a local index of hashes. Timestamp wouldn't be much use unfortunately as the client could have added the files long before the server. – Titan Jun 25 '15 at 20:27
  • If the hash exists for a file on the client, and a hash does not exist for a file on the server, based on the same relative file path from some specified root in each, then it can be assumed client added a file that the server doesn't have, in which case a connection from client to server is established and file is copied up to server. Then on the server, the hash of the stats of this file are taken and stored elsewhere. At some point down the road perhaps the server file changes, this sync process running on the server notices that the new stats hash and the old one are different. – Matt Mullens Jun 26 '15 at 13:13
  • So it updates the hash in the persistent store but also sets a dirty flag in the persistent store associated with the given relative file path. Then the client process notices that dirty flag set on that file, establishes connection, pulls down file, and the process can continue. I'm kind of thinking like a structure that looks like: `{file: 'a/b/c', serverHash: '123', clientHash: 'abc', serverFlag: false, clientFlag: false }` where serverHash and clientHash are updated by server and client respectively (so that each can independently check last hash), and dirty flags to communicate change (see the sketch after these comments). – Matt Mullens Jun 26 '15 at 13:17
  • And so when you have something that has: `{file: 'a/b/c', clientHash: 'abc', clientFlag: true}` or something like that, it's a new file from the client so the serverHash and serverFlag don't exist (server hasn't had a chance to add those pieces to the structure in the persistent store). And, client added a file, so it didn't have a previous clientHash to compare to, so it sets clientFlag to true. – Matt Mullens Jun 26 '15 at 13:23
  • Yeah, I don't have any issues syncing files from the server to the client (it's not both ways btw). And I've no problem with speed once I've got the hash index from the server, scanned every one of those files on the client and got (and saved) their hashes to work out the initial differences, and stored a local hash index (so all future comparisons are rapid). It's just the initial client hashing speed I'm struggling with when scanning 10+ GB of files. I hoped there would be a magic Windows API I could use that might already have a hash for the file without me reading all its content the first time. – Titan Jun 26 '15 at 23:52
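A rough sketch of the decision logic described in these comments, using the hypothetical record shape `{file, serverHash, clientHash, serverFlag, clientFlag}` from the persistent store (shown for the bidirectional case the comments discuss; Titan's app only syncs server to client, but the idea is the same):

// Decide what to do for one file record from the persistent store.
function decideSyncAction(record) {
    if (record.clientHash && !record.serverHash) return 'upload';   // new on client
    if (record.serverHash && !record.clientHash) return 'download'; // new on server
    if (record.serverFlag) return 'download';                       // server content changed
    if (record.clientFlag) return 'upload';                         // client content changed
    return 'none';                                                  // already in sync
}

console.log(decideSyncAction({file: 'a/b/c', clientHash: 'abc', clientFlag: true})); // 'upload'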