0

We have one rather large table containing document info together with file paths pointing to files on the file system. After a couple of years we noticed that we have files on disk which are not referenced in the DB table, and vice versa.

Since I'm currently learning Clojure, I thought it would be nice to make a small utility which can find the diff between the DB and the file system. Naturally, since I'm a beginner, I got stuck: there are more than 600,000 documents, so obviously I need a more performant and less memory-consuming solution :)

My first idea was to generate a flattened list of the filesystem tree with all files and compare it with the list from the DB: if a file from the DB doesn't exist on disk, put it in a separate "non-existing" list, and if a file exists on the HDD but not in the DB, move it to some dump directory.
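To make it concrete, this is roughly the naive version I had in mind (a sketch only; `db-paths` is assumed to be a set of path strings already loaded from the DB, and loading it is not shown), and it's exactly this all-in-memory approach that chokes on 600k entries:

(require '[clojure.set :as set])

;; db-paths: a set of file-path strings loaded from the DB (assumed, not shown here)
(defn diff-db-vs-fs [db-paths root]
  (let [fs-paths (->> (file-seq (java.io.File. root))
                      (remove (memfn isDirectory))
                      (map #(.getCanonicalPath ^java.io.File %))
                      set)]
    {:in-db-not-on-disk (set/difference db-paths fs-paths)
     :on-disk-not-in-db (set/difference fs-paths db-paths)}))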

Any ideas?

zarko.susnjar
  • 2,053
  • 2
  • 17
  • 35
  • How is the solution "memory consuming"? I mean, there's really only one option: check the filesystem against the DB, and vice-versa. If memory is a problem, just split the problem up into chunks, and proceed accordingly. – Dave Newton Sep 20 '11 at 10:54
  • In a way that the 600k-line list of documents creates a 60MB file, and the same for the filesystem side; when I try to run an intersection function over those two maps the process hangs. – zarko.susnjar Sep 20 '11 at 10:59
  • That's why I suggested breaking the problem up into chunks. – Dave Newton Sep 20 '11 at 11:01
  • Well, that's how I'd do it in Java, using BufferedReaders and letting it run for ages :) – zarko.susnjar Sep 20 '11 at 11:07
  • I hoped there was some pattern for making this kind of diff in Clojure that uses a more functional approach, from which I could learn more. – zarko.susnjar Sep 20 '11 at 11:09

2 Answers

1

As a sketch, here's how you could check the filesystem against the database, in chunks of whatever size you're happy with:

(->> (file-seq (java.io.File. "/"))   ; lazy, recursive walk of the directory tree
     (remove (memfn isDirectory))     ; keep only the files
     (partition 20)                   ; chunk them into groups of 20
     (map (fn [files] (printf "Checking %d files against db...\n" (count files))))
     (take 2))                        ; realize just the first two chunks for this demo

(Checking 20 files against db...
Checking 20 files against db...
nil nil)

Instead of using printf, run whatever database check you need against each chunk of files.
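For example, a minimal sketch of such a check, assuming clojure.java.jdbc and a hypothetical `documents` table with a `filepath` column (the db-spec and all names are placeholders for your actual schema):

(require '[clojure.java.jdbc :as jdbc]
         '[clojure.string :as str])

;; Hypothetical connection details; replace with your real ones.
(def db-spec {:dbtype "postgresql" :dbname "documents_db"
              :user "app" :password "secret"})

(defn missing-in-db
  "Returns the paths in this chunk of files that have no row in documents.filepath."
  [files]
  (let [paths        (map #(.getAbsolutePath ^java.io.File %) files)
        placeholders (str/join ", " (repeat (count paths) "?"))
        rows         (jdbc/query db-spec
                                 (into [(str "SELECT filepath FROM documents"
                                             " WHERE filepath IN (" placeholders ")")]
                                       paths))
        known        (set (map :filepath rows))]
    (remove known paths)))

With something like that in place, the (map ...) step above becomes (mapcat missing-in-db), and the whole pipeline yields a lazy seq of paths that exist on disk but not in the DB.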

amalloy
  • 89,153
  • 8
  • 140
  • 205
0

I would suggest one of three options depending on your preference for performance vs. memory:

  1. Memory intensive: Use a recursive method calling File.listFiles to put all the files into a list. Then compare the list against your DB.

  2. IO intensive solution: Recursively check each file one at a time against the DB.

  3. Intermediate solution: read all the files in one directory and compare them against the DB, then recurse into any sub-directories and repeat (a sketch follows below). This makes the same number of IO calls as option 1 but only holds one branch plus one directory's worth of file paths in memory at any one time.
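A rough Clojure sketch of that third option (check-files-against-db here is a hypothetical stand-in for whatever DB lookup you end up using):

;; Check one directory's files against the DB, then recurse into its sub-dirs.
(defn check-dir [check-files-against-db ^java.io.File dir]
  (let [entries (.listFiles dir)                                   ; one dir's worth of IO
        {dirs true, files false} (group-by #(.isDirectory ^java.io.File %) entries)]
    (check-files-against-db files)                                 ; compare this dir's files
    (doseq [d dirs]                                                ; then descend branch by branch
      (check-dir check-files-against-db d))))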

John B
  • 32,493
  • 6
  • 77
  • 98
  • `file-seq` already does the recursive `File.listFiles` stuff. And it's lazy, so you can use the list for a memory-heavy approach (force the whole seq, check db), or an IO-heavy approach (map `check-in-db` over each element lazily). Or you can split it into chunks of, say, 100 files for each db call. – amalloy Sep 20 '11 at 20:57
  • Is **file-seq** a Clojure thing? Is it available in Java? I know that for Java, Apache Commons FileUtils will do recursive listing but doesn't allow for control to prevent memory overuse. – John B Sep 20 '11 at 21:04
  • 1
    Yes; No. Clojure has pervasive lazy sequences, making it easy to work with large (even infinite) sequences "as if" they were entirely in memory at once. In Java, you have to fake that sort of thing with the Visitor pattern or whatever. See my answer for an example of how you might do this in Clojure. – amalloy Sep 20 '11 at 21:44