1

How can I modify this code to cope with larger files (2 GB)? In Java I would use a small buffer and call update(); how do I do the same in Clojure?

(defn md5 [io-factory]
      (let [bytes'
            (with-open [xin (clojure.java.io/input-stream io-factory)
                        xout (java.io.ByteArrayOutputStream.)]
              (clojure.java.io/copy xin xout)
              (.toByteArray xout))
            algorithm (java.security.MessageDigest/getInstance "MD5")
            raw (.digest algorithm bytes')]
        (format "%032x" (BigInteger. 1 raw))))

; Execution error (OutOfMemoryError) at java.util.Arrays/copyOf (Arrays.java:3236).
; Java heap space

Thank you for your answers.

ivitek
  • The code relies on Java's standard library for hashing, so whichever technique works "in Java" would work in Clojure. Did something go wrong when you tried the small buffer / update approach? – Biped Phill Mar 07 '21 at 16:23
  • No. As a newbie to Clojure who is familiar with Python, I couldn't find a way to write it in Clojure. Clojure is so different from what I know... – ivitek Mar 08 '21 at 16:11

2 Answers

4

You can use a DigestInputStream to calculate a hash without holding all the bytes in memory at once, because it updates the hash incrementally as you consume bytes from the source stream.


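(require '[clojure.java.io :as io])
(import '(java.security DigestInputStream MessageDigest))

;; Copies source to sink while hashing the bytes as they pass through.
(defn copy+md5 [source sink]
  (let [digest (MessageDigest/getInstance "MD5")]
    (with-open [input-stream  (io/input-stream source)
                digest-stream (DigestInputStream. input-stream digest)
                output-stream (io/output-stream sink)]
      (io/copy digest-stream output-stream))
    ;; Read the digest only after the copy has consumed the whole stream.
    (format "%032x" (BigInteger. 1 (.digest digest)))))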

If you only need the hash and do nothing else with the contents of the source, you can pass the /dev/null equivalent, (OutputStream/nullOutputStream) (available since Java 11), as the sink.
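As a rough sketch of that hash-only variant, the function below streams the source into a discarding sink. The name md5-hex is a hypothetical helper introduced here for illustration, and the sketch assumes Java 11 or later for OutputStream/nullOutputStream.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io OutputStream)
        '(java.security DigestInputStream MessageDigest))

(defn md5-hex
  "Hypothetical helper: streams `source` through a DigestInputStream into a
  discarding sink and returns the MD5 as 32 lowercase hex digits."
  [source]
  (let [digest (MessageDigest/getInstance "MD5")]
    (with-open [in (DigestInputStream. (io/input-stream source) digest)]
      ;; io/copy reads in chunks; every chunk updates the digest as it is read,
      ;; and the null sink throws the bytes away (assumes Java 11+).
      (io/copy in (OutputStream/nullOutputStream)))
    (format "%032x" (BigInteger. 1 (.digest digest)))))

;; A byte array stands in for a real file here; any io/input-stream source works.
(md5-hex (.getBytes "hello" "UTF-8"))
;; => "5d41402abc4b2a76b9719d911017c592"
```

Because only one chunk is in memory at a time, heap use should stay flat regardless of the size of the source.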

RutledgePaulV
2

clj-digest uses a small buffer to calculate MD5 and other message digests.
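For comparison, the small-buffer technique itself can be written with plain Java interop, much like the update() loop the question mentions. This is only a sketch and not clj-digest's actual source or API: md5-buffered is a hypothetical name, and the 8192-byte buffer size is an arbitrary assumption.

```clojure
(require '[clojure.java.io :as io])
(import '(java.security MessageDigest))

(defn md5-buffered
  "Hypothetical helper (not part of clj-digest): hashes `source` by reading it
  into a small reusable buffer and feeding each chunk to MessageDigest.update."
  [source]
  (let [digest (MessageDigest/getInstance "MD5")
        buffer (byte-array 8192)]            ; arbitrary buffer size
    (with-open [in (io/input-stream source)]
      (loop []
        (let [n (.read in buffer)]           ; number of bytes read, -1 at end of stream
          (when-not (neg? n)
            (.update digest buffer 0 n)      ; hash only the bytes actually read
            (recur)))))
    (format "%032x" (BigInteger. 1 (.digest digest)))))

;; A byte array stands in for a large file; pass a java.io.File or path in practice.
(md5-buffered (.getBytes "hello" "UTF-8"))
;; => "5d41402abc4b2a76b9719d911017c592"
```

The buffer is reused on every iteration, so memory use is bounded by the buffer size rather than the file size.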

Steffan Westcott