
I want to parse a big JSON file (3 GB) and return a hash map for each line in the file. My intuition was to use a transducer to process the file line by line and construct a vector from a few selected fields (> 5% of the bytes in the file).

However, the following code throws an OutOfMemory exception:

file.json

{"experiments": {"results": ...}}
{"experiments": {"results": ...}}
{"experiments": {"results": ...}}

parser.clj

(defn load-with!
  "Load a file using a parser, a structure and a transducer."
  [parser structure xform path]
  (with-open [r (clojure.java.io/reader path)]
    (into structure xform (parser r))))

(def xf (map #(get-in % ["experiments" "results"])))
(def parser (comp (partial map cheshire.core/parse-string) line-seq))

(load-with! parser (vector) xf "file.json")

When I visualize the process with JVisualVM, the heap grows over time and exceeds 25 GB before the process crashes.

Are transducers appropriate in this case? Is there a better alternative?

One of my requirements is to return the new structure at the end of the function. Thus, I cannot use doseq to process the file in place.

Moreover, I need to change the parser and transducer according to the file format.
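For example, only the parser and transducer would change between formats. Here is a minimal sketch of a CSV variant, assuming clojure.data.csv is on the classpath (csv-parser and csv-xf are just illustrative names):

(require '[clojure.data.csv :as csv])

;; Hypothetical CSV variant: load-with! stays the same,
;; only the parser and the transducer are swapped.
(def csv-parser csv/read-csv)   ; reader -> lazy seq of row vectors
(def csv-xf (map first))        ; e.g. keep only the first column

(load-with! csv-parser [] csv-xf "file.csv")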

Thank you!

  • I don't fully understand your code. What is the role of parser? It seems to be passed but unused. Also, the expression `(r)` is probably not what you want; it calls the reader as a function. – Michiel Borkent Oct 22 '16 at 21:23
  • I don't see why transducers would help. Transducers are useful when you have a series of operations you want to perform on the data; the transducer allows you to avoid creating intermediate data structures that will be thrown away. This code does only one thing--it maps `get-in`. Note `into` is non-lazy. Could you process the file lazily? Using `for`, `map`, or the `sequence` transducer function, could you create a lazy sequence of map entries? If they are properly handled, you could process each one without keeping all of the file contents in memory (see the sketch after these comments). – Mars Oct 22 '16 at 22:34
  • The goal of the parser/transducer is to easily adapt the work according to the file format (e.g. JSON, CSV, ...) and the vendor format within the file. – Oct 23 '16 at 19:25
  • Could you give some more specifics on the data in the JSON file, e.g, number of lines and the size of each line? Or, even better, upload a representative version of the file somewhere so that we can reproduce the problem exactly? I tried your code on a very small file and that worked fine, but I was expecting it to break since getting 25G of memory usage from a 3G file suggests some sort of infinite loop or something. – Robert Johnson Dec 19 '16 at 09:14
  • @Mars Yes, in this particular case the xform isn't doing much. However for a different file you might wish to apply some filtering as well as some get-in operation, in which case having the load-with! fn accept an xform is definitely useful. As for processing the file lazily, as far as I can tell that *should* be the case since line-seq is lazy and so is map, but the OOM error obviously suggests that something is going wrong somewhere. Of course `into` is non-lazy but `load-with!` *must* return something non-lazy, and I think the point is that the extracted data is expected to fit into memory. – Robert Johnson Dec 19 '16 at 09:58
  • @RobertJohnson, thanks for that elaboration. Everything you wrote makes sense to me. Freaxmind, I agree that transducers are intended to provide a flexible interface to numerous sorts of things that provide something like stream of data, but lots of methods can do that. I think of transducers as more specific, as indicated by other comments above. – Mars Dec 19 '16 at 17:11
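A minimal sketch of the lazy approach Mars describes above, assuming cheshire is on the classpath; each entry is printed and then becomes garbage, so the whole file never sits on the heap:

(require '[cheshire.core :as cheshire])

(with-open [r (clojure.java.io/reader "file.json")]
  ;; sequence applies the transducer incrementally; doseq consumes the
  ;; result one entry at a time without accumulating anything.
  (doseq [result (sequence (map #(get-in (cheshire/parse-string %)
                                         ["experiments" "results"]))
                           (line-seq r))]
    (println result)))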

1 Answer


You're pretty close. I don't know what json/parse-string does, but if it's the same as json/read-str from clojure.data.json, then this code should be what you are trying to do up there.

It looks like you were going for something like this:

(require '[clojure.data.json :as json])
(require '[clojure.java.io :as java])

(defn load-with!
  "Load a file using a parser, a structure and a transducer."
  [parser structure xform path]
  (with-open [r (java/reader path)]
    ;; here xform is an ordinary seq function (partial map ...), not a
    ;; transducer, so it is applied to the lazy seq the parser returns
    (into structure (xform (parser r)))))

(def xf (partial map #(get-in % ["experiments" "results"])))

(def parser (comp (partial map json/read-str) line-seq))


(load-with! parser [] xf "file.json")

I'm guessing these were just mistakes made when cutting the business details out of your minimal example here. Using the code below, I was able to process a large file for which the code above gave me an OOM error:

(require '[clojure.data.json :as json])
(require '[clojure.java.io :as java])

(def structure (atom []))  ; accumulate results outside the lazy seq

(defn do-it! [xform path]
  (with-open [r (java/reader path)]
    ;; doseq consumes the lazy line seq eagerly without retaining its head
    (doseq [line (line-seq r)]
      (swap! structure conj (xform line)))))

(defn xf [line]
  (-> (json/read-str line)
      (get-in ["experiments" "results"])))

(do-it! xf "file.json")

(take 10 @structure)
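For comparison, here is a sketch of the fully transducer-based version the question was aiming for (load-xf! is just an illustrative name): parsing moves into the transducer, so into reduces the lines eagerly and only the accumulated results should stay on the heap, assuming they fit in memory:

(defn load-xf!
  "Reduce the lines of a file through a transducer into structure."
  [structure xform path]
  (with-open [r (java/reader path)]
    (into structure xform (line-seq r))))

(load-xf! []
          (map #(-> (json/read-str %)
                    (get-in ["experiments" "results"])))
          "file.json")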
– Brandon Henry
  • Thank you for your proposal. Is it necessary to have a global variable? What is the difference compared to the solution with (into ...)? – Oct 23 '16 at 19:26
  • The first bit of code will work if you have enough memory. I think the atom is necessary with doseq. I ran out of time to research this, so my answer was only a minor improvement. – Brandon Henry Oct 24 '16 at 12:17
  • Would be nice if anyone commented on why the initial code isn't really stream-processing in constant memory as expected (and why the suggested code is). – matanster Mar 12 '18 at 19:49