I want to parse a big JSON file (3 GB) and return a hash-map for each line in the file. My intuition was to use a transducer to process the file line by line and build a vector containing only a few selected fields (> 5% of the bytes in the file).
However, the following code throws an OutOfMemoryError:
file.json:

```json
{"experiments": {"results": ...}}
{"experiments": {"results": ...}}
{"experiments": {"results": ...}}
```
parser.clj:

```clojure
(defn load-with!
  "Load a file using a parser, a structure, and a transducer."
  [parser structure xform path]
  (with-open [r (clojure.java.io/reader path)]
    (into structure xform (parser r))))

(def xf (map #(get-in % ["experiments" "results"])))
(def parser (comp (partial map cheshire.core/parse-string) line-seq))

(load-with! parser [] xf "file.json")
```
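To make the pipeline concrete, here is what each stage produces on a single sample line (illustrative only: the literal value 1 stands in for the elided "results" content):

```clojure
;; Illustrative only: the value 1 stands in for the elided "results" content.
(def sample-line "{\"experiments\": {\"results\": 1}}")

(cheshire.core/parse-string sample-line)
;; => {"experiments" {"results" 1}}

(into [] xf [(cheshire.core/parse-string sample-line)])
;; => [1]
```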
When I monitor the process with JVisualVM, the heap grows over time and exceeds 25 GB before the process crashes.
Are transducers appropriate in this case? Is there a better alternative?
One of my requirements is to return the new structure at the end of the function, so I cannot use doseq to process the file in place.
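For contrast, a side-effecting variant along the lines of the sketch below (with a hypothetical handle-result! callback) would stream the file without accumulating anything, but it returns nil, which is exactly what rules it out for me:

```clojure
;; Hypothetical doseq-based variant: streams the file but returns nil,
;; so it cannot satisfy the requirement of returning the new structure.
(defn process-with! [parser xform path handle-result!]
  (with-open [r (clojure.java.io/reader path)]
    (doseq [result (sequence xform (parser r))]
      (handle-result! result))))
```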
Moreover, I need to be able to swap the parser and the transducer according to the file format.
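For example (a sketch of the kind of reuse I have in mind, with clojure.edn standing in for a second file format), I would like to be able to write:

```clojure
;; Hypothetical second format: EDN lines instead of JSON.
(def edn-parser (comp (partial map clojure.edn/read-string) line-seq))
(def edn-xf (map #(get-in % [:experiments :results])))

(load-with! edn-parser [] edn-xf "file.edn")
```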
Thank you!