While I don't know exactly what's causing OOM, I'd like to provide some general suggestions and elaborate on our discussion in the comments.
So the sequence will be kept in memory when I use any sort of loop,
but will not if I call run-processing directly? But in doseq docstring
it's clearly stated that "Does not retain the head of the sequence".
Then what should I do when I need to call run-processing several times
(e.g. with different arguments)?
So here's our function:
(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (-> (xml/parse source)
        ((fn [x]
           (doseq [i [0]]
             (run-processing conn config x)))))))
Where x is a lazy-seq (if you're using data.xml) like:
x <- xml iterator <- file stream
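If run-processing is doing everything right (it fully consumes x and returns nil), there's nothing wrong with it; the problem is the x binding itself. While run-processing runs, it fully realizes the sequence that x is the head of, and as long as x stays reachable, the realized elements cannot be garbage-collected. The same thing is easier to see in a version that uses x twice: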
(defn process-xml! [conn config x]
  (run-processing conn config x)
  ;; X IS FULLY REALIZED IN MEMORY
  (run-reporting conn config x))

(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (process-xml! conn config))))
As you can see, we're not consuming the file item by item and immediately throwing each item away, all thanks to x. doseq has nothing to do with this: it "does not retain the head of the sequence" it consumes, which is [0] in our case.
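To make the "fully realized in memory" part observable, here is a small sketch. counting-seq and two-passes are hypothetical stand-ins (they are not part of your code) for the parsed XML and for process-xml!; the counter records how many elements were ever computed.

```clojure
(defn counting-seq
  "Hypothetical helper: an unchunked lazy seq of 1..n that increments
  `counter` every time an element is actually computed."
  [counter n]
  (letfn [(step [i]
            (lazy-seq
              (when (<= i n)
                (swap! counter inc)
                (cons i (step (inc i))))))]
    (step 1)))

(defn two-passes
  "Stand-in for process-xml!: walks the same seq twice through the binding x."
  [x]
  (let [n     (count x)       ; first pass computes every element
        total (reduce + x)]   ; second pass reuses the cached elements
    [n total]))

(def computed (atom 0))
(def result (two-passes (counting-seq computed 1000)))
```

The second pass computes nothing new, which is only possible because every element is still reachable through x after the first pass.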
This approach is not very idiomatic for two reasons:
1. run-processing is doing too much
It knows where the data is coming from, in what shape, how to process it and what to do with the data. A more typical process-file! would look like this:
(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (find-item-nodes)
         (map node->item)
         (run! (partial process-item! conn config)))))
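The same shape can be tried without an XML file. In this sketch, lines of text stand in for item nodes, and process-lines! and its callback are hypothetical names; the only point is that the lazy seq is created and fully consumed inside with-open, without ever being bound to a name.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn process-lines!
  "Hypothetical analogue of process-file!: streams items one by one
  into `process-item!` and returns nil."
  [source process-item!]
  (with-open [rdr (io/reader source)]
    (->> (line-seq rdr)        ; lazy, like the parsed XML
         (remove str/blank?)   ; stands in for find-item-nodes
         (map str/trim)        ; stands in for node->item
         (run! process-item!))))

;; Usage: collect the items only to show what the callback received.
(def seen (atom []))
(def ret (process-lines! (java.io.StringReader. "a\n\n b \nc")
                         #(swap! seen conj %)))
```

Each item becomes garbage as soon as the callback returns, because nothing holds on to the head of the pipeline.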
This is not always viable and doesn't fit every use case, but there's one more reason to do it this way.
2. process-file! should (ideally) never give x to anyone
This one is immediately obvious from looking at your original code: it's using with-open, so x is only valid while the stream is open. query from clojure.java.jdbc is a good example. It gets a ResultSet, maps it to pure Clojure data structures, and forces it to be fully read (with a result-set-fn of doall) to free the connection.
Notice how it never leaks the ResultSet, and the only way to get at the result seq is result-set-fn, which is a "callback": query wants to control the ResultSet lifecycle and make sure it's closed once query returns. Otherwise it's too easy to make a similar mistake. (But we still can, if we pass it a function similar to process-xml! as result-set-fn.)
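A minimal sketch of the difference, using a reader over a string instead of a ResultSet. leaky-lines and with-lines are hypothetical names made up for this illustration; they are not clojure.java.jdbc functions.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.StringReader)

(defn leaky-lines
  "Anti-pattern: returns a lazy seq that outlives its reader."
  [source]
  (with-open [rdr (io/reader source)]
    (line-seq rdr)))

(defn with-lines
  "Callback style: f has to finish its work before the reader is closed."
  [source f]
  (with-open [rdr (io/reader source)]
    (f (line-seq rdr))))

;; Realizing the leaked seq after with-open has returned fails,
;; because the remaining lines have to be read from a closed reader.
(def leaked-failed?
  (try
    (doall (leaky-lines (StringReader. "a\nb\nc")))
    false
    (catch Exception _ true)))

;; The callback version sees the whole seq while the reader is still open.
(def line-count (with-lines (StringReader. "a\nb\nc") count))
```

The callback version is only safe as long as f itself does not return something lazy or stash the seq somewhere, which is exactly the loophole mentioned above.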
Answers to comments
As I've said, I can't tell you exactly what's causing OOM. It could be:
- run-processing itself. The JVM is low on memory anyway and adding a simple doseq causes OOM. That's why I suggested slightly increasing heap size as a test.
- Clojure optimizes the x binding away. (fn [x] (run-processing conn config x)) is simply inlined by the JVM, subsequently fixing the issue with the x binding.
So why does wrapping run-processing in doseq makes x retain head? In
my examples I don't use x more than once (contrary to your
"run-processing x THEN run-reporting on SAME x").
The root of the problem is not the reuse of x; it's the sole fact of x existing. Let's make a simple lazy-seq:
(let [x (range 1 1e6)])
(Let's forget that range is implemented as a Java class.)
What is x? x is a lazy seq head, which is a recipe for constructing the next value.
x = (recipe)
Let's advance it:
(let [x (range 1 1e6)
      y (drop 5 x)
      z (first y)])
Here are x, y and z now:
x = (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (recipe)
y = (6) -> (recipe)
z = 6
Hope you can see now what I mean by saying "x is a seq head and run-processing realizes it".
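If you want to check the diagram at the REPL, here is a rough sketch. It swaps range for a map over iterate (my substitution, so that elements are computed one at a time rather than in chunks), and the computed counter is just instrumentation.

```clojure
(def computed (atom 0))

;; x: an unchunked lazy seq 1, 2, 3, ... that counts each element it computes.
(def x (map (fn [i] (swap! computed inc) i)
            (iterate inc 1)))

(def before (realized? x))   ; nothing has been computed yet
(def y (drop 5 x))           ; still lazy
(def z (first y))            ; forces elements 1 through 6
(def after-z @computed)
(def again (first x))        ; served from the cells x already holds
(def after-again @computed)
```

Asking x for its first element again computes nothing: the six realized cells are still chained off x, which is what "retaining the head" means.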
About "process-file! should (ideally) never give x to anyone" -
correct me if I'm wrong, but doesn't mapping to pure Clojure data
structures with doall makes them reside in memory, which would be bad
if the file is too big (as in my case)?
process-file! doesn't use doall. run! is a reduce and returns nil, so it neither builds nor returns a realized collection.
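A quick side-by-side of the two, as a small sketch (the seen counter is only there to show that run! really visited every element):

```clojure
(def seen (atom 0))

;; run! walks the seq for side effects and hands back nil.
(def run-result (run! (fn [_] (swap! seen inc)) (range 5)))

;; doall walks the seq too, but returns it fully realized.
(def doall-result (doall (map inc (range 5))))
```

With run! there is no result to hold on to, so a large file can be streamed through it; with doall the caller receives every element at once.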