6

Clojure beginner/intermediate here,

I have a large XML file (~240 MB) that I need to process lazily, item by item, for ETL purposes. There is a run-processing function that does a lot of work with side effects: DB interactions, writing to logs, etc.
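
For context, the real run-processing is driven by config and multimethods, touches the DB, logs, etc., and fully consumes whatever parsed XML it is given. A hypothetical stub (write-to-db! and transform are placeholders, not my real code) might look like:

(defn run-processing [conn config parsed-xml]
  ;; hypothetical stub: the real function does DB writes, logging, etc.
  (doseq [item (:content parsed-xml)]
    (write-to-db! conn (transform config item))))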

When I apply this function to the file, everything runs smoothly:

...
(with-open [source (-> "in.xml"
                       io/file
                       io/input-stream)]
   (-> source
       xml/parse
       ((fn [x]
          ;; runs fine
          (run-processing conn config x)))))

But when I put the same function into any kind of loop (like doseq), I get an OutOfMemoryError ("GC overhead limit exceeded").

...
(with-open [source (-> "in.xml"
                       io/file
                       io/input-stream)]
  (-> source
      xml/parse
      ((fn [x]
         ;; throws OOM GC overhead exception
         (doseq [i [0]]
            (run-processing conn config x))))))

I don't understand where the head retention happens that causes the GC overhead error. I've already tried run! and even loop/recur instead of doseq; the same thing happens.

Is there something wrong with my run-processing function? Then why does it behave fine when I call it directly? Kinda confused, any help is appreciated.

Twice_Twice
  • Have you tried increasing the JVM heap size? Could it be that doseq itself allocates just enough memory to cause the OOM? – Igor Kharin Feb 22 '18 at 10:41
  • Also, you don't want to put run-processing in a loop. I suppose you are using data.xml, which is lazy, but x is a seq head. Once run-processing runs, the underlying sequence is fully realized and will be kept in memory for the duration of the doseq. – Igor Kharin Feb 22 '18 at 10:54
  • Have a look at https://clojuredocs.org/clojure.core/dorun ? – nha Feb 22 '18 at 11:10
  • @IgorKharin so the sequence will be kept in memory when I use any sort of loop, but not if I call run-processing directly? But the doseq docstring clearly states that it "Does not retain the head of the sequence". Then what should I do when I need to call run-processing several times (e.g. with different arguments)? – Twice_Twice Feb 22 '18 at 15:15
  • @nha, I've already tried to use dorun with the same result. What exactly should I notice at clojuredocs? – Twice_Twice Feb 22 '18 at 15:16
  • Can you post what run-processing is doing with its third parameter? How does it work through the seq it gets? – Jonah Benton Feb 22 '18 at 16:52

3 Answers

5

To understand why your doseq doesn't work, we first have to understand why (run-processing conn config x) works:

The magic of Clojure here is locals clearing: the compiler analyzes the code, and once a local binding is used for the very last time, it is set to (Java) null before that expression runs. So for

(fn [x]
  (run-processing conn config x))

The x will be cleared before running run-processing. Note: You can get the same OOM error when disabling locals clearing (a compiler option).
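
You can check this yourself: locals clearing can be disabled via a JVM system property (the property is real; the project snippet below is just a sketch of one way to set it):

;; e.g. in a Leiningen project.clj, or as a plain -D flag on the java command
;; line; with clearing disabled, even the original "direct call" version
;; should retain the head, because x is never nulled before run-processing
;; walks the whole sequence
:jvm-opts ["-Dclojure.compiler.disable-locals-clearing=true"]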

Now what happens when you write:

(doseq [_ [0]]
  (run-processing conn config x))

How should the compiler know when x is used for the very last time so it can clear it? It can't possibly know: x is used within a loop. So it is never cleared, and x retains the head.

Note: a smart JVM implementation could possibly change this in the future, once it understands that the local memory slot can no longer be reached from the calling frame, and make the binding available to the garbage collector. Current implementations aren't that smart, though.

Of course it's easy to fix: don't use x within a loop. Use another construct like run!, which is just a function call, so the local is properly cleared before run! is invoked. Keep in mind, though, that if you pass the head of the seq to a function, that function will hold onto the head until it (the closure) goes out of scope.
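
A minimal sketch of what that could look like here (process-item! and the use of :content are assumptions about your data shape, not code from the question):

(with-open [source (io/input-stream (io/file "in.xml"))]
  (->> (xml/parse source)
       :content                               ; assumes items are direct children of the root
       (run! #(process-item! conn config %))))

Because the parsed seq is only ever passed as an argument and never used inside a loop body, nothing pins its head while run! walks it.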

ClojureMostly
  • So, the problem is: by writing `(fn [x] ...)` I create a local binding to `x` that cannot be cleared (and the only reason it was cleared in the first example was locals clearing)? But what if I want to process the same file twice (e.g. I have different steps of processing)? Should I open it twice too? What if I have a stream opened instead of a file? – Twice_Twice Feb 23 '18 at 11:23
  • No, the local fn parameter can certainly be cleared, just like it is in your first example. Again, and very important: the Clojure compiler clears the binding (sets it to `null`) **at its last usage**. The compiler can't do that if you use the binding in a loop (`doseq`): when should it clear it? It isn't smart enough to clear it only on the last iteration. Btw, not sure why you accepted the other answer... – ClojureMostly Feb 23 '18 at 12:37
  • I've already tried `run!` with the same results. Even if I do `(run! (partial run-processing conn config) [0 1])` and use no bindings at all. – Twice_Twice Feb 23 '18 at 12:40
  • Right, same problem: if you iterate over the seq twice you'll need the head of the seq for the second iteration. There is no way around it. Think about it: how could the second iteration work if you had garbage collected the sequence? You need to do the work in one iteration instead of two; then you can garbage collect while iterating. – ClojureMostly Feb 23 '18 at 12:54
  • Then what should be done if I need to process same stream in several steps/iterations? And what did you mean when you wrote: *"Of course it's easy to fix it: Don't use x within a loop. Use other constructs like run! which is just a function call and will properly clear the local before invoking run!"*? – Twice_Twice Feb 23 '18 at 13:01
  • Well, that statement was a **direct** answer to your original question. You asked why the first code works (answer: locals clearing) but the code wrapped in `doseq` doesn't (locals clearing fails). If you need to walk the seq twice, there is simply no way around it. You should be able to write it so that you only walk it once; if absolutely necessary, read the input twice. So your second question isn't answered in my answer; I simply answered your original question. – ClojureMostly Feb 23 '18 at 15:26
2

While I don't know exactly what's causing OOM, I'd like to provide some general suggestions and elaborate on our discussion in the comments.

So the sequence will be kept in memory when I use any sort of loop, but not if I call run-processing directly? But the doseq docstring clearly states that it "Does not retain the head of the sequence". Then what should I do when I need to call run-processing several times (e.g. with different arguments)?

So here's our function:

(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (-> (xml/parse source)
        ((fn [x]
           (doseq [i [0]]
             (run-processing conn config x)))))))

Where x is a lazy-seq (if you're using data.xml) like:

x <- xml iterator <- file stream

If run-processing is doing everything right (fully consumes x and returns nil), there's nothing wrong with it; the problem is the x binding itself. While run-processing runs, it fully realizes the sequence that x is the head of.

(defn process-xml! [conn config x]
  (run-processing conn config x)
  ;; X IS FULLY REALIZED IN MEMORY
  (run-reporting conn config x))

(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (process-xml! conn config))))

As you can see, we're not consuming the file item by item and immediately throwing the items away, all thanks to x. doseq has nothing to do with this: it "does not retain the head of the sequence" it consumes, which is [0] in our case.
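
To see the difference in isolation, here's a small example that is not from your code: doseq happily streams through a seq it iterates directly, but it cannot release a seq that an outer binding still references while the body runs:

;; fine: doseq walks the seq it iterates and drops each chunk as it goes
(doseq [n (map inc (range 1e8))]
  nil)

;; risky: xs is used inside the loop body, so the compiler cannot clear it,
;; and fully realizing the seq pins every element in memory
(let [xs (map inc (range 1e8))]
  (doseq [i [0]]
    (reduce + xs)))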


This approach is not very idiomatic for two reasons:

1. run-processing is doing too much

It knows where the data is coming from, in what shape, how to process it, and what to do with it. A more typical process-file! would look like this:

(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (find-item-nodes)
         (map node->item)
         (run! (partial process-item! conn config)))))

This is not always viable and doesn't fit every use case, but there's one more reason to do it this way.

2. process-file! should (ideally) never give x to anyone

This one is immediately obvious from looking at your original code: it's using with-open. query from clojure.java.jdbc is a good example. What it does is get a ResultSet, map it to pure Clojure data structures, and force it to be fully read (with a result-set-fn of doall) so the connection can be freed.

Notice how it never leaks the ResultSet; the only option is to get the result seq via result-set-fn, which is a "callback": query wants to control the ResultSet lifecycle and make sure it's closed once query returns. Otherwise it would be too easy to make a similar mistake.

(We can still plug our own processing in, though, by passing a function similar to process-xml! as result-set-fn.)
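
For illustration only (a sketch, not code from your project; with-parsed-xml! is an invented name), the same idea applied here would be to accept a callback and never let the seq escape:

;; the parsed seq never escapes this function; the caller supplies a callback,
;; and the stream is guaranteed to be closed before we return
(defn with-parsed-xml! [name f]
  (with-open [source (io/input-stream (io/file name))]
    (f (xml/parse source))))

;; usage: all the work happens in a single pass inside the callback
(with-parsed-xml! "in.xml"
                  (fn [x]
                    (run! (partial process-item! conn config)
                          (find-item-nodes x))))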


Answers to comments

As I've said, I can't tell you exactly what's causing OOM. It could be:

  1. run-processing itself. The JVM is low on memory anyway, and adding a simple doseq is enough to tip it into OOM. That's why I suggested slightly increasing the heap size as a test (see the snippet after this list).

  2. Clojure optimizes the x binding away.

  3. (fn [x] (run-processing conn config x)) is simply inlined by the JVM, subsequently fixing the issue with the x binding.
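
If you want to test the first hypothesis, a small heap bump is enough (a sketch assuming a Leiningen project; the number is arbitrary):

;; in project.clj: give the JVM a bit more room and re-run the doseq version;
;; if it still dies, the problem is head retention rather than headroom
:jvm-opts ["-Xmx2g"]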

So why does wrapping run-processing in doseq make x retain the head? In my examples I don't use x more than once (contrary to your "run-processing x THEN run-reporting on SAME x").

The root of the problem is not the fact of reusing x; it's the sole fact that x exists. Let's make a simple lazy seq:

(let [x (range 1 1e6)])

(Let's forget that range is implemented as a Java class.)

What is x? x is a lazy seq head, which is a recipe for constructing the next value.

x = (recipe)

Let's advance it:

(let [x (range 1 1e6)
      y (drop 5 x)
      z (first y)])

Here are x, y and z now:

x = (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (recipe)
y = (6) -> (recipe)
z = 6

Hope you can see now what I mean by "x is a seq head and run-processing realizes it".

About "process-file! should (ideally) never give x to anyone" - correct me if I'm wrong, but doesn't mapping to pure Clojure data structures with doall makes them reside in memory, which would be bad if the file is too big (as in my case)?

process-file! doesn't use doall. run! is a reduce and returns nil.
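
For reference, run! in clojure.core is essentially this (a close paraphrase of its actual implementation):

(defn run!
  "Runs proc on every element of coll for side effects; returns nil."
  [proc coll]
  (reduce #(proc %2) nil coll)
  nil)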

Igor Kharin
  • Thanks for your elaborate answer, but I still miss something. 1) I use the `x` binding in both scenarios: calling run-processing directly (first code fragment) and in the doseq loop (second code fragment). So why does wrapping run-processing in doseq make x retain the head? In my examples I don't use x more than once (contrary to your "run-processing x THEN run-reporting on SAME x"). – Twice_Twice Feb 23 '18 at 06:45
  • 2) In my case run-processing is supposed to do a lot: it's a top-level function that does `find-item-nodes`, mapping and other stuff, and is driven by multimethods and the `config` parameter. 3) About "process-file! should (ideally) never give x to anyone" - correct me if I'm wrong, but doesn't mapping to pure Clojure data structures with `doall` make them reside in memory, which would be bad if the file is too big (as in my case)? – Twice_Twice Feb 23 '18 at 06:45
  • @Twice_Twice "holding onto head" happens to the best of us. Finally found [the video](https://youtu.be/VC_MTD68erY?t=10m23s)! – Igor Kharin Feb 23 '18 at 09:00
  • I think I understand better now (thanks for the video btw!) and the gist is: don't put a lazy seq into any kind of binding like `let`, and don't create an anonymous `fn` that binds the seq :) – Twice_Twice Feb 23 '18 at 11:26
  • I guess this answer wasn't accurate after all. It seems the reason for the head retention *was* my use of a loop, and not the `x` binding. – Twice_Twice Feb 23 '18 at 13:12
  • @Twice_Twice I guess I really suck at explaining this, I'm sorry. :( Again, the problem is not with `doseq`, but in the fact that we have a seq head reference and its underlying sequence gets fully realized. Q: Is there any difference between `(do (run-processing x) (run-processing x))` and `(dotimes [i 2] (run-processing x))`? That's exactly what I was trying to illustrate with `process-xml!`: there's no difference and it's not because of the loop. – Igor Kharin Feb 23 '18 at 14:12
-1

Can you post a concrete example, even if it is too small to generate an OOM exception?

The first thing I see is that you are creating a function using (fn [x] ...) and then immediately calling it with a 2nd pair of parentheses:

   (-> source
       xml/parse
       ((fn [x]
          ;; runs fine
          (run-processing conn config x)))))

This looks very strange to me. Why are you structuring the code this way?

In the failing doseq example, you have the same structure:

  (-> source
      xml/parse
      ((fn [x]
         ;; throws OOM GC overhead exception
         (doseq [i [0]]
            (run-processing conn config x))))))

You will also notice that the upper bound in the doseq is a one-element vector, with a strange symbol inside. Is this meant to be "infinity" or something? If so, why is it wrapped in a vector? This looks like a problem (or perhaps a clojure.core bug), since a doseq loop over a one-element vector should run exactly once.

Another point, the loop variable i is never used - is this intentional? It seems very different from the 1st (working) example.

Also, it is possible that (depending on the details of your code) some interaction between creating a function, which contains a doseq, and then calling it immediately, is the cause of the problem.

Update:

Re the (fn [x] ...) form, I would write it like so:

(-> source
    xml/parse
    #(run-processing conn config %))

or

(->> source     ; note "thread-last" macro
     xml/parse
     (run-processing conn config))

Perhaps for the doseq, you intended something more like this:

(-> source
    xml/parse
    #(doseq [single-item %]
       (run-processing conn config single-item))))

However, in this case we are calling run-processing many times for a single item at a time, whereas before we were calling run-processing once and passing in the whole lazy result from xml/parse.

Alan Thompson
  • Why does it look strange to you? I thought that was standard practice when you use threading macros, same as `(#(run-processing %))`. The `doseq` one-element vector is merely to illustrate the point of the question. – Twice_Twice Feb 23 '18 at 06:53
  • Aren't you supposed to wrap *anonymous functions* in additional parentheses? I think you're mistaken. – Twice_Twice Feb 23 '18 at 11:47