I'm doing what I thought was a fairly straightforward task: run a SQL query (over about 65K rows of data) using the sqlkorma library (http://sqlkorma.com), transform each row in some way, and then write the result to a CSV file. I don't really think that 65K rows is all that large given that I have an 8GB laptop, and I also assumed that a SQL result set would be fetched lazily, so the whole thing would never be held in memory at the same time. So I was really surprised when I ended up with this stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:777)
at clojure.lang.PersistentHashMap.createNode(PersistentHashMap.java:1101)
at clojure.lang.PersistentHashMap.access$600(PersistentHashMap.java:28)
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:749)
at clojure.lang.PersistentHashMap$TransientHashMap.doAssoc(PersistentHashMap.java:269)
at clojure.lang.ATransientMap.assoc(ATransientMap.java:64)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:56)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:100)
at clojure.lang.PersistentArrayMap.createHT(PersistentArrayMap.java:61)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:201)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:29)
at clojure.lang.RT.assoc(RT.java:702)
at clojure.core$assoc.invoke(core.clj:187)
at clojure.core$zipmap.invoke(core.clj:2715)
at clojure.java.jdbc$resultset_seq$thisfn__204.invoke(jdbc.clj:243)
at clojure.java.jdbc$resultset_seq$thisfn__204$fn__205.invoke(jdbc.clj:243)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.PersistentVector.create(PersistentVector.java:51)
at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
at clojure.core$vec.invoke(core.clj:354)
at korma.db$exec_sql$fn__343.invoke(db.clj:203)
at clojure.java.jdbc$with_query_results_STAR_.invoke(jdbc.clj:669)
at korma.db$exec_sql.invoke(db.clj:202)
at korma.db$do_query$fn__351.invoke(db.clj:225)
at clojure.java.jdbc$with_connection_STAR_.invoke(jdbc.clj:309)
at korma.db$do_query.invoke(db.clj:224)
at korma.core$exec.invoke(core.clj:474)
at db$query_db.invoke(db.clj:23)
at main$_main.doInvoke(main.clj:32)
at clojure.lang.RestFn.applyTo(RestFn.java:137)

As far as I can tell from the stack, it has not made it outside the query code (meaning it hasn't reached my transformation/write-to-CSV code at all). If it matters, my SQL is fairly straightforward, basically SELECT * FROM my_table WHERE SOME_ID IS NOT NULL AND ROWNUM < 65000 ORDER BY some_id ASC. This is Oracle (which explains ROWNUM above), but I don't think that matters.

EDIT:

Code sample:

(defmacro query-and-print [q] `(do (dry-run ~q) ~q))
(defn query-db []  
    (query-and-print 
        (select my_table 
            (where (and (not= :MY_ID "BAD DATA")
                        (not= :MY_ID nil)
                        (raw (str "rownum < " rows))))
            (order :MY_ID :asc))))

; args contains rows 65000, and configure-app sets up the jdbc
; connection string, and sets a var with rows value
(defn -main [& args]
    (when (configure-app args) 
        (let [results (query-db)
              dedup (dedup-with-merge results)]
            (println "Result size: " (count results))
            (println "Dedup size: " (count dedup))
            (to-csv "target/out.csv" (transform-data dedup)))))
Kevin
  • Can you edit your OP and add some source code? Also suggest the error block be trimmed down a bit. – octopusgrabbus May 02 '13 at 15:45
  • Done. Not sure what to remove from error block: It shows that my code is not advancing past the `(query-db)` call, and it also shows where inside of `clojure.java.jdbc` the OOM is happening. As a side note, I started looking at the `clojure.java.jdbc` code, and it doesn't _look_ like it's lazy (which is crazy to me). – Kevin May 02 '13 at 16:03
  • when I add `(println "Result type: " (type results))` I get `Result type: clojure.lang.PersistentVector` which I suppose answers my question. – Kevin May 02 '13 at 16:28

2 Answers

clojure.java.jdbc creates lazy sequences:

(defn resultset-seq
  "Creates and returns a lazy sequence of maps corresponding to
   the rows in the java.sql.ResultSet rs. Based on clojure.core/resultset-seq
   but it respects the current naming strategy. Duplicate column names are
   made unique by appending _N before applying the naming strategy (where
   N is a unique integer)."
  [^ResultSet rs]
  (let [rsmeta (.getMetaData rs)
        idxs (range 1 (inc (.getColumnCount rsmeta)))
        keys (->> idxs
               (map (fn [^Integer i] (.getColumnLabel rsmeta i)))
               make-cols-unique
               (map (comp keyword *as-key*)))
        row-values (fn [] (map (fn [^Integer i] (.getObject rs i)) idxs))
        rows (fn thisfn []
               (when (.next rs)
                 (cons (zipmap keys (row-values)) (lazy-seq (thisfn)))))]
    (rows)))

Korma then fully realizes that lazy sequence by forcing every row into a vector:

(defn- exec-sql [{:keys [results sql-str params]}]
  (try
    (case results
      :results (jdbc/with-query-results rs (apply vector sql-str params)
                 (vec rs))
      :keys (jdbc/do-prepared-return-keys sql-str params)
      (jdbc/do-prepared sql-str params))
    (catch Exception e
      (handle-exception e sql-str params))))
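
To work around this without patching Korma, one option for a query this size is to skip Korma's exec and stream the rows with clojure.java.jdbc directly, doing the per-row transform and CSV write inside the with-query-results scope while the connection is still open. Below is a minimal sketch using the clojure.java.jdbc 0.2.x API visible in the stack trace; db-spec, sql-str, and the row formatting are placeholders, not code from the question.

(require '[clojure.java.jdbc :as jdbc]
         '[clojure.java.io :as io]
         '[clojure.string :as string])

; Sketch only: stream rows lazily instead of going through Korma's exec,
; which calls vec on the whole result seq. db-spec and sql-str are assumed
; placeholders.
(defn stream-query-to-csv [db-spec sql-str out-file]
  (jdbc/with-connection db-spec
    (jdbc/with-query-results rows [sql-str]
      ; rows is the lazy seq built by resultset-seq above; walking it once
      ; with doseq, without holding on to its head, keeps only a handful of
      ; rows in memory, but it must be fully consumed before this scope ends
      ; and the connection closes.
      (with-open [w (io/writer out-file)]
        (doseq [row rows]
          (.write w (str (string/join "," (map str (vals row))) "\n")))))))
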
Kevin
  • How can you get around that fact, fetch one row at a time and process? – octopusgrabbus May 02 '13 at 19:33
  • The reason for not being lazy is obvious: you never know when the lazy data will actually be realized, and by that time the underlying SQL connection will be closed – Ankur May 03 '13 at 04:36
  • There are a few pull requests related to this: https://github.com/korma/Korma/pull/66 and https://github.com/korma/Korma/pull/151 – Eelco May 05 '13 at 07:07
  • @Ankur, with very large result sets, it's not feasible to pull the whole result set into memory at once. There's no reason you have to always limit yourself to query results small enough to fit into memory; you just have to be careful to realize the lazy seq while the connection is open. I wrote a macro to make this easy in pull request 66, mentioned above. – Paul Legato May 06 '13 at 04:45

Besides the with-lazy-results route in https://github.com/korma/Korma/pull/66, a completely different way to resolve the problem is to simply increase the heap size available to your JVM by setting the appropriate flag. JVMs are not allowed to use all the free memory on your machine; they are strictly limited to the amount you tell them they're allowed to use.

One way to do this is to set :jvm-opts ["-Xmx4g"] in your project.clj file. (Adjust the exact heap size as necessary.) Another way is to do something like:

export JAVA_OPTS=-Xmx4g
lein repl # or whatever launches your Clojure process
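
For the project.clj route, an illustrative sketch is below; the project name and dependency version are assumptions, not taken from the answer.

; Illustrative project.clj: only the :jvm-opts line matters here; the rest
; is an assumed skeleton.
(defproject my-export "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  :jvm-opts ["-Xmx4g"])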

The with-lazy-results route is better in the sense that you can operate on any sized result set, but it's not merged into mainline Korma and requires some updating to work with recent versions. It's good to know how to adjust the JVM's allowed heap size anyway.
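
As a quick sanity check (an aside, not part of the original answer), you can verify from a REPL how much heap the JVM was actually granted:

; Runtime.maxMemory returns the maximum number of bytes the JVM will
; attempt to use; with -Xmx4g this should print roughly 4096 MB
; (often a bit less, depending on the JVM).
(println (/ (.maxMemory (Runtime/getRuntime)) 1024.0 1024.0) "MB")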

Paul Legato