I am trying to parse a fairly small (< 100 MB) XML file with:

(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

(xml/parse (io/reader "data/small-sample.xml"))

and I am getting an error:

OutOfMemoryError Java heap space
    clojure.lang.Numbers.byte_array (Numbers.java:1216)
    clojure.tools.nrepl.bencode/read-bytes (bencode.clj:101)
    clojure.tools.nrepl.bencode/read-netstring* (bencode.clj:153)
    clojure.tools.nrepl.bencode/read-token (bencode.clj:244)
    clojure.tools.nrepl.bencode/read-bencode (bencode.clj:254)
    clojure.tools.nrepl.bencode/token-seq/fn--3178 (bencode.clj:295)
    clojure.core/repeatedly/fn--4705 (core.clj:4642)
    clojure.lang.LazySeq.sval (LazySeq.java:42)
    clojure.lang.LazySeq.seq (LazySeq.java:60)
    clojure.lang.RT.seq (RT.java:484)
    clojure.core/seq (core.clj:133)
    clojure.core/take-while/fn--4236 (core.clj:2564)

Here is my project.clj:

(defproject dats "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                [org.clojure/data.xml "0.0.7"]
                [criterium "0.4.1"]]
  :jvm-opts ["-Xmx1g"])

I tried setting LEIN_JVM_OPTS and JVM_OPTS in my .bash_profile, without success.

When I tried the following project.clj:

(defproject barber "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                [org.clojure/data.xml "0.0.7"]
                [criterium "0.4.1"]]
  :jvm-opts ["-Xms128m"])

I get the following error:

Error occurred during initialization of VM
Incompatible minimum and maximum heap sizes specified
Exception in thread "Thread-5" clojure.lang.ExceptionInfo: Subprocess failed {:exit-code 1}

Any idea how I could increase the heap size for my Leiningen REPL?

Thanks.

Nicolas M.
  • Are you storing some data (the XML parsing result) in an array? If so, how big is it? – Chiron Aug 07 '13 at 07:55
  • Are you invoking the second line from the REPL? – Leon Grapenthin Aug 07 '13 at 11:37
  • Chiron: Not storing the XML in any data structure yet. Just calling the parse method like in my post. Igrapenthin: Yes, I am invoking the parsing line from the REPL. The file is 50MB, unzipped. – Nicolas M. Aug 07 '13 at 15:28
  • As I note in my answer, every value returned at the top level of the REPL is fully evaluated (even if it would otherwise be lazy) and stored, first as *1, then as *2, etc. – noisesmith Aug 08 '13 at 16:31

2 Answers

Any form evaluated at the top level of the repl is realized in full, as a result of the print step of the Read-Eval-Print-Loop. It is also stored in the heap, so that you can later access it via *1.

If you store the return value as follows:

(def parsed (xml/parse (io/reader "data/small-sample.xml")))

this returns immediately, even for a file hundreds of megabytes in size (I have verified this locally). You can then walk the clojure.data.xml.Element tree that is returned; each part of the result is realized only as it is parsed from the input stream.

If you do not hold on to the elements (that is, you do not bind them to anything that keeps them reachable), you can iterate over the entire structure without using more RAM than it takes to hold a single node of the XML tree.

user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.739795 msecs"
#'user/n
user> (time (keys n))
"Elapsed time: 0.025683 msecs"
(:tag :attrs :content)
user> (time (-> n :tag))
"Elapsed time: 0.031224 msecs"
:catalog
user> (time (-> n :attrs))
"Elapsed time: 0.136522 msecs"
{}
user> (time (-> n :content first))
"Elapsed time: 0.095145 msecs"
#clojure.data.xml.Element{:tag :book, :attrs {:id "bk101"}, :content (#clojure.data.xml.Element{:tag :author, :attrs {}, :content ("Gambardella, Matthew")} #clojure.data.xml.Element{:tag :title, :attrs {}, :content ("XML Developer's Guide")} #clojure.data.xml.Element{:tag :genre, :attrs {}, :content ("Computer")} #clojure.data.xml.Element{:tag :price, :attrs {}, :content ("44.95")} #clojure.data.xml.Element{:tag :publish_date, :attrs {}, :content ("2000-10-01")} #clojure.data.xml.Element{:tag :description, :attrs {}, :content ("An in-depth look at creating applications \n      with XML.")})}
user> (time (-> n :content count))
"Elapsed time: 48178.512106 msecs"
459000
user> (time (-> n :content count))
"Elapsed time: 86.931114 msecs"
459000
;; redefining n so that we can test the performance without the pre-parsing done when we counted
user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.702885 msecs"
#'user/n
user> (time (doseq [el (take 100 (drop 100 (-> n :content)))] (println (:tag el))))
:book
:book
.... ;; output truncated
"Elapsed time: 26.019374 msecs"
nil
user> 

Notice that it is only when I first ask for the count of the content of n (thus forcing the whole file to be parsed) that the huge time delay occurs. If I doseq across subsections of the structure, this happens very quickly.
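
To illustrate processing the tree without holding on to it, here is a minimal sketch that tallies the root's child elements by tag. The name count-tags is my own, and I am assuming that reducing over :content directly (rather than over a def'd var such as n) lets visited nodes be garbage collected; I have not measured this against the asker's file.

```clojure
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

(defn count-tags
  "Counts the child elements of the root, grouped by tag.
  `source` is anything io/reader accepts (a path, a File, a Reader, ...).
  Hypothetical helper, not part of clojure.data.xml."
  [source]
  (with-open [rdr (io/reader source)]
    ;; reduce walks the lazy :content seq one node at a time; the root
    ;; element is never bound to a var, so nothing keeps the head alive.
    (reduce (fn [acc el]
              ;; text nodes are plain strings, for which (:tag el) is nil
              (if-let [tag (:tag el)]
                (update-in acc [tag] (fnil inc 0))
                acc))
            {}
            (:content (xml/parse rdr)))))
```

Calling (count-tags "data/small-sample.xml") returns a small map, so the REPL has nothing large to print or to store in *1.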

noisesmith
  • Thanks for the answer. I understand the point of lazy evaluation but in my case the call to (time (-> n :content count)) would also result in a java.lang.OutOfMemoryError: Java heap space error. Overall, I am trying to find a way to get more than 50MB of heap but cannot figure it out. – Nicolas M. Aug 08 '13 at 18:33
  • The error message "Incompatible minimum and maximum heap sizes specified" indicates to me that somewhere a low maximum is being set, that you somehow need to sidestep or override. The options that gave you that error specified a starting heap size (-Xms) but no max heap size (-Xmx) – noisesmith Aug 08 '13 at 18:43
  • Also, I don't know what you are trying to do, but there is often some reduction approach that accomplishes what you want without needing the whole dataset in memory at once. – noisesmith Aug 08 '13 at 18:45
  • I am playing with the data at this point. A goal would be to extract it from the XML into a structured db with the appropriate associations set up. When I specify both min heap and max heap, I see a similar error. Would you know where to check for default heap size configuration? Any idea if there is a way to know the heap size from the lein repl? Thanks a lot for your help! – Nicolas M. Aug 08 '13 at 18:51
  • `(.maxMemory (java.lang.Runtime/getRuntime))` will show the max memory available; .totalMemory is also available, etc. http://docs.oracle.com/javase/6/docs/api/java/lang/Runtime.html – noisesmith Aug 08 '13 at 19:34
  • Thanks! Both return 1060372480, that should be 1GB. It is unclear to me why parsing a 50MB file would run out of memory. :) – Nicolas M. Aug 08 '13 at 20:26
  • The data structure created by xml/parse is hundreds of times the size of the xml it comes from? Seems unlikely. – noisesmith Aug 08 '13 at 20:58
  • I agree. I believe the heap size is different from what's returned by maxMemory/totalMemory. A simple call to (range 1500000) - an array of 1.5M integers - also returns an OutOfMemoryError Java heap space error. – Nicolas M. Aug 08 '13 at 21:23
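
The heap check discussed in the comments above can be sketched as a small helper that also lists the flags the JVM was actually started with, which may help spot a conflicting -Xms or -Xmx. The name heap-report is hypothetical; the calls used are the standard java.lang.Runtime and RuntimeMXBean methods.

```clojure
(import '(java.lang.management ManagementFactory))

(defn heap-report
  "Returns heap sizes in MB and the startup flags of the JVM that
  evaluates this form. Hypothetical helper for diagnosis only."
  []
  (let [rt (Runtime/getRuntime)
        mb (fn [bytes] (quot bytes (* 1024 1024)))]
    {:max-mb   (mb (.maxMemory rt))    ; upper limit the heap may grow to
     :total-mb (mb (.totalMemory rt))  ; heap currently reserved
     :free-mb  (mb (.freeMemory rt))   ; unused part of the reserved heap
     ;; flags such as -Xmx1g as the JVM received them
     :jvm-args (vec (.getInputArguments
                      (ManagementFactory/getRuntimeMXBean)))}))
```

Evaluated in lein repl, this describes the JVM running the project code. As far as I know, the REPL client runs in Leiningen's own JVM, which this report would not cover.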

I don't know lein as well, but in mvn you can do the following:

mvn -Dclojure.vmargs="-d64 -Xmx2G" clojure:nrepl

(I don't think it matters, but I've always seen it with a capital G; is it case sensitive?)

Pulling 100MB of data into memory should be no problem. I routinely route GB worth of data through my projects.

I always use the server or 64-bit version for large heaps too, and that seems to be what they are doing here:

JVM options using Leiningen

I think the bigger problem, though, is that as you have it written this might be evaluated at compile time. You need to wrap that call in a function and defer its execution. I think the compiler is trying to read that file, and that's likely not what you want. I know with mvn you get different memory settings for compile vs. run, and you might be getting that too.
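
Wrapping the call in a function could look like the sketch below. The name parse-sample is hypothetical; the path is the one from the question.

```clojure
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

;; Hypothetical helper: loading or compiling this form reads nothing;
;; the file is opened only when the function is called.
(defn parse-sample
  []
  ;; The reader is left open so the lazy tree can still be read later;
  ;; the caller is responsible for finishing with it.
  (xml/parse (io/reader "data/small-sample.xml")))

;; At the REPL, bind the result rather than letting it print:
;; (def parsed (parse-sample))
```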

DrLivingston