
I am getting an exception parsing an XML file with clojure.data.xml, because the stream is closing before the parsing is complete.

What I do not understand is why doall is not forcing the evaluation of the XML data before with-open closes it (as suggested by this related answer):

(:require [clojure.java.io :as io]
          [clojure.data.xml :as xml])

(defn file->xml [path] 
  (with-open [rdr (-> path io/resource io/reader)] 
    (doall (xml/parse rdr))))

Which throws the exception:

(file->xml "example.xml")
;-> XMLStreamException ParseError at [row,col]:[80,1926]
Message: Stream closed com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next

If I remove the with-open wrapper, it returns the XML data as expected (so the file is valid, though the reader is then no longer guaranteed to be closed).

I see that (source xml/parse) yields lazy results:

(defn parse
  "Parses the source, which can be an
   InputStream or Reader, and returns a lazy tree of Element records. 
   Accepts key pairs with XMLInputFactory options, see http://docs.oracle.com/javase/6/docs/api/javax/xml/stream/XMLInputFactory.html
   and xml-input-factory-props for more information. 
   Defaults coalescing true."
   [source & opts]
     (event-tree (event-seq source opts)))

so perhaps that is related, but the function I have is very similar to the "round-trip" example on the clojure.data.xml README.

What am I missing here?

nrako

1 Answer


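I was surprised to see this behavior. It appears that clojure.data.xml.Element (the return type) behaves as a kind of "lazy map" that doall does not fully realize: doall walks only the sequence it is handed, so the lazy :content nested inside each element is left untouched.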
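The one-level reach of doall can be seen without any XML. The following is a minimal sketch in plain Clojure; the names `realized`, `nested-lazy`, and `outer` are made up for illustration, and the nested maps only stand in for the parsed tree, they are not what clojure.data.xml actually returns.

```clojure
;; Hypothetical counter, used only to observe when inner items get evaluated.
(def realized (atom 0))

(defn nested-lazy
  "Returns a lazy seq of two maps, each holding its own lazy seq under :content."
  []
  (map (fn [i]
         {:tag     i
          :content (map (fn [j] (swap! realized inc) j) (range 3))})
       (range 2)))

;; doall forces the outer seq, so both maps now exist ...
(def outer (doall (nested-lazy)))

;; ... but the inner :content seqs have not been touched.
(println @realized) ; prints 0

;; Walking each inner seq explicitly is what realizes it.
(doseq [m outer]
  (dorun (:content m)))

(println @realized) ; prints 6
```

If the inner sequences were reading from a stream, closing that stream between the two println calls would cause the same kind of failure as in the question.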

Here is a solution which transforms the lazy values into normal maps:

(ns tst.clj.core
  (:use clj.core clojure.test tupelo.test)
  (:require
    [tupelo.core :as t]
    [clojure.string :as str]
    [clojure.pprint :refer [pprint]]
    [clojure.java.io :as io]
    [clojure.data.xml :as xml]
    [clojure.walk :refer [postwalk]]
  ))
(t/refer-tupelo)

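(defn unlazy
  "Walks coll bottom-up, turning every sequential value into a vector and
   every map-like value into a plain map, which forces full realization."
  [coll]
  (let [unlazy-item (fn [item]
                      (cond
                        (sequential? item) (vec item)
                        (map? item) (into {} item)
                        :else item))]
    (postwalk unlazy-item coll)))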

(defn file->xml [path]
  (with-open [rdr (-> path io/resource io/reader) ]
    (let [lazy-vals    (xml/parse rdr)
          eager-vals   (unlazy lazy-vals) ]
      eager-vals)))
(pprint (file->xml "books.xml"))

{:tag :catalog,
 :attrs {},
 :content
 [{:tag :book,
   :attrs {:id "bk101"},
   :content
   [{:tag :author, :attrs {}, :content ["Gambardella, Matthew"]}
    {:tag :title, :attrs {}, :content ["XML Developer's Guide"]}
    {:tag :genre, :attrs {}, :content ["Computer"]}
    {:tag :price, :attrs {}, :content ["44.95"]}
    {:tag :publish_date, :attrs {}, :content ["2000-10-01"]}
    {:tag :description,
     :attrs {},
     :content
     ["An in-depth look at creating applications\n      with XML."]}]}
  {:tag :book,
   :attrs {:id "bk102"},
   :content
   [{:tag :author, :attrs {}, :content ["Ralls, Kim"]}
    {:tag :title, :attrs {}, :content ["Midnight Rain"]}
    {:tag :genre, :attrs {}, :content ["Fantasy"]}
    {:tag :price, :attrs {}, :content ["5.95"]}
    {:tag :publish_date, :attrs {}, :content ["2000-12-16"]}
    {:tag :description,
     :attrs {},
     :content
     ["A former architect battles corporate zombies,\n      an evil sorceress, and her own childhood to become queen\n      of the world."]}]}
  {:tag :book,
   :attrs {:id "bk103"},
   :content .....

Since clojure.data.xml.Element implements clojure.lang.IPersistentMap, using (map? item) returns true.
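To illustrate the (map? item) point in isolation, here is a small sketch that uses a hypothetical record, `Node`, as a stand-in for Element. This is an assumption made for the example: the real Element type may be defined differently, and all that matters here is that it implements clojure.lang.IPersistentMap. The sketch assumes a recent Clojure version (1.9 or later).

```clojure
(require '[clojure.walk :refer [postwalk]])

;; Hypothetical stand-in for clojure.data.xml's Element (an assumption).
(defrecord Node [tag attrs content])

;; A small tree whose :content is a lazy seq of child nodes.
(def tree
  (->Node :a {} (map #(->Node % {} ()) [:b :c])))

(defn unlazy
  "Same idea as the answer's unlazy: vectors for sequences, plain maps for map-likes."
  [coll]
  (postwalk (fn [item]
              (cond
                (sequential? item) (vec item)
                (map? item)        (into {} item)
                :else              item))
            coll))

;; Records implement IPersistentMap, so map? is true for them.
(println (map? tree))               ; prints true

;; After unlazy, nothing lazy and no record types remain.
(def eager (unlazy tree))
(println (record? eager))           ; prints false
(println (vector? (:content eager))) ; prints true
```

Once the result is made of plain maps and vectors, it no longer holds any reference to the reader, so it can safely leave the with-open block.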

Here is the sample data for books.xml

Please Note:

clojure.data.xml is different from clojure.xml. You may need to explore both libraries to find the one that best fits your needs.

You can also use crossclj.info to find API docs when needed.

Update:

Just a week or so after I saw this question, I ran into a very similar XML parsing problem that needed the unlazy function. You can now find unlazy in the Tupelo library.

Alan Thompson
  • Hmm. Interesting. Thanks for taking the time to clarify what is going on. – nrako Apr 04 '17 at 03:14
  • I wouldn't call this "immune" to doall; it rather seems as if it is another _level_ of lazy things, a lazy sequence of lazy sequences. – Svante Apr 06 '17 at 19:41
  • I haven't dug into the source code, but it appears that it is a "lazy map" type of structure, which is also seen with `datomic.query.EntityMap`. I think the problem is that `doall` is intended only for lazy sequences, not "lazy maps". – Alan Thompson Apr 06 '17 at 20:40