I'm parsing some XML data from Stack Exchange using clojure.data.xml. For example, if I parse the Votes data, it returns a LazySeq containing a HashMap for each row of data.

What I am trying to do is get the values associated with only certain keys for each row, e.g., (get votes [:Id :CreationDate]). I've tried numerous things, most of them leading to casting errors.

The closest I could get to what I need is (doall (map get votes [:Id :CreationDate])). However, the problem I am experiencing now is that I cannot seem to return more than just the first row, i.e. (1 2011-01-19T00:00:00.000).

Here is an MCVE that can be run in any Clojure REPL, or in the Codepad online IDE.

Ideally, I would like to return some kind of collection or map containing the values I need for each row; the end goal is to write them to something like a CSV file. For example, a collection like:

(1 2011-01-19T00:00:00.000
 2 2011-01-19T00:00:00.000
 3 2011-01-19T00:00:00.000
 4 2011-01-19T00:00:00.000)

(def votes '({:Id "1",
              :PostId "2",
              :VoteTypeId "2",
              :CreationDate "2011-01-19T00:00:00.000"}
             {:Id "2",
              :PostId "3",
              :VoteTypeId "2",
              :CreationDate "2011-01-19T00:00:00.000"}
             {:Id "3",
              :PostId "1",
              :VoteTypeId "2",
              :CreationDate "2011-01-19T00:00:00.000"}
             {:Id "4",
              :PostId "1",
              :VoteTypeId "2",
              :CreationDate "2011-01-19T00:00:00.000"}))

(println (doall (map get votes [:Id :CreationDate])))

Additional detail: If this is of any help/interest, the code I am using to get the above lazy seq is as follows:

(ns se-datadump.read-xml
  (:require
    [clojure.data.xml :as xml]))

(def xml-votes
  "<votes><row Id=\"1\" PostId=\"2\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" />  <row Id=\"2\" PostId=\"3\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" />  <row Id=\"3\" PostId=\"1\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" />  <row Id=\"4\" PostId=\"1\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" /></votes>")

(defn se-xml->rows-seq
  "Returns LazySequence from a properly formatted XML string,
  which contains a HashMap for every <row> element with each of its attributes.
  This assumes the standard Stack Exchange XML format, where a parent element contains
  only a series of <row> child elements with no further hierarchy."
  [xml-str]
  (let [xml-records (xml/parse-str xml-str)]
    (map :attrs (-> xml-records :content))))

; this returns a seq of maps identical to the one in the MCVE:
(def votes (se-xml->rows-seq xml-votes))
Phrancis
  • I'm not sure I understand your intention completely. Could you maybe provide a manually created sample result? Then it's easier to tell. – Anton Harald Aug 28 '16 at 22:29
  • @AntonHarald I added an example desired result, hope this helps make it more clear. – Phrancis Aug 28 '16 at 22:33

2 Answers

You apparently need juxt:

(map (juxt :Id :CreationDate) votes)
;; => (["1" "2011-01-19T00:00:00.000"] ["2" "2011-01-19T00:00:00.000"] ["3" "2011-01-19T00:00:00.000"] ["4" "2011-01-19T00:00:00.000"])

If you need a map out of it:

(into {} (map (juxt :Id :CreationDate) votes))
;; => {"1" "2011-01-19T00:00:00.000", "2" "2011-01-19T00:00:00.000", "3" "2011-01-19T00:00:00.000", "4" "2011-01-19T00:00:00.000"}
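If each row should stay a map that contains only the wanted keys, select-keys is a possible alternative. The following is a minimal sketch; the two-row votes vector is a shortened stand-in for the parsed data, and trimmed is a name made up for this illustration.

```clojure
;; Assumption: a shortened stand-in for the parsed rows from the question.
(def votes
  [{:Id "1" :PostId "2" :VoteTypeId "2" :CreationDate "2011-01-19T00:00:00.000"}
   {:Id "2" :PostId "3" :VoteTypeId "2" :CreationDate "2011-01-19T00:00:00.000"}])

;; Keep only :Id and :CreationDate, still as one map per row.
;; `trimmed` is a hypothetical name used only in this sketch.
(def trimmed
  (map #(select-keys % [:Id :CreationDate]) votes))

(println trimmed)
```

Compared with juxt, this keeps the key names attached to the values, which may be convenient if the column headers are needed later.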
Yuri Steinschreiber

First of all, let me explain what the piece of code you posted on Codepad is actually doing. I doubt that it is what you intended:

(println (doall (map get votes [:Id :CreationDate])))

The crucial part is (map get votes [:Id :CreationDate]). This maps over two collections: the lazy sequence votes and a vector. When map is given more than one collection, it calls the function with the elements at the same position in each collection, and the returned lazy sequence is only as long as the shortest collection provided.
For instance, one can map over an infinite sequence and a finite collection:

(map + (range) [1 2 3])
;; => (1 3 5)

This explains why your result is only two items long:

(map get votes [:Id :CreationDate])

reduces to:

((get (nth votes 0) (nth [:Id :CreationDate] 0))
 (get (nth votes 1) (nth [:Id :CreationDate] 1)))

reduces to:

((get {:Id "1",
       :PostId "2",
       :VoteTypeId "2",
       :CreationDate "2011-01-19T00:00:00.000"} :Id)
 (get {:Id "2",
       :PostId "3",
       :VoteTypeId "2",
       :CreationDate "2011-01-19T00:00:00.000"} :CreationDate))

reduces finally to:

(1 2011-01-19T00:00:00.000)

This is just for illustration. Whether the compiler performs exactly these steps is another question.
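To see the pairing directly, one can replace get with a function that returns its two arguments side by side. This is only a sketch: the three-row votes vector is a shortened stand-in for the real data, and pairs and looked-up are names made up for this illustration.

```clojure
;; Assumption: a shortened stand-in for the parsed rows from the question.
(def votes
  [{:Id "1" :CreationDate "2011-01-19T00:00:00.000"}
   {:Id "2" :CreationDate "2011-01-19T00:00:00.000"}
   {:Id "3" :CreationDate "2011-01-19T00:00:00.000"}])

;; Show which row is paired with which key: row 1 with :Id,
;; row 2 with :CreationDate. Row 3 has no partner and is dropped.
(def pairs
  (map (fn [row k] [(:Id row) k]) votes [:Id :CreationDate]))
;; => (["1" :Id] ["2" :CreationDate])

;; The original expression therefore performs only two lookups.
(def looked-up
  (map get votes [:Id :CreationDate]))
;; => ("1" "2011-01-19T00:00:00.000")

(println pairs)
(println looked-up)
```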

doall is not necessary here, since println already realizes the lazy sequence when it prints it.


As already noted, in your case you'd better use juxt and map only over votes. If you really want the flat sample output, you additionally need to flatten the result:

(flatten (map (juxt :Id :CreationDate) votes))
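As a variation, mapcat produces the same flat sequence in one step, and since the stated end goal is a CSV file, the per-row vectors can also be joined into lines. This is a rough sketch under some assumptions: the two-row votes vector is a stand-in for the real data, flat and csv-text are names made up here, and the plain string join does no quoting or escaping, so a dedicated CSV library would be the safer choice for arbitrary values.

```clojure
(require '[clojure.string :as str])

;; Assumption: a shortened stand-in for the parsed rows from the question.
(def votes
  [{:Id "1" :PostId "2" :VoteTypeId "2" :CreationDate "2011-01-19T00:00:00.000"}
   {:Id "2" :PostId "3" :VoteTypeId "2" :CreationDate "2011-01-19T00:00:00.000"}])

;; mapcat concatenates the per-row vectors, one level deep.
(def flat
  (mapcat (juxt :Id :CreationDate) votes))

;; One comma-separated line per row. Naive: values containing commas,
;; quotes, or newlines would need proper CSV escaping.
(def csv-text
  (str/join "\n"
            (map #(str/join "," ((juxt :Id :CreationDate) %)) votes)))

(println flat)
(println csv-text)
```

The resulting string could then be written out with spit, for example (spit "votes.csv" csv-text), where the file name is just a placeholder.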
Anton Harald