Merge (group-by) huge sequences lazily in clojure

Question

EXAMPLE:

We have two time-series lazy sequences of maps, created by reading CSV files. The two lazy sequences start on different days:

INPUT
 lazy-seq1
  ({:date "20110515" :val1 123}
   {:date "20110516" :val1 143}
   {:date "20110517" :val1 1153} ...)
 lazy-seq2
  ({:date "20110517" :val2 151}
   {:date "20110518" :val2 1330} ...)
EXPECTED OUTPUT
 lazy-seq3 
  ({:date "20110515" :vals {:val1 123}}
   {:date "20110516" :vals {:val1 143}}
   {:date "20110517" :vals {:val1 1153 :val2 151}}
   {:date "20110518" :vals {:val1 ... :val2 1330}}
   ...)

To be exact, the type of :date is not a string but a Joda-Time value coerced by clj-time, and each sequence is sorted by :date.

My first thought was to use the group-by function, but I believe group-by needs eager evaluation, so it cannot produce a lazy seq.

My second thought was to use the partition-by function, but I could not work out how to apply it to my inputs with my current Clojure skills.

Each input seq is quite large (~1GB per sequence) and I want to process many (~100) sequences at once, so I need lazy evaluation to avoid an OutOfMemoryError.

Are your input sequences sorted by date? – leetwinski, Jun 16 '16 at 14:45

leetwinski · Accepted Answer · 2016-06-16T15:24:43.177

If your items are sorted by date, you can easily merge them lazily (as in the merge step of the merge sort algorithm):

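(defn merge-lazy [seq1 seq2]
  (cond
    ;; one side is exhausted: the rest of the other side is already sorted
    (empty? seq1) seq2
    (empty? seq2) seq1
    ;; :date is a "yyyyMMdd" string here, so compare it numerically
    (< (Integer/parseInt (:date (first seq1)))
       (Integer/parseInt (:date (first seq2))))
    (cons (first seq1) (lazy-seq (merge-lazy (rest seq1) seq2)))
    ;; on a tie, or when seq2 is earlier, take from seq2
    :else
    (cons (first seq2) (lazy-seq (merge-lazy seq1 (rest seq2))))))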

It yields a lazy sequence sorted by date:

user> (def seq1
        '({:date "20110515" :val1 123}
          {:date "20110516" :val1 143}
          {:date "20110517" :val1 1153}))
#'user/seq1
user> (def seq2 '({:date "20110517" :val2 151}
                  {:date "20110518" :val2 1330}))

user> (merge-lazy seq1 seq2)
({:date "20110515", :val1 123} {:date "20110516", :val1 143} 
 {:date "20110517", :val2 151} {:date "20110517", :val1 1153} 
 {:date "20110518", :val2 1330})

Then you can partition the resulting lazy seq by date (partition-by also returns a lazy seq):

user> (partition-by :date (merge-lazy seq1 seq2))
(({:date "20110515", :val1 123}) 
 ({:date "20110516", :val1 143}) 
 ({:date "20110517", :val2 151} {:date "20110517", :val1 1153})
 ({:date "20110518", :val2 1330}))

The next step is to process each group with map.
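As a minimal sketch of that last step, each group from partition-by can be collapsed into the {:date ... :vals {...}} shape from the question. The helper name group->row is hypothetical, and the sample groups are a shortened copy of the partition-by output above:

```clojure
;; Hypothetical helper: collapse one partition-by group (all maps share
;; the same :date) into a single map with the remaining keys under :vals.
(defn group->row [group]
  {:date (:date (first group))
   :vals (apply merge (map #(dissoc % :date) group))})

;; Sample groups, shaped like the output of (partition-by :date ...).
(def groups
  '(({:date "20110515" :val1 123})
    ({:date "20110517" :val2 151} {:date "20110517" :val1 1153})))

;; map is lazy, so rows are only built as they are consumed.
(map group->row groups)
;; => ({:date "20110515", :vals {:val1 123}}
;;     {:date "20110517", :vals {:val2 151, :val1 1153}})
```

Because map and partition-by are both lazy, chaining them after the merge should keep only the current group in memory, provided nothing holds on to the head of the sequence.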

If you have more input sequences, the same strategy applies: either rewrite merge-lazy to take variable args, or simply reduce with it: (reduce merge-lazy sequences), which also produces a lazy seq merging all the sequences.
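The reduce approach can be sketched as follows. This is a variant rather than the merge-lazy above: the names merge-lazy-by and merge-all are hypothetical, and it orders keys with compare instead of Integer/parseInt, on the assumption that the real :date values (Joda-Time, per the question) implement Comparable. It is shown here with string dates only:

```clojure
;; Hypothetical variant of merge-lazy: the key function is a parameter and
;; keys are ordered with `compare`, so any mutually Comparable keys work.
(defn merge-lazy-by [keyfn seq1 seq2]
  (lazy-seq
    (cond
      (empty? seq1) seq2
      (empty? seq2) seq1
      (neg? (compare (keyfn (first seq1)) (keyfn (first seq2))))
      (cons (first seq1) (merge-lazy-by keyfn (rest seq1) seq2))
      ;; on a tie, or when seq2 is earlier, take from seq2
      :else
      (cons (first seq2) (merge-lazy-by keyfn seq1 (rest seq2))))))

;; Fold any number of sorted sequences into one sorted lazy sequence.
;; The empty-list initial value makes an empty collection of inputs safe.
(defn merge-all [keyfn seqs]
  (reduce (partial merge-lazy-by keyfn) () seqs))

(def s1 [{:date "20110515" :val1 1} {:date "20110517" :val1 3}])
(def s2 [{:date "20110517" :val2 5}])
(def s3 [{:date "20110516" :val3 7}])

(map :date (merge-all :date [s1 s2 s3]))
;; => ("20110515" "20110516" "20110517" "20110517")
```

Wrapping the whole body in lazy-seq means the reduce only builds a chain of unrealized merges, and no input is read until the result is consumed. With around 100 inputs each element passes through a chain of about 100 nested merges, so a variadic merge that picks the minimum head directly may be worth considering if that overhead matters.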
