Parallel processing of huge JSON in Clojure

Question

Our data comes from a DB, and we need to apply some business logic before sending it, so we convert it to Clojure maps for processing. The data is a multi-level nested map, and we have to process each key and value at every level of the map; for this we use clojure.walk/postwalk. Because the data is huge, this takes a long time.

In the data, the first level contains about 5 keys, and the value of each key may be another map or a vector. The nesting can go 10 to 15 levels deep. We tried pmap at the first level, but it is slow. If the data were a simple vector we could use partition, but with such a complex nested structure it is very hard to use partition.

Is there any way to make this process faster? Basically, our requirement is to apply one function to each key and a separate function to each value.
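
For reference, a sequential version of that requirement might look like the sketch below. The name transform-entries, and the choice to leave scalars inside vectors untouched, are assumptions made for illustration and are not taken from the asker's code.

```clojure
(require '[clojure.walk :as w])

(defn transform-entries
  "Applies key-fn to every map key and val-fn to every non-collection map
  value, at every nesting level. Scalars inside vectors are left untouched
  (an assumption of this sketch)."
  [key-fn val-fn data]
  (w/postwalk
    (fn [x]
      (if (map? x)
        ;; children were already walked, so nested maps are already done
        (into {}
              (map (fn [[k v]]
                     [(key-fn k) (if (coll? v) v (val-fn v))]))
              x)
        x))
    data))

(transform-entries name inc {:a 1 :b {:c 2 :d [3 {:e 4}]}})
;; => {"a" 2, "b" {"c" 3, "d" [3 {"e" 5}]}}
```

This is the single-threaded baseline that any parallel variant would have to beat.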

Your question is a bit unclear. Where does your JSON originate? I'd assume that it originates from some sort of Clojure data-structure, and that you're using cheshire or something to transform edn to JSON? Why can't you do the processing of keys and values before translating to JSON? Or are you receiving tons of JSON from somewhere, and you need to process it after having transformed it to edn? — slipset, Apr 12 '18 at 12:54
`pmap` is great and works just like `map`. If each individual bit of data isn't large enough to justify being its "own job", you can `partition` the data into jobs. — Carcigenicate, Apr 12 '18 at 13:04
I'd post an example of how I like to use `partition`/`pmap`, but I'd be making wild assumptions about your data. Post enough context for us to work with. — Carcigenicate, Apr 12 '18 at 14:43
Actually, thinking it over again, idk if `pmap` is actually applicable here. Parallelizing recursion isn't easy. — Carcigenicate, Apr 12 '18 at 16:55

score 2 · Answer 1 · answered Apr 16 '18 at 20:53

I had good luck with a two-pass approach using futures. Basically, you walk the entire tree once, wrapping each transformation in a future. Then you walk the tree a second time, dereferencing each future. I thought a two-pass would be too costly, but I tried it with a fairly large nested tree, and it was significantly faster than just using postwalk.

The test case I'm using is finding the nth prime to simulate an expensive operation. The tree is a nested map of keyword/number pairs. All the numbers that are found are transformed into the 250th prime that's found.

The test data I'm using is this mess:

(def giant-tree
  {:a 28,
   :e {:d {:a 37,
           :e 92,
           :d {:b {:c 91,
                   :d {:e 12,
                       :a 22,
                       :d {:e {:a {:a 53}, :d 98},
                           :d {:b 23,
                               :a {:a {:a 97},
                                   :c {:c 47,
                                       :d {:c {:d {}},
                                           :e {:e 57,
                                               :d {:a 57,
                                                   :d 42,
                                                   :e {:d {:e 64,
                                                           :a {:d {:b 14,
                                                                   :d {:c {},
                                                                       :b {},
                                                                       :a {:b {:b 86,
                                                                               :a {:d 86, :c 52},
                                                                               :d {:d {:a {},
                                                                                       :c {:a {}, :c 0, :b {:c 29}},
                                                                                       :d 88},
                                                                                   :c {:c 88},
                                                                                   :a {:c 89, :a {:a 42, :c 62}},
                                                                                   :b 30},
                                                                               :e 60},
                                                                           :c {:e 18,
                                                                               :d {:e {}, :d 70, :b 90},
                                                                               :b {:a {:a 1}}}}},
                                                                   :e 47,
                                                                   :c 19},
                                                               :c {:a 56,
                                                                   :c {:a {:a 73,
                                                                           :e 39,
                                                                           :d 21,
                                                                           :b {:e {:d {}, :b 82, :c 12, :a 80},
                                                                               :a {:a 22,
                                                                                   :e {:b {:b {:b 20, :a 50}}, :c 23},
                                                                                   :b 55,
                                                                                   :d 80},
                                                                               :c 13}},
                                                                       :e 15},
                                                                   :b 68,
                                                                   :d 58},
                                                               :a 49},
                                                           :b 5},
                                                       :c 38}},
                                               :a {:a {:d 35, :a 99}},
                                               :c {:d {}}},
                                           :b {},
                                           :d 95}}},
                               :d {:b {:c 99}, :c 83, :e 61, :d 55},
                               :c {:b {:c 97,
                                       :a {:a {:b 86, :a {}, :e {:a 52, :c 20, :e 20}, :d 49}, :c 62},
                                       :d {:c 97,
                                           :d {:d {:d {:a 46, :c 90, :d {}, :e 88}, :e {:a 14, :c 48}},
                                               :c {},
                                               :a 87,
                                               :e 66}},
                                       :e 9}}}},
                       :b 64},
                   :a 4,
                   :e 19},
               :a {},
               :e 9}}}})

And I'm using Criterium for benchmarking.

This is the code that I'm testing with:

(ns fast-tree-transform.fast-tree-transform
  (:require [fast-tree-transform.test-data :as td]
            [clojure.walk :as w]
            [criterium.core :as c]))

(def default-price 250)

(defn prime? [n]
  (not
    (or (zero? n)
        (some #(zero? (rem n %)) (range 2 n)))))

(defn nth-prime [n]
  (nth (filter prime? (range))
       n))

(defn expensive-transform [e]
  (if (number? e)
    (nth-prime default-price)
    e))

; ----- Simple usage without any parallel aspect
(defn transform-data [nested-map]
  (w/postwalk expensive-transform nested-map))

; ----- Puts each call in a future so it's run in a thread pool
(defn future-transform [e]
  (if (number? e)
    (future (expensive-transform e))
    e))

; ----- The second pass to resolve each future
(defn resolve-transform [e]
  (if (future? e)
    @e
    e))

; ----- Tie them both together
(defn future-transform-data [nested-map]
  (->> nested-map
      (w/postwalk future-transform)
      (w/postwalk resolve-transform)))

The two main functions of interest are transform-data and future-transform-data.

Here are the results:

(c/bench
  (transform-data td/giant-tree))

Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.085124 sec
    Execution time std-deviation : 38.049523 ms
   Execution time lower quantile : 1.062980 sec ( 2.5%)
   Execution time upper quantile : 1.193548 sec (97.5%)
                   Overhead used : 3.088370 ns

Found 4 outliers in 60 samples (6.6667 %)
    low-severe   4 (6.6667 %)
 Variance from outliers : 22.1802 % Variance is moderately inflated by outliers

(c/bench
  (future-transform-data td/giant-tree))

Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 526.771107 ms
    Execution time std-deviation : 14.202895 ms
   Execution time lower quantile : 513.002517 ms ( 2.5%)
   Execution time upper quantile : 568.856393 ms (97.5%)
                   Overhead used : 3.088370 ns

Found 5 outliers in 60 samples (8.3333 %)
    low-severe   1 (1.6667 %)
    low-mild     4 (6.6667 %)
 Variance from outliers : 14.1940 % Variance is moderately inflated by outliers

You can see it's about twice as fast: a mean of roughly 527 ms with futures versus 1.085 sec with plain postwalk.
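
One thing to watch with really huge trees: as far as I know, future runs its work on an unbounded cached thread pool, so one future per leaf can mean a lot of threads. Below is a rough sketch of the same two-pass idea using a fixed-size pool instead. The name pooled-transform-data and the n-threads parameter are made up for this example, and I have not benchmarked it.

```clojure
(require '[clojure.walk :as w])
(import '(java.util.concurrent Executors ExecutorService Callable))

(defn pooled-transform-data
  "Two-pass walk like future-transform-data, but submits the work to a
  fixed-size thread pool. f is applied to numbers only."
  [f n-threads nested-map]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      (->> nested-map
           ;; pass 1: replace each number with a java.util.concurrent.Future
           (w/postwalk (fn [e]
                         (if (number? e)
                           (let [^Callable task (fn [] (f e))]
                             (.submit pool task))
                           e)))
           ;; pass 2: block on each Future and put its result into the tree
           (w/postwalk (fn [e]
                         (if (future? e) (deref e) e))))
      (finally
        ;; lets already-submitted tasks finish, then releases the threads
        (.shutdown pool)))))

(pooled-transform-data inc 4 {:a 1 :b {:c [2 3]}})
;; => {:a 2, :b {:c [3 4]}}
```

Both future? and deref accept plain java.util.concurrent.Future objects, which is why the second pass can stay the same as in the code above.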

score 0 · Answer 2 · answered Apr 16 '18 at 05:55

Depending on the nature of the data (e.g. the number of first-level keys and how balanced the nesting is under these keys) and your hardware (number of CPU cores), it is possible that the approach you've tried (pmap at the first level) is the best you can do.
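
For comparison, pmap at the first level could be sketched roughly like this. The function name pmap-top-level is hypothetical, and I am assuming the top-level keys themselves do not need transforming.

```clojure
(require '[clojure.walk :as w])

(defn pmap-top-level
  "Runs (postwalk f subtree) for each top-level value in parallel via pmap.
  Top-level keys are passed through unchanged in this sketch."
  [f m]
  (into {}
        (pmap (fn [[k v]] [k (w/postwalk f v)]) m)))

(pmap-top-level #(if (number? %) (inc %) %) {:a 1 :b {:c 2}})
;; => {:a 2, :b {:c 3}}
```

With only about 5 top-level keys, at most 5 subtrees are processed at once, and the total time is bounded below by the largest subtree, which may be why this approach looked slow.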

A relatively simple approach to parallelizing over the nested map structure is, in essence, just to "flatten" the map so that each key is actually a vector of keys that represents the path to the value (a leaf in the original nested map). For example:

(defn extract-keys
  "Returns a seq of vectors that are the paths of keys to the leaves of map m."
  [m]
  (mapcat (fn [[k v]]
            (if (map? v)
              (map #(cons k %)
                   (extract-keys v))
              [[k]]))
          m))

(def data {:a {:b {:c {:d [1 2] :e [3 4 5 6]}
                   :f [7]}
               :g [8 9 10]}
           :h [11 12 13 14 15 16]})

;; Prints ((:a :b :c :d) (:a :b :c :e) (:a :b :f) (:a :g) [:h])
(println (extract-keys data))

You can then use pmap over this flattened map:

(defn- map-leaves
  [f m]
  (->> (extract-keys m)
       (pmap #(vector % (f (get-in m %))))
       (reduce (fn [m [k v]]
                 (assoc-in m k v))
               {})))

;; Prints {:a {:b {:c {:d 3, :e 18}, :f 7}, :g 27}, :h 81}
(println (map-leaves #(apply + %) data))

This could be straightforwardly modified to mutate the keys (as well as the values), or to partition the [k v] pairs before the pmap to reduce parallelization overhead. Of course, the flattening/unflattening will also have quite a bit of overhead, so whether this will work out to be faster than what you've already tried depends on the nature of the data, your hardware, and the transformation.
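
A rough sketch of both modifications, under some assumptions: leaf-paths is the same idea as extract-keys, while map-leaves-chunked, key-fn, val-fn and chunk-size are names invented for this example. It also assumes key-fn never maps two sibling keys to the same result, and (like the code above) it drops keys whose value is an empty map.

```clojure
(defn leaf-paths
  "Seq of key paths to every non-map leaf of nested map m."
  [m]
  (mapcat (fn [[k v]]
            (if (map? v)
              (map #(cons k %) (leaf-paths v))
              [[k]]))
          m))

(defn map-leaves-chunked
  "Applies key-fn to every key on each path and val-fn to each leaf.
  Paths are grouped into chunks of chunk-size so that each pmap task
  does more work."
  [key-fn val-fn chunk-size m]
  (->> (leaf-paths m)
       (partition-all chunk-size)
       (pmap (fn [paths]
               ;; mapv is eager, so the work happens inside the pmap task
               (mapv (fn [path]
                       [(mapv key-fn path) (val-fn (get-in m path))])
                     paths)))
       (apply concat)
       (reduce (fn [acc [path v]] (assoc-in acc path v)) {})))

(def sample {:a {:b {:c {:d [1 2] :e [3 4 5 6]}
                     :f [7]}
                 :g [8 9 10]}
             :h [11 12 13 14 15 16]})

(map-leaves-chunked name #(apply + %) 2 sample)
;; => {"a" {"b" {"c" {"d" 3, "e" 18}, "f" 7}, "g" 27}, "h" 81}
```

The right chunk-size depends on how expensive val-fn is relative to the pmap overhead, so it is worth measuring with a few different values.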

score 0 · Answer 3 · answered Apr 16 '18 at 11:29

You can use https://github.com/clojure/data.json; json/read-str will do the trick. You could also send the data as a string from the DB, no? :) And yet again, you can use a combination of pr-str and json/read-str.

answered by Kaki Tk
