My high-level understanding is that a transducer does not create intermediate data structures, whereas a long chain of operations via ->> does, and that the transducer version should therefore be faster. This holds true in the first of my examples below. However, when I add clojure.core.async/chan to the mix, I do not get the performance improvement I expect. Clearly there is something I don't understand, and I would appreciate an explanation.
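For concreteness, here is a minimal sketch of the mechanism as I understand it (separate from the benchmark code below): composing map transducers produces a single reducing function, so each element flows through every step in one pass, with no intermediate sequences.

;; Minimal sketch: (comp (map f) (map g)) wraps one reducing function;
;; each element is threaded through f and then g in a single pass.
(let [rf ((comp (map inc) (map inc)) conj)]
  (reduce rf [] [1 2 3])) ; => [3 4 5]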
(ns dev
  (:require [clojure.core.async :as async]
            [criterium.core :as crit]))

;; Set up some toy data.
(def n 1e6)
(def data (repeat n "1"))

;; Reusable thread-last operation (the "slower" one).
(defn tx [x]
  (->> x
       (map #(Integer. %))
       (map inc) (map inc) (map inc) (map inc) (map inc) (map inc)
       (map inc) (map inc) (map inc) (map inc) (map inc)))

;; Reusable transducer (the "faster" one).
(def xf (comp
         (map #(Integer. %))
         (map inc) (map inc) (map inc) (map inc) (map inc) (map inc)
         (map inc) (map inc) (map inc) (map inc) (map inc)))

;; For these first two I expect the second to be faster, and it is.
(defn nested []
  (last (tx data)))

(defn into-xf []
  (last (into [] xf data)))
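(One aside on the harness, which may or may not matter here: on a vector, last walks a seq while peek is constant-time, so a variant like the following hypothetical one would keep the measurement focused on the pipeline rather than on the final element access.)

;; Hypothetical variant: peek is O(1) on a vector, last is O(n).
(defn into-xf-peek []
  (peek (into [] xf data)))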
;; For the next two I again expect the second to be faster, but it is NOT.
(defn chan-then-nested []
  (let [c (async/chan n)]
    (async/onto-chan! c data)
    (->> c
         (async/into [])
         async/<!!
         tx
         last)))

(defn chan-xf []
  (let [c (async/chan n xf)]
    (async/onto-chan! c data)
    (->> c
         (async/into [])
         async/<!!
         last)))
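As a way of framing the question, here is one more variant (a sketch under the same setup; I am not reporting numbers for it): run the transducer eagerly before the channel, so the channel only ferries already-transformed values. If the cost of chan-xf really is per-put overhead, this should behave more like chan-then-nested.

;; Sketch: apply xf eagerly, then use the channel purely for transport.
(defn xf-then-chan []
  (let [c (async/chan n)]
    (async/onto-chan! c (into [] xf data))
    (->> c
         (async/into [])
         async/<!!
         last)))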
(comment
  (crit/quick-bench (nested))           ; 1787.672 ms
  (crit/quick-bench (into-xf))          ;  822.8626 ms
  (crit/quick-bench (chan-then-nested)) ; 1535.628 ms
  (crit/quick-bench (chan-xf))          ; 2072.626 ms

  ;; Expected ranking, fastest to slowest:
  ;; into-xf
  ;; nested
  ;; chan-xf
  ;; chan-then-nested

  ;; Actual ranking, fastest to slowest:
  ;; into-xf
  ;; chan-then-nested
  ;; nested
  ;; chan-xf
  )
In the end there are two results I don't understand. First, why is using a transducer with a channel slower than reading everything off the channel and then doing the nested maps? It appears that the overhead, whatever it is, of using a transducer with a channel is so large that it overwhelms the gain of not creating intermediate data structures. Second, and this one was really unexpected, why is it faster to put the data onto a channel, take it all off, and then use the nested-map technique than it is to skip the channel dance and just use the nested-map technique? (Said shorter: why is chan-then-nested faster than nested?) Could some or all of this just be an artifact of the benchmarking, or randomness? (I have run quick-bench several times for each of these with the same results.) I'm wondering if it has something to do with into calling transduce, whereas the channel version is not implemented the same way at all. A transducer provides the same interface for applying a transformation across vectors or channels, but how that transformation is applied differs, and that difference makes all the difference.
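To make that suspicion concrete, here is my rough mental model of the two code paths (a simplified sketch based on my reading of the clojure.core and core.async sources; the channel part is paraphrased in comments, not actual implementation code):

;; into with a transducer reduces eagerly over the whole input in one
;; tight loop; this is essentially what clojure.core/into does:
(last (persistent! (transduce xf conj! (transient []) data)))

;; A channel built with (async/chan n xf), as I understand it, instead
;; applies the composed reducing function once per put, inside the
;; channel's internal lock, with buffer bookkeeping on every element
;; (and onto-chan! adds go-block scheduling per put), so the constant
;; factor per element is far larger even though the intermediate
;; sequences are gone.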