
In many resources on Reducers (like the canonical blog post by Rich Hickey), it's claimed that reducers are faster than the regular collection functions ((map ... (filter ...)) etc.) because there is less overhead.

What is the extra overhead that's avoided? IIUC even the lazy collection functions end up walking the original sequence just once. Is the difference in the details of how intermediate results are computed?
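
For concreteness, here is the kind of seq-based pipeline I mean (a small example of my own):

(reduce + (filter even? (map inc [1 1 1 2])))
;=> 6

Here map and filter each return a lazy seq, and as far as I can tell the input vector is still only walked once when reduce consumes the result.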

Pointers to the relevant places in the Clojure implementation that help explain the difference would be most helpful too.

zlatanski

1 Answer


I think one key insight is in the following passage from the original blog post:

(require '[clojure.core.reducers :as r])
(reduce + (r/filter even? (r/map inc [1 1 1 2])))
;=> 6

That should look familiar - it's the same named functions, applied in the same order, with the same arguments, producing the same result as the Clojure's seq-based fns. The difference is that, reduce being eager, and these reducers fns being out of the seq game, there's no per-step allocation overhead, so it's faster. Laziness is great when you need it, but when you don't you shouldn't have to pay for it.

The realisation of a lazy sequence comes with a (linear) allocation cost: every time another element from the lazy seq is realised, the rest of the seq is stored in a new thunk, and the representation of such a ‘thunk’ is a new clojure.lang.LazySeq object.
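
You can see those objects directly at the REPL (a quick illustration of my own, not from the blog post):

(class (map inc [1 2 3]))
;=> clojure.lang.LazySeq
(class (filter even? (map inc [1 2 3])))
;=> clojure.lang.LazySeq

Each transformation in a stacked pipeline wraps its input in another LazySeq, and realising elements from the outermost one allocates further LazySeq thunks as it walks inwards.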

I believe those LazySeq objects are the allocation overhead referred to in the quote. With reducers there is no gradual realisation of lazy seq elements and therefore no instantiation of LazySeq thunks at all.
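
A rough way to see the effect (not a rigorous benchmark; the name xs is mine, and timings depend on your machine and JVM warm-up):

(require '[clojure.core.reducers :as r])

(def xs (vec (range 1000000)))

;; Seq-based pipeline: builds LazySeq thunks step by step.
(time (reduce + (filter even? (map inc xs))))
;; Reducers pipeline: one eager reduce, no LazySeq allocation.
(time (reduce + (r/filter even? (r/map inc xs))))

Both return 250000500000; the second form is typically faster because it skips the per-step thunk allocation described above.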

glts
  • Thank you - this sounds more correct than the other answer to me. The other answer implies the whole seq is realized between each step, which can't be true. The allocation cost you describe makes more sense, and it's exactly the cost we don't want to pay, which is why reducers are faster. – zlatanski Mar 15 '16 at 22:00
  • I suppose you could read the first answer this way… I think it may be an unintended reading, but if it suggests itself strongly to you, then sure, that's misleading. Lazy seq processing happens either one element or one chunk (= a block of up to 32 elements) at a time, and new elements or chunks are realized when required. The correct point I thought the original answer was trying to make is that when you stack lazy seq transformations, each layer needs to allocate its own lazy seq object, with one "cell" per element or chunk (though of course they will be realized lazily). – Michał Marczyk Mar 16 '16 at 00:23
  • @MichałMarczyk: the first answer said "When you combine operations like map, filter then those functions iterate over a collection and **return a new collection which is then passed to the next function**" which sounds unambiguously wrong. nha also kept defending this single point in comments so I gave up arguing. Later on it seems that he/she patched up the answer to be more general. – zlatanski Mar 16 '16 at 12:54
  • They actually do each return a new collection, it's just that those collections are lazy seq objects, so they come to life in unrealized form and are only realized "from the outside in" once elements are pulled from the "outermost" lazy seq object. – Michał Marczyk Mar 16 '16 at 14:33
  • Admittedly the sentence you quoted may seem to suggest that map & Co. iterate over their inputs and *then* return new collections, which is not true – they return new collections immediately, embedding iteration logic within them (to be called when actual elements are requested). – Michał Marczyk Mar 16 '16 at 14:40