
Take the three functions below, the first implemented in Haskell and the other two in Clojure:

f :: [Int] -> Int
f = foldl1 (+) . map (*7) . filter even

(defn f [coll]
  ((comp
    (partial reduce +)
    (partial map #(* 7 %))
    (partial filter even?)) coll))

(defn f [coll]
  (transduce
    (comp
      (filter even?)
      (map #(* 7 %)))
    + coll))

When applied to a list like [1, 2, 3, 4, 5], they all return 42. I know the machinery behind the first two is similar, since map is lazy in Clojure, but the third uses transducers. Could someone show the intermediate steps in the execution of these functions?

Fuad Saud
  • I don't think `reduce` is a standard Haskell function. Although I suppose you probably mean one of [`foldr1`](https://hackage.haskell.org/package/base-4.9.0.0/docs/Data-Foldable.html#v:foldr1) or [`foldl1`](https://hackage.haskell.org/package/base-4.9.0.0/docs/Data-Foldable.html#v:foldl1). – Alec Sep 04 '16 at 02:48
  • You're correct, I mixed them up. – Fuad Saud Sep 04 '16 at 02:50
  • Depending on whether or not you compile with optimizations, the first is subject to stream fusion, so it will almost certainly end up running as one tight loop (no extra intermediate allocations). I believe that transducers give you similar (if not as powerful / no-overhead) benefits, so the third example should run sort of like the first. On the other hand, the second is almost certainly much less efficient - you'll end up building intermediate lists which are anyway garbage collected immediately. – Alec Sep 04 '16 at 03:10
  • Ok, so, I knew this was how Haskell behaved by default but I didn't know it was called "stream fusion". That's nice to know. Actually I thought that's how Clojure behaved as well, but it seems I was wrong? Anyway, is it valid to say: 1st and 3rd are `O(n)` while 2nd is `O(3n)`? – Fuad Saud Sep 04 '16 at 03:20
  • Well I wouldn't be so sure about that... the benefits and drawbacks are in space-complexity here. So the time benefits are observable through the time garbage collection takes, which isn't really something so predictable. – Alec Sep 04 '16 at 03:37
  • You're misusing big-O notation. Garbage collection is definitely one concern; another is cache utilization. If you build a giant list, most of it will fall out of cache before you're ready to start the next transformation. @Alec, this is a form of "short cut deforestation" called foldr/build fusion. Stream fusion usually refers to a different technique with similar goals. – dfeuer Sep 04 '16 at 09:02
  • @FuadSaud Any O(n) function is O(kn) for every positive natural number `k`, so there is *zero* distinction between O(n) and O(3n) or O(1000n). If you want to say "the code X performs *exactly* `n` computational steps/memory allocations/whatever thing you want to measure", you should say "code X has complexity `n` (or `n+k` if you want), while code Y has complexity `3n+k`", where both functions belong to the set O(n), which is the *same set of functions* as O(3n) and O(kn) for every positive `k`. – Bakuriu Sep 04 '16 at 10:16
  • @dfeuer Thanks for the correction! Where _is_ stream fusion actually used? [This](http://code.haskell.org/~dons/papers/icfp088-coutts.pdf) paper had me thinking it was applied to lists for GHC. There's also [this](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/haskell-beats-C.pdf) paper that talks about `vector`, but now I'm not sure about anything anymore! I'll have to actually read the source! – Alec Sep 04 '16 at 15:23
  • @Alec, the `vector` package is the big one I know about, which is sadly greatly complicated by various factors, but I think there's also one called `stream-fusion` or something. – dfeuer Sep 04 '16 at 16:48
  • @Bakuriu thanks for the correction, that's what I wanted to convey – Fuad Saud Sep 04 '16 at 18:43
  • @dfeuer IIRC `stream-fusion` is literally the implementation of the first stream fusion paper I referenced above. And that package hasn't been updated since 2013. :S Also, slightly off topic, but I just realized you are the person to ask for this: is there any literature/documentation on the fusion techniques in `containers`? – Alec Sep 04 '16 at 19:54
  • @Alec, I don't think there's any documentation, and I don't know about literature. `containers` doesn't have any sort of general fusion framework; it just implements a few simple fusion laws like `map/map`, `traverse/map`, `map/reverse`, etc. If you want more, let me know on GitHub. – dfeuer Sep 04 '16 at 20:21

1 Answer


The intermediate steps of the second and third examples are the same for this specific example. This is because map and filter are implemented as lazy transformations of a sequence into a sequence, as you're no doubt already aware.
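As a rough sketch of those steps (using the input [1, 2, 3, 4, 5] from the question), each stage sees the following values; the transducer version visits the same values, just without allocating the intermediate sequences:

(filter even? [1 2 3 4 5]) ;=> (2 4)
(map #(* 7 %) '(2 4))      ;=> (14 28)
(reduce + '(14 28))        ;=> 42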

The transducer versions of `map` and `filter` are defined using the same essential functionality as the non-transducer versions, except that the way they `conj` (or not, in the case of `filter`) onto the result stream is defined elsewhere. Indeed, if you look at the source for `map`, there are explicit data-structure construction functions in use, whereas the transducer variant uses no such functions -- they are passed in via `rf`. Explicitly using `cons` in the non-transducer versions means they're always going to be dealing with sequences.
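To make that concrete, here is a minimal sketch of a map-style transducer (a hypothetical `map-xf`, not clojure.core's actual source for map's transducer arity). Note that no sequence-construction function appears; all accumulation goes through the supplied `rf`:

;; A map-like transducer: rf is supplied by the consumer, so this code
;; never decides whether the result is a sequence, a vector, a channel, etc.
(defn map-xf [f]
  (fn [rf]
    (fn
      ([] (rf))                                  ;; init
      ([result] (rf result))                     ;; completion
      ([result input] (rf result (f input))))))  ;; step: no cons here

;; e.g. (transduce (map-xf inc) + [1 2 3]) ;=> 9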

IMO, the main benefit of using transducers is that you can define the process you're performing separately from the thing that will consume it. So perhaps a more interesting rewrite of your third example would be:

(def process (comp (filter even?)
                   (map #(* 7 %))))

(defn f [coll] (transduce process + coll))
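For example (an illustrative REPL session; `sequence` also accepts a transducer):

(f [1 2 3 4 5])                ;=> 42
(sequence process [1 2 3 4 5]) ;=> (14 28)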

It's up to the application author to decide when this sort of abstraction is necessary, but it can definitely open an opportunity for reuse.


It may occur to you that you can simply rewrite

(defn process [coll]
  ((comp
    (partial map #(* 7 %))
    (partial filter even?)) coll))

(reduce + (process coll))

and get the same effect; this is true. When your input is always a sequence (or always the same kind of stream, or you know what kind of stream it will be), there's arguably no good reason to create a transducer. But the power of reuse can be demonstrated here (assume process is the transducer defined above):

(chan 1 process)  ;; an async channel which runs process on all inputs

(into [] process coll)  ;; writing to a vector

(transduce process + coll)  ;; your goal
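Here is a runnable sketch of that reuse (the channel example assumes org.clojure/core.async is available; the other two only need clojure.core):

(require '[clojure.core.async :as a])

(def process (comp (filter even?) (map #(* 7 %))))

(into [] process [1 2 3 4 5])     ;=> [14 28]
(transduce process + [1 2 3 4 5]) ;=> 42

(def c (a/chan 1 process))  ;; the transducer runs on everything put onto c
(a/>!! c 4)                 ;; 4 is even, so 28 lands in the buffer
(a/<!! c)                   ;=> 28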

The motivation behind transducers was essentially to stop having to write new collection functions for different collection types. Rich Hickey mentions his frustration writing functions like `map<`, `map>`, `mapcat<`, `mapcat>`, and so on in the core.async library -- what `map` and `mapcat` do is already defined, but because they assume that they operate on sequences (that explicit `cons` mentioned above), they can't be applied to asynchronous channels. But channels can supply their own `rf` in the transducer version, letting them reuse these functions.
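For contrast, a small before/after sketch (`map<` is the deprecated pre-transducer core.async API, shown only for illustration):

(require '[clojure.core.async :as a])

(def in (a/chan 1))

;; Pre-transducers: a channel-specific variant for each operation.
(def out-old (a/map< inc in))       ;; deprecated

;; With transducers: plain clojure.core/map is reused via the channel's xform.
(def out-new (a/chan 1 (map inc)))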

matrix10657