F#: removing duplicates from a seq is slow

Question

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist: I need the last duplicate from a run of duplicates to make it into the resulting sequence. For example, if I have a sequence [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)], and I am using fun ((x1, y1),(x2, y2)) -> x1=x2 to check for equality, the result I want to see is [("a", 1); ("b", 4); ("c", 5)]. The point of this function is that I have data points coming in, where sometimes data points legitimately have the same timestamp, but I only care about the latest one, so I want to throw out the preceding ones with the same timestamp. The function I have implemented is as follows:

let rec dedupeTakingLast equalityFn prev s = seq {
     match ( Seq.isEmpty s ) with
     | true -> match prev with 
                 | None -> yield! s
                 | Some value -> yield value
     | false ->
         match prev with 
         | None -> yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s) 
         | Some prevValue -> 
             if not (equalityFn(prevValue, (Seq.head s))) then 
                 yield prevValue
             yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s)
}

In terms of actually doing the job, it works:

> [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)] 
  |> dedupeTakingLast (fun ((x1, y1),(x2, y2)) -> x1=x2) None 
  |> List.ofSeq;;
val it : (string * int) list = [("a", 1); ("b", 4); ("c", 5)]

However, in terms of performance, it's a disaster:

> #time
List.init 1000 (fun _ -> 1) 
|> dedupeTakingLast (fun (x,y) -> x = y) None 
|> List.ofSeq
#time;;    
--> Timing now on    
Real: 00:00:09.958, CPU: 00:00:10.046, GC gen0: 54, gen1: 1, gen2: 1
val it : int list = [1]    
--> Timing now off

Clearly I'm doing something very dumb here, but I cannot see what it is. Where is the performance hit coming from? I realise that there are better implementations, but I am specifically interested in understanding why this implementation is so slow.

EDIT: FYI, managed to eke out a decent implementation in the functional style that relies on Seq. functions only. Performance is OK, taking about 1.6x the time of the imperative-style implementation by @gradbot below that uses iterators. It seems that the root of the problem is the use of Seq.head and Seq.tail in recursive calls in my original effort.

let dedupeTakingLastSeq equalityFn s = 
    s 
    |> Seq.map Some
    |> fun x -> Seq.append x [None]
    |> Seq.pairwise
    |> Seq.map (fun (x,y) -> 
            match (x,y) with 
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
            | (_,None) -> x
            | _ -> None )
    |> Seq.choose id

I wonder if this would help you at all: https://github.com/Spreads/Spreads Unfortunately I never had the time to test it. — s952163, Apr 07 '16 at 09:06
I'm surprised this new version runs so fast. Each Seq function is wrapping another Enumerator around the result so their are at least 5 layers. It would be interesting to see the assembly generated. — gradbot, Apr 08 '16 at 03:12
Quite. My question really could have been: "what does `Seq.*` have that my homebrew `seq { ... yield... }` doesn't?". — mt99, Apr 08 '16 at 11:07
One tiny optimization to your solution: you can simplify the last two passages from `Seq.map [function] |> Seq.choose id` to just `Seq.choose [function]`. — piaste, Apr 08 '16 at 12:46
Nice. That (very empirically) shaves about 10% off the time for the 5x10^6-sized input seq. — mt99, Apr 08 '16 at 13:15

score 6 · Answer 1 · answered Apr 06 '16 at 20:27

6

The problem is with how you use sequences. All those yields, heads and tails are spinning a web of iterators branching off of iterators, and when you finally materialize it when you call List.ofSeq, you're iterating through your input sequence way more than you should.

Each of those Seq.heads is not simply taking the first element of a sequence - it's taking the first element of the tail of a sequence of a tail of a sequence of tail of a sequence and so on.

Check this out - it'll count the times the element constructor is called:

let count = ref 0

Seq.init 1000 (fun i -> count := !count + 1; 1) 
|> dedupeTakingLast (fun (x,y) -> x = y) None 
|> List.ofSeq

Incidentally, just switching out all the Seqs to Lists makes it go instantly.

answered Apr 06 '16 at 20:27

scrwtp

13,437
2
26
30

The trouble is that I _do_ have quite a lot to stuff into this function (gigabytes of logs), so no can do on the list front -- keeping the input a seq allows this deduplication to process the input as it is generated by upstream code instead of waiting for it to complete and then creating a gigantic input list just for this function. – mt99 Apr 06 '16 at 22:01
@mt99: You can still make sequence-based solution work - using `pairwise`, `windowed` or `fold`. It's just this kind of recursive solution that doesn't go well with seqs. – scrwtp Apr 06 '16 at 23:28
@mt99: otherwise, you could feed your logs in as event stream and use Observables, or process the files in chunks and merge them back afterwards. – scrwtp Apr 06 '16 at 23:30

gradbot · Accepted Answer · 2016-04-07T17:36:23.840

The performance issue comes from the nested calls to Seq.tail. Here's the source code to Seq.tail

[<CompiledName("Tail")>]
let tail (source: seq<'T>) =
    checkNonNull "source" source
    seq { use e = source.GetEnumerator() 
          if not (e.MoveNext()) then 
              invalidArg "source" (SR.GetString(SR.notEnoughElements))
          while e.MoveNext() do
              yield e.Current }

If you call Seq.tail(Seq.tail(Seq.tail(...))) the compiler has no way of optimizing out the enumerators that are created by GetEnumerator(). Subsequent returned elements have to go through every nested sequence and enumerator. This results in every returned element having to bubble up through all previously created sequences and all of these sequences have to increment their internal state as well. The net result is a running time of O(n^2) instead of linear O(n).

Unfortunately there is currently no way to represent this in a functional style in F#. You can with a list (x::xs) but not for a sequence. Until the language gets better native support for sequences, don't use Seq.tail in recursive functions.

Using a single enumerator will fix the performance problem.

let RemoveDuplicatesKeepLast equals (items:seq<_>) =
    seq {
        use e = items.GetEnumerator()

        if e.MoveNext() then
            let mutable previous = e.Current

            while e.MoveNext() do
                if not (previous |> equals e.Current) then 
                    yield previous
                previous <- e.Current

            yield previous
    }

let test = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
let FirstEqual a b = fst a = fst b

RemoveDuplicatesKeepLast FirstEqual test
|> printf "%A"

// output
// seq [("a", 1); ("b", 4); ("c", 5)]

The first version of this answer has a recursive version of the above code without mutation.

This approach of using iterators and dash of the imperative style is, as is often the case, the solution to questions such as my original one. It is certainly effective and fast, so thank you for this. However, I would like to understand is why writing this in a seemingly lazy, functional style using recursion results in such a vast difference in performance. So while I've marked your answer as "useful" (because it certainly is, although to be honest, in the imperative world this is not a complicated problem), I am afraid this doesn't answer my question. — mt99, Apr 07 '16 at 16:26
Thanks, that is what I was looking for. In the meantime we all generated quite a few deduplicators. I got something (see EDIT) that runs in circa 1.6-2x the time of your imperative solution, so not unhappy, but still wondering whether it's possible to match iterators in functional style... — mt99, Apr 08 '16 at 01:09

score 4 · Answer 3 · answered Apr 06 '16 at 20:20

4

Seq.isEmpty, Seq.head and Seq.tail are slow because they all create a new Enumerator instance which it then calls into. You end up with a lot of GC.

Generally, Sequences are forward only, and if you use them 'like pattern matching for lists', the performance becomes really shoddy.

Looking a bit at your code... | None -> yield! s creates a new Enumerator even though we know s is empty. Every recursive call probably ends up creating a new IEnumerable that is then directly turned into an Enumerator from the call-site with yield!.

answered Apr 06 '16 at 20:20

Henrik

9,714
5
53
87

OK, I see the point you make about _'like pattern matching for lists'_... but I don't know/see how else I can reason about the item at the front of the sequence. The problem is, like I mention in my other comments here, is that I can't really go for lists here (input v large, so I want to stick to the sequential processing idea. So what to do? My function doesn't inherently need to "keep hold of" any bits of the preceding sequence apart from the one in the `prev` argument -- it looks like it should be perfectly doable with a seq. – mt99 Apr 06 '16 at 22:14
You should use `scan` or `fold` to keep track of state, then it's fast and the compiler can reprogram it to a while-loop or a tail-recursive call. – Henrik Apr 07 '16 at 08:55

score 3 · Answer 4 · answered Apr 07 '16 at 06:48

3

I'm also looking forward to a non-seq answer. Here's another solution:

let t = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
t |> Seq.groupBy fst |> Seq.map (snd >>  Seq.last)

I tested on a 1M list:

Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : seq<int * int> = seq [(2, 2); (1, 1)]

answered Apr 07 '16 at 06:48

s952163

6,276
4
23
47

Couldn't you make this more generic by making it a function that allows a groupBy equality operator to be passed in with something like: `let f equality = function t -> t|> (Seq.groupBy equality) |> Seq.map (snd >> Seq.last)` – Ringil Apr 07 '16 at 10:56
That is a nice _functional_ approach, but thre are two things about it that don't make it right. Firstly, it does rely on never seeing a `("b",_)` once the run of `"b"`'s has finished. Try `[("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5); ("b", 7)]` as input to see what I mean -- I'd like to see `[("a", 1); ("b", 4); ("c", 5); ("b", 7)]`, but this approach results in `[("a", 1); ("b", 7); ("c", 5) ]`. Then, since it uses `Seq.groupBy`, it is not a "pipe-like" seq processor, which is what I am looking for. Run it with `Seq.init 50000000 (fun i -> (1,i))` and look at the Fsi.exe memory usage, – mt99 Apr 07 '16 at 14:51
I suspected that could be an issue... I frequently bump into memory or CPU related bottlenecks as well. the solution you outlined is actually very useful for me. Sorry for free-loading. :-) – s952163 Apr 07 '16 at 23:45

score 2 · Answer 5 · answered Apr 06 '16 at 20:46

As the other answers have said, seq are really slow. However, the real question is why do you want to use a seq here? In particular, you start with a list and you want to traverse the entire list and you want to create a new list at the end. There doesn't appear to be any reason to use a sequence at all unless you want to use sequence specific features. In fact, the docs state that (emphasis mine):

A sequence is a logical series of elements all of one type. Sequences are particularly useful when you have a large, ordered collection of data but do not necessarily expect to use all the elements. Individual sequence elements are computed only as required, so a sequence can provide better performance than a list in situations in which not all the elements are used.

This depends on the size of your input. In my case it is rather large so generating an input list is not a great idea. I don't necessarily want to create a list in the end, I just used `List.ofSeq` here to "materialise" the output for illustration purposes. I would actually like to end up with a seq that I can pipe into the next function along. — mt99, Apr 06 '16 at 22:06

piaste · Answer 6 · 2016-04-07T21:19:05.980

2

To make efficient use of the input type Seq, one should iterate through each element only once and avoid creating additional sequences.

On the other side, to make efficient use of the output type List, one should make liberal use of the cons and tail functions, both of which are basically free.

Combining the two requirements leads me to this solution:

// dedupeTakingLast2 : ('a -> 'a -> bool) -> seq<'a> -> 'a list
let dedupeTakingLast2 equalityFn = 
  Seq.fold 
  <| fun deduped elem ->     
       match deduped with
       | [] -> [ elem ]
       | x :: xs -> if equalityFn x elem 
                      then elem :: xs
                      else elem :: deduped
  <| []

Note however, that the outputted list will be in reverse order, due to list prepending. I hope this isn't a dealbreaker, since List.rev is a relatively expensive operation.

Test:

List.init 1000 (id) 
|> dedupeTakingLast2 (fun x y -> x - (x % 10) = y - (y % 10))
|> List.iter (printfn "%i ")

// 999 989 979 969 etc...

edited Apr 07 '16 at 21:19

answered Apr 07 '16 at 21:08

piaste

2,038
11
20

This is properly fast! But it does sadly return the sequence in reverse, which is a shame as, as you say, `Seq.rev` is reasonably expensive. Cool effort, even though I can't quite understand it. – mt99 Apr 08 '16 at 00:43
I don't have access to a F# shell now, but if you tried this code with `Seq.rev` and found it too slow, do try it with `List.rev` instead. The latter should be significantly faster than the former. – piaste Apr 08 '16 at 01:07
I have tried it -- it's not bad, but is a bit slower than then thing in my question EDIT. Unfortunately, as I mentioned in other comments, I can't have lists -- I need a "pipe-like" processing function, so can't wait for the end of the data to do a `.rev`. – mt99 Apr 08 '16 at 01:15

hocho · Answer 7 · 2016-04-08T04:47:31.370

2

Here is an implementation using mapFold but requires passing in a value not equal to the last value. Eliminates the need to write a recursive function. Should run faster but not tested.

let dedupe notLast equalityfn (s:'a seq) =
    [notLast]
    |> Seq.append s
    |>  Seq.mapFold
            (fun prev item  -> 
                if equalityfn prev item 
                    then (None, item)
                    else (Some(prev), item))
            (Seq.head s)
    |>  fst
    |>  Seq.choose id

let items = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)] 

let unique =     
    dedupe ("", 0) (fun (x1, x2) (y1, y2) -> x1 = y1) items 

printfn "%A" unique

edited Apr 08 '16 at 04:47

answered Apr 07 '16 at 21:53

hocho

1,753
1
11
14

1

Have a look at my EDIT of the main question -- I figured out something similar. The thing I had to do to avoid having a `notLast` was to `Seq.map Some` to the input and then append a `None` to the end (as it's guaranteed to not be the same as `Some whateverWasTheLastElement`). Incidentally, while doing this, I came across a cool way to pick out all the `Some`s in a seq/list: `|> Seq.choose id`. – mt99 Apr 07 '16 at 23:13

score 2 · Answer 8 · answered Sep 27 '19 at 06:59

Bit of an old question here, but I'm just looking for old examples to demonstrate a new library that I have been working on. It's a replacement for System.Linq.Enumerable, but also it has a wrapper to replace F#'s Seq. It's not complete yet, but it's polyfill'd up to match the existing APIs (i.e. incomplete material just forwards to existing functionality).

It is available in on nuget here: https://www.nuget.org/packages/Cistern.Linq.FSharp/

So I have taken your modified Seq from the bottom of your answer and "converted" it to Cistern.Linq.FSharp (which is just a search and replace of "Seq." for "Linq.") And then compared it's runtime to your original. The Cistern version runs at well under 50% of the time (I get ~41%).

open System
open Cistern.Linq.FSharp
open System.Diagnostics

let dedupeTakingLastCistern equalityFn s = 
    s 
    |> Linq.map Some
    |> fun x -> Linq.append x [None]
    |> Linq.pairwise
    |> Linq.map (fun (x,y) -> 
            match (x,y) with 
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
            | (_,None) -> x
            | _ -> None )
    |> Linq.choose id

let dedupeTakingLastSeq equalityFn s = 
    s 
    |> Seq.map Some
    |> fun x -> Seq.append x [None]
    |> Seq.pairwise
    |> Seq.map (fun (x,y) -> 
            match (x,y) with 
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
            | (_,None) -> x
            | _ -> None )
    |> Seq.choose id

let test data which f =
    let iterations = 1000

    let sw = Stopwatch.StartNew ()
    for i = 1 to iterations do
        data
        |> f (fun x y -> x = y)
        |> List.ofSeq    
        |> ignore
    printfn "%s %d" which sw.ElapsedMilliseconds


[<EntryPoint>]
let main argv =
    let data = List.init 10000 (fun _ -> 1)

    for i = 1 to 5 do
        test data "Seq" dedupeTakingLastSeq
        test data "Cistern" dedupeTakingLastCistern

    0

score 1 · Answer 9 · answered Apr 07 '16 at 00:25

Here is a pretty fast approach which uses library functions rather than Seq expressions.

Your test runs in 0.007 seconds on my PC.

It has a pretty nasty hack for the first element that doesn't work brilliantly that could be improved.

let rec dedupe equalityfn prev (s:'a seq) : 'a seq =
    if Seq.isEmpty s then
        Seq.empty
    else
        let rest = Seq.skipWhile (equalityfn prev) s
        let valid = Seq.takeWhile (equalityfn prev) s
        let valid2 = if Seq.isEmpty valid  then Seq.singleton prev else (Seq.last valid) |> Seq.singleton
        let filtered = if Seq.isEmpty rest then Seq.empty else dedupe equalityfn (Seq.head rest) (rest)
        Seq.append valid2 filtered

let t = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
        |> dedupe (fun (x1, y1) (x2, y2) -> x1=x2) ("asdfasdf",1)
        |> List.ofSeq;;

#time
List.init 1000 (fun _ -> 1)
|> dedupe (fun x y -> x = y) (189234784)
|> List.ofSeq
#time;;
--> Timing now on

Real: 00:00:00.007, CPU: 00:00:00.006, GC gen0: 0, gen1: 0
val it : int list = [189234784; 1]

--> Timing now off

This one has very interesting performance when there are no duplicates. Try `Seq.init 500 (fun i -> (i,i)) |> dedupe (fun a b -> fst a = fst b) (-1,0) |> Seq.last`. Takes three seconds on my box. My tests for other implementations on this page run at about 1.2s (iterator) 2.3s (my effort in the EDIT to the question) for 5000000-long inputs. — mt99, Apr 08 '16 at 00:56

F#: removing duplicates from a seq is slow

9 Answers9

Linked