I am trying to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a `seq<'a>`, but with a twist: I need the last duplicate from each run of duplicates to make it into the resulting sequence. For example, if I have the sequence `[("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]` and I am using `fun ((x1, y1),(x2, y2)) -> x1=x2` to check for equality, the result I want to see is `[("a", 1); ("b", 4); ("c", 5)]`.

The point of this function is that I have data points coming in, where sometimes data points legitimately have the same timestamp, but I only care about the latest one, so I want to throw out the preceding ones with the same timestamp. The function I have implemented is as follows:
    let rec dedupeTakingLast equalityFn prev s = seq {
        match ( Seq.isEmpty s ) with
        | true ->
            // End of input: emit the pending element, if there is one.
            match prev with
            | None -> yield! s
            | Some value -> yield value
        | false ->
            match prev with
            | None -> yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s)
            | Some prevValue ->
                // Only emit the pending element when the next one starts a new run.
                if not (equalityFn(prevValue, (Seq.head s))) then
                    yield prevValue
                yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s)
    }
In terms of actually doing the job, it works:
    > [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
      |> dedupeTakingLast (fun ((x1, y1),(x2, y2)) -> x1=x2) None
      |> List.ofSeq;;
    val it : (string * int) list = [("a", 1); ("b", 4); ("c", 5)]
However, in terms of performance, it's a disaster:
    > #time
      List.init 1000 (fun _ -> 1)
      |> dedupeTakingLast (fun (x,y) -> x = y) None
      |> List.ofSeq
      #time;;
    --> Timing now on
    Real: 00:00:09.958, CPU: 00:00:10.046, GC gen0: 54, gen1: 1, gen2: 1
    val it : int list = [1]
    --> Timing now off
Clearly I'm doing something very dumb here, but I cannot see what it is. Where is the performance hit coming from? I realise that there are better implementations, but I am specifically interested in understanding why this implementation is so slow.
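One rough way to probe this is to count how often the underlying source is asked for an element when a sequence is walked with nothing but `Seq.isEmpty` and `Seq.tail`, the way my function does. This is only a sketch: `pulls`, `countedSource` and `walk` are names made up for the experiment, and it assumes `Seq.tail` is a lazy wrapper that re-enumerates its source rather than a cheap pointer to the rest of the data.

```fsharp
// Hypothetical probe: count how many elements are pulled from the
// underlying source while walking it with Seq.isEmpty / Seq.tail only.
let pulls = ref 0

// Source that bumps the counter every time it produces an element.
let countedSource n =
    seq {
        for i in 1 .. n do
            pulls.Value <- pulls.Value + 1
            yield i
    }

// Walk the sequence the same way dedupeTakingLast does:
// test for emptiness, then recurse on the tail.
let rec walk s =
    if not (Seq.isEmpty s) then
        walk (Seq.tail s)

let n = 100
walk (countedSource n)
printfn "elements: %d, pulls from source: %d" n pulls.Value
```

If the assumption holds, the pull count should come out far larger than the element count, because every check on a k-times-nested tail has to restart from the beginning of the source and skip k elements before it can answer.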
EDIT: FYI, I managed to eke out a decent implementation in the functional style that relies only on `Seq` module functions. Performance is OK: it takes about 1.6x the time of the imperative-style implementation by @gradbot below, which uses iterators. It seems that the root of the problem is the use of `Seq.head` and `Seq.tail` in recursive calls in my original effort. In the version below, `equalityFn` is curried rather than tupled:
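    let dedupeTakingLastSeq equalityFn s =
        s
        |> Seq.map Some
        // Append a None sentinel so the final element also appears as the left side of a pair.
        |> (fun x -> Seq.append x [None])
        |> Seq.pairwise
        |> Seq.map (fun (x,y) ->
            match (x,y) with
            // Drop an element when its successor is a duplicate of it.
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)
            // The last real element is always kept.
            | (_,None) -> x
            | _ -> None )
        |> Seq.choose id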
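A quick sanity check of the `Seq`-based version on the example data from the top of the question, as a sketch. The definition is repeated so the snippet runs on its own, and `sameKey` is a helper name I made up for illustration; it is curried, since this version calls `equalityFn a b` rather than passing a tuple.

```fsharp
// Definition repeated from above so this snippet is self-contained.
let dedupeTakingLastSeq equalityFn s =
    s
    |> Seq.map Some
    |> (fun x -> Seq.append x [None])
    |> Seq.pairwise
    |> Seq.map (fun (x, y) ->
        match (x, y) with
        | (Some a, Some b) -> if equalityFn a b then None else Some a
        | (_, None) -> x
        | _ -> None)
    |> Seq.choose id

// Hypothetical helper: two data points are "equal" when their keys match.
let sameKey (k1, _) (k2, _) = k1 = k2

let result =
    [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    |> dedupeTakingLastSeq sameKey
    |> List.ofSeq

// The all-duplicates input from the timing test collapses to a single element.
let ones =
    List.init 1000 (fun _ -> 1)
    |> dedupeTakingLastSeq (=)
    |> List.ofSeq

printfn "%A" result
printfn "%A" ones
```

Passing the tupled lambda from the original call site here would not type-check, so any caller switching between the two versions needs to adjust the equality function accordingly.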