Summary:

I've written an efficient single-pass function which returns both the equivalent of List.distinct and a second List consisting of each element which appeared more than once, paired with the index at which the duplicate appeared.
Details:

If you need a bit more information about the duplicates themselves, as I did, I have written a more general function. It iterates across a List (ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct), and a second List showing each duplicate and the Int index at which it occurred within the original List.

I have implemented the function twice, based on the general performance characteristics of the Scala collections: filterDupesL (where the L is for Linear) and filterDupesEc (where the Ec is for Effectively Constant).
Here's the "Linear" function:

```scala
def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
  @scala.annotation.tailrec
  def recursive(
      remaining: List[A],
      index: Int = 0,
      accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
        remaining.tail,
        index + 1,
        if (accumulator._1.contains(remaining.head)) // contains is linear
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )

  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}
```
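As a quick sanity check, the first component of the result should always equal List.distinct. Here is a self-contained sketch (the object name FilterDupesLCheck and the small input list are my own; it embeds a copy of the function so it runs standalone):

```scala
object FilterDupesLCheck extends App {
  // Copy of filterDupesL from above so this sketch runs standalone.
  def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
    @scala.annotation.tailrec
    def recursive(
        remaining: List[A],
        index: Int = 0,
        accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
    ): (List[A], List[(A, Int)]) =
      if (remaining.isEmpty)
        accumulator
      else
        recursive(
          remaining.tail,
          index + 1,
          if (accumulator._1.contains(remaining.head))
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
        )

    val (distinct, dupes) = recursive(items)
    (distinct.reverse, dupes.reverse)
  }

  val xs = List("x", "y", "x", "z", "y") // duplicates at indexes 2 and 4
  val (deduped, dupeAndIndexes) = filterDupesL(xs)
  assert(deduped == xs.distinct)                     // List(x, y, z)
  assert(dupeAndIndexes == List(("x", 2), ("y", 4))) // each dup with its index
  println("ok")
}
```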
Below is an example which might make this a bit more intuitive. Given this List of String values:

```scala
val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")
```

...and then performing the following:

```scala
val (deduped, dupeAndIndexes) = filterDupesL(withDupes)
```

...the results are:

```scala
deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexes: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))
```
And if you just want the duplicates, simply map across dupeAndIndexes and invoke distinct:

```scala
val dupesOnly = dupeAndIndexes.map(_._1).distinct
```

...or all in a single call:

```scala
val dupesOnly = filterDupesL(withDupes)._2.map(_._1).distinct
```

...or, if a Set is preferred, skip distinct and invoke toSet:

```scala
val dupesOnly2 = dupeAndIndexes.map(_._1).toSet
```

...or all in a single call:

```scala
val dupesOnly2 = filterDupesL(withDupes)._2.map(_._1).toSet
```
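The distinct and toSet variants agree on content, differing only in the collection type returned. A minimal standalone sketch (with dupeAndIndexes written out literally, matching the example above, so no other code is needed):

```scala
object DupesOnlyCheck extends App {
  // dupeAndIndexes as produced in the example above, written out literally.
  val dupeAndIndexes = List(("a.c", 4), ("a.c", 6), ("a.b", 8))

  val dupesOnly  = dupeAndIndexes.map(_._1).distinct // List, preserves first-seen order
  val dupesOnly2 = dupeAndIndexes.map(_._1).toSet    // Set, order-free

  assert(dupesOnly == List("a.c", "a.b"))
  assert(dupesOnly2 == Set("a.c", "a.b"))
  println("ok")
}
```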
For very large Lists, consider using the more efficient version below, which adds a Set to make the contains check effectively constant time.

Here's the "Effectively Constant" function:
```scala
def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
  @scala.annotation.tailrec
  def recursive(
      remaining: List[A],
      index: Int = 0,
      seenAs: Set[A] = Set(),
      accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else {
      val (isInSeenAs, seenAsNext) = {
        val isInSeenA =
          seenAs.contains(remaining.head) // contains is effectively constant
        (
          isInSeenA,
          if (!isInSeenA) seenAs + remaining.head else seenAs
        )
      }
      recursive(
        remaining.tail,
        index + 1,
        seenAsNext,
        if (isInSeenAs)
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
    }

  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}
```
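Both implementations should produce identical results; only the cost of the membership check differs. A self-contained sketch checking filterDupesEc against the expected output from the earlier example (the object name FilterDupesEcCheck is my own):

```scala
object FilterDupesEcCheck extends App {
  // Copy of filterDupesEc from above so this sketch runs standalone.
  def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
    @scala.annotation.tailrec
    def recursive(
        remaining: List[A],
        index: Int = 0,
        seenAs: Set[A] = Set(),
        accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
    ): (List[A], List[(A, Int)]) =
      if (remaining.isEmpty)
        accumulator
      else {
        val isInSeenAs = seenAs.contains(remaining.head)
        recursive(
          remaining.tail,
          index + 1,
          if (isInSeenAs) seenAs else seenAs + remaining.head,
          if (isInSeenAs)
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
        )
      }

    val (distinct, dupes) = recursive(items)
    (distinct.reverse, dupes.reverse)
  }

  val withDupes =
    List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")
  val (deduped, dupeAndIndexes) = filterDupesEc(withDupes)
  // Same results as filterDupesL in the example above.
  assert(deduped == List("a.b", "a.c", "b.a", "b.b", "c.a", "d.b"))
  assert(dupeAndIndexes == List(("a.c", 4), ("a.c", 6), ("a.b", 8)))
  println("ok")
}
```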
Both of the above functions are adaptations of the filterDupes function in my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.