One way is this
list.distinct.size != list.size
Is there any better way? It would have been nice to have a containsDuplicates
method
One way is this
list.distinct.size != list.size
Is there any better way? It would have been nice to have a containsDuplicates
method
You can also write:
list.toSet.size != list.size
But the result will be the same because distinct
is already implemented with a Set
. In both case the time complexity should be O(n): you must traverse the list and Set
insertion is O(1).
Assuming "better" means "faster", see the alternative approaches benchmarked in this question, which seems to show some quicker methods (although note that distinct uses a HashSet and is already O(n)). YMMV of course, depending on specific test case, scala version etc. Probably any significant improvement over the "distinct.size" approach would come from an early-out as soon as a duplicate is found, but how much of a speed-up is actually obtained would depend strongly on how common duplicates actually are in your use-case.
If you mean "better" in that you want to write list.containsDuplicates
instead of containsDuplicates(list)
, use an implicit:
implicit def enhanceWithContainsDuplicates[T](s:List[T]) = new {
def containsDuplicates = (s.distinct.size != s.size)
}
assert(List(1,2,2,3).containsDuplicates)
assert(!List("a","b","c").containsDuplicates)
I think this would stop as soon as a duplicate was found and is probably more efficient than doing distinct.size
- since I assume distinct
keeps a set as well:
@annotation.tailrec
def containsDups[A](list: List[A], seen: Set[A] = Set[A]()): Boolean =
list match {
case x :: xs => if (seen.contains(x)) true else containsDups(xs, seen + x)
case _ => false
}
containsDups(List(1,1,2,3))
// Boolean = true
containsDups(List(1,2,3))
// Boolean = false
I realize you asked for easy and I don't now that this version is, but finding a duplicate is also finding if there is an element that has been seen before:
def containsDups[A](list: List[A]): Boolean = {
list.iterator.scanLeft(Set[A]())((set, a) => set + a) // incremental sets
.zip(list.iterator)
.exists{ case (set, a) => set contains a }
}
@annotation.tailrec
def containsDuplicates [T] (s: Seq[T]) : Boolean =
if (s.size < 2) false else
s.tail.contains (s.head) || containsDuplicates (s.tail)
I didn't measure this, and think it is similar to huynhjl's solution, but a bit more simple to understand.
It returns early, if a duplicate is found, so I looked into the source of Seq.contains, whether this returns early - it does.
In SeqLike, 'contains (e)' is defined as 'exists (_ == e)', and exists is defined in TraversableLike:
def exists (p: A => Boolean): Boolean = {
var result = false
breakable {
for (x <- this)
if (p (x)) { result = true; break }
}
result
}
I'm curious how to speed things up with parallel collections on multi cores, but I guess it is a general problem with early-returning, while another thread will keep running, because it doesn't know, that the solution is already found.
Summary:
I've written a very efficient function which returns both List.distinct
and a List
consisting of each element which appeared more than once and the index at which the element duplicate appeared.
Note: This answer is a straight copy of the answer on a related question.
Details:
If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List
(as ordering was significant) exactly once and returns a Tuple2
consisting of the original List
deduped (all duplicates after the first are removed; i.e. the same as invoking distinct
) and a second List
showing each duplicate and an Int
index at which it occurred within the original List
.
Here's the function:
def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
def recursive(remaining: List[A], index: Int, accumulator: (List[A], List[(A, Int)])): (List[A], List[(A, Int)]) =
if (remaining.isEmpty)
accumulator
else
recursive(
remaining.tail
, index + 1
, if (accumulator._1.contains(remaining.head))
(accumulator._1, (remaining.head, index) :: accumulator._2)
else
(remaining.head :: accumulator._1, accumulator._2)
)
val (distinct, dupes) = recursive(items, 0, (Nil, Nil))
(distinct.reverse, dupes.reverse)
}
An below is an example which might make it a bit more intuitive. Given this List of String values:
val withDupes =
List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")
...and then performing the following:
val (deduped, dupeAndIndexs) =
filterDupes(withDupes)
...the results are:
deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))
And if you just want the duplicates, you simply map
across dupeAndIndexes
and invoke distinct
:
val dupesOnly =
dupeAndIndexs.map(_._1).distinct
...or all in a single call:
val dupesOnly =
filterDupes(withDupes)._2.map(_._1).distinct
...or if a Set
is preferred, skip distinct
and invoke toSet
...
val dupesOnly2 =
dupeAndIndexs.map(_._1).toSet
...or all in a single call:
val dupesOnly2 =
filterDupes(withDupes)._2.map(_._1).toSet
This is a straight copy of the filterDupes
function out of my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._
.
If you're trying to check for duplicates in a test then ScalaTest can be helpful.
import org.scalatest.Inspectors._
import org.scalatest.Matchers._
forEvery(list.distinct) { item =>
withClue(s"value $item, the number of occurences") {
list.count(_ == item) shouldBe 1
}
}
// example:
scala> val list = List(1,2,3,4,3,2)
list: List[Int] = List(1, 2, 3, 4, 3, 2)
scala> forEvery(list) { item => withClue(s"value $item, the number of occurences") { list.count(_ == item) shouldBe 1 } }
org.scalatest.exceptions.TestFailedException: forEvery failed, because:
at index 1, value 2, the number of occurences 2 was not equal to 1 (<console>:19),
at index 2, value 3, the number of occurences 2 was not equal to 1 (<console>:19)
in List(1, 2, 3, 4)