
I have a list of unsorted integers and I want to find those elements which have duplicates.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

I can find the distinct elements of the list with dup.distinct, so I wrote my answer as follows.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
val distinct = dup.distinct
val elementsWithCounts = distinct.map( (a:Int) => (a, dup.count( (b:Int) => a == b )) )
val duplicatesRemoved = elementsWithCounts.filter( (pair: (Int, Int)) => { pair._2 <= 1 } )
val withDuplicates = elementsWithCounts.filter( (pair: (Int, Int)) => { pair._2 > 1 } )

Is there an easier way to solve this?

Jacek Laskowski
Phil

7 Answers


Try this:

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup.groupBy(identity).collect { case (x, List(_,_,_*)) => x }

The groupBy associates each distinct integer with a list of its occurrences. The collect is basically a map in which non-matching elements are ignored. The pattern following case matches each integer x that is associated with a list fitting List(_,_,_*): a list with at least two elements, each represented by an underscore since we don't actually need those values, possibly followed by zero or more further elements (_*).
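To make the intermediate steps concrete, here is a REPL-style sketch on the sample list (the commented results assume the dup list from the question; Map ordering is unspecified):

```scala
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

// each distinct value mapped to the list of its occurrences
val grouped = dup.groupBy(identity)
// e.g. grouped(1) == List(1, 1, 1), grouped(2) == List(2)

// keep only the keys whose occurrence list has at least two elements
val dups = grouped.collect { case (x, List(_, _, _*)) => x }
// contains 1, 5 and 101 (in unspecified order)
```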

You could also do:

dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => x }

It's much faster than the approach you provided since it doesn't have to repeatedly pass over the data.

dhg
    `List(_,_,_*)` could be replaced by `_::_::_`. Whether or not that is clearer depends. Also `ys.size > 1` can be replaced with `ys.lengthCompare(1) > 0` to avoid traversing the entire list of duplicates just to find the size. – The Archetypal Paul Jul 14 '14 at 06:46
  • Really required the explanation - thanks! With that, it's actually rather straightforward. Learned the '_*' trick - nice! – akauppi Jul 30 '15 at 14:03
  • 1
    Note: for some reason, this code failed to detect my duplicates (from a list of filenames), but @LuigiPlinge's answer worked. – akauppi Jul 30 '15 at 14:13
  • 2
    I wanted to extract the duplicates and the # of times they were repeated, the following worked well and was readable for me: `dup.groupBy(identity).map(t => (t._1, t._2.size))`. Result: `Map(101 -> 2, 5 -> 2, 1 -> 3, 6 -> 1, 102 -> 1, 2 -> 1, 3 -> 1, 4 -> 1, 100 -> 1)` – Ali Nov 29 '17 at 18:20
  • @akauppi Maybe you had the same issue I just had. The partial function `case (x, List(_,_,_*))` wouldn't match anything, because my input `Seq` was a Buffer and thus I had to match using `case (x, Buffer(_,_,_*))` instead... Really annoying to track down. So I don't really like this approach, it's too much implementation dependant. – Etienne Neveu Mar 16 '18 at 10:28
  • `dup.groupBy(identity).collect { case (x, items) if items.size > 1 => x }` should work too and it's easier to read, IMO. – Alvaro Mendez Jan 27 '21 at 19:50

A bit late to the party, but here's another approach:

dup.diff(dup.distinct).distinct

diff removes one occurrence of each element of its argument from the receiver, so dup.diff(dup.distinct) leaves exactly the surplus occurrences, i.e. the duplicates; the final distinct collapses values that occur more than twice.
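Spelled out on the sample list from the question, the intermediate result looks like this:

```scala
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

// diff removes one occurrence per element of the argument,
// so only the surplus occurrences remain:
val extras = dup.diff(dup.distinct)   // List(1, 1, 5, 101)

// a final distinct collapses values that occurred more than twice:
val dups = extras.distinct            // List(1, 5, 101)
```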

Luigi Plinge

Another approach is to use foldLeft and do it the hard way.

We start with two empty sets. One is for elements that we have seen at least once; the other is for elements that we have seen at least twice (aka duplicates).

We traverse the list. When the current element has already been seen (seen(cur)), it is a duplicate and is therefore added to duplicates; otherwise we add it to seen. The second component of the result is then the set of duplicates.

We can also write this as a generic method.

def dups[T](list: List[T]) = list.foldLeft((Set.empty[T], Set.empty[T])){ case ((seen, duplicates), cur) => 
  if(seen(cur)) (seen, duplicates + cur) else (seen + cur, duplicates)      
}._2

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

dups(dup) //Set(1,5,101)
Kigyo

Summary: I've written a very efficient function which returns both the equivalent of List.distinct and a List of each element that appeared more than once, paired with the index at which the duplicate occurrence appeared.

Details: If you need a bit more information about the duplicates themselves, like I did, here is a more general function which iterates across a List exactly once (ordering was significant in my case). It returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and the Int index at which it occurred within the original List.

I have implemented the function twice based on the general performance characteristics of the Scala collections; filterDupesL (where the L is for Linear) and filterDupesEc (where the Ec is for Effectively Constant).

Here's the "Linear" function:

def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
  @annotation.tailrec
  def recursive(
      remaining: List[A],
      index: Int = 0,
      accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
        remaining.tail,
        index + 1,
        if (accumulator._1.contains(remaining.head)) //contains is linear
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

Below is an example which might make it a bit more intuitive. Given this List of String values:

val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:

val (deduped, dupeAndIndexs) =
  filterDupesL(withDupes)

...the results are:

deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

And if you just want the duplicates, simply map across dupeAndIndexs and invoke distinct:

val dupesOnly =
  dupeAndIndexs.map(_._1).distinct

...or all in a single call:

val dupesOnly =
  filterDupesL(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...

val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet

...or all in a single call:

val dupesOnly2 =
  filterDupesL(withDupes)._2.map(_._1).toSet

For very large Lists, consider this more efficient version, which uses an additional Set to make the contains check effectively constant time:

Here's the "Effectively Constant" function:

def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
  @annotation.tailrec
  def recursive(
      remaining: List[A],
      index: Int = 0,
      seenAs: Set[A] = Set(),
      accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else if (seenAs.contains(remaining.head)) //contains is effectively constant
      recursive(
        remaining.tail,
        index + 1,
        seenAs,
        (accumulator._1, (remaining.head, index) :: accumulator._2)
      )
    else
      recursive(
        remaining.tail,
        index + 1,
        seenAs + remaining.head,
        (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

Both of the above functions are adaptations of the filterDupes function in my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.

chaotic3quilibrium
def findDuplicates[T](seq: Seq[T]): Seq[T] = {
  seq.groupMapReduce(identity)(_ => false)((_, _) => true).filter(_._2).keys.toSeq
}

We start by associating every element with false (the map phase), and as soon as a duplicate is found, we associate it with true (the reduce phase). This leverages the fact that the reduce phase runs for an element only if that element occurs more than once. Then we keep only the elements associated with true (the filter phase).

This works with all implementations of trait Seq.

The time and space complexity are both O(n).
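The phases can be inspected on the sample list from the question (a sketch; groupMapReduce requires Scala 2.13+, and the Map's iteration order is unspecified):

```scala
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

// the map phase tags every value false; the reduce phase flips it to
// true the moment a second occurrence of the same key arrives
val flagged = dup.groupMapReduce(identity)(_ => false)((_, _) => true)
// flagged(1) == true, flagged(2) == false

// the filter phase keeps only the keys flagged as duplicates
val dups = flagged.filter(_._2).keys.toSeq
// contains 1, 5 and 101 (in unspecified order)
```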

Comparing with other answers:

  1. @dhg's answer does not work with all kinds of sequences. E.g. it works with List but will produce incorrect results for Array (the List pattern match will not match an Array). Also, although it has the same time and space complexity, it creates a List for each element containing all the duplicates of that element, only to check that the list has more than one element, so the memory footprint and runtime are higher.
  2. @Luigi Plinge's answer is nice and elegant. However, it creates an intermediate hashmap-like data structure 3 times (once for diff and twice for distinct), so the runtime is likely to be higher.
SlavaSt
  • this is the clearest and most efficient solution IMHO, my question is why not simply check the length of the group, like this `seq.groupBy(identity).filter(_._2.length > 1).keys.toSeq`? – David Perlman Aug 02 '23 at 08:20
  • @DavidPerlman You can do that, but it is less efficient. It may or may not be a concern depending on your use case. `seq.groupBy(identity)` would for each value create a `Seq` containing all the duplicates. We don't use that sequence, except to check its length. In contrast, `groupMapReduce` would not create such a sequence. – SlavaSt Aug 27 '23 at 15:01

I'm joining the one-liner party.

How about this (the list must be sorted, since only adjacent elements are compared; the trailing distinct collapses values that occur more than twice):

(myDupList zip myDupList.tail).filter { case (m, n) => m == n }.map(_._1).distinct
Rohlik

My favorite is

def hasDuplicates(in: List[Int]): Boolean = {
  val sorted = in.sorted
  // after sorting, any duplicate values are adjacent;
  // drop(1) rather than tail keeps an empty list safe
  val zipped = sorted.drop(1).zip(sorted)
  zipped.exists(p => p._1 == p._2)
}
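As the comments below note, this only tests for duplicates rather than returning them. A small variation on the same sort-and-zip idea extracts them (a sketch; findDuplicatesSorted is a hypothetical name, not from the answer):

```scala
def findDuplicatesSorted(in: List[Int]): List[Int] = {
  val sorted = in.sorted
  // after sorting, equal values sit next to each other, so pairing the
  // list with its own tail exposes each duplicate as an equal pair
  // (drop(1) instead of tail keeps the empty list safe)
  sorted.drop(1).zip(sorted).collect { case (a, b) if a == b => a }.distinct
}
```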
JNYRanger
  • Welcome to SO! We usually recommend adding explanations to your code so the person asking the question can better understand your answer. Try to avoid code-only answers when possible! – JNYRanger Jun 27 '17 at 17:51
  • This doesn't do what the OP asks, which is to find the duplicates, not just test that there are some – The Archetypal Paul Jun 28 '17 at 06:56