
I have an ArrayList of Strings, and I want to find and return all values which exist more than once in the list. Most examples cover the opposite case (removing the duplicate items, like distinct()), so example code is hard to come by.

I was able to come up with this:

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

But this uses Collections.frequency, which loops through the whole list for each item and counts every occurrence. This takes about 150ms to run on my current list of about 4,000 strings. This is a bit slow for me and will only get worse as the list size increases. I took the frequency method and rewrote it to return immediately on the 2nd occurrence:

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    return found;
                } else {
                    found = true;
                }
            }
        }
    }
    return found;
}

and changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings) {
    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): "+ elapsedTime);

    return duplicates;
}

This seems to work as expected, but does not really increase the speed as much as I was hoping, clocking in at about 120ms. This is probably due to it also needing to loop through the whole list for each item, but I am not sure how to avoid that and still accomplish the task.

I know this might seem like premature optimization, but my List can easily be 1mil+, and this method is a critical piece of my app that influences the timing of everything else.

Do you see any way that I could further optimize this code? Perhaps using some sort of fancy Predicate? An entirely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something significantly faster:

public synchronized Set<String> listMatching(List<String> allStrings) {

    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());

    return duplicates;
}

Running under the same conditions, this is able to go through my list in <5ms. All the HashMap suggestions would have been great though, if I had needed to know the counts. Not sure why the Collections.frequency() method doesn't use that technique.

Jonathon Hoaglin
  • you could use a map, with your objects as keys and a counter variable as the value. In the end you just get the keys which have a value `> 1`. But this may be as slow as your solution or even slower – Lino Aug 23 '17 at 17:03
  • "Do you see any way that I could further optimize this code?" You can use a HashMap where the String is what you are storing and the Integer is its frequency. Your moreThanOne check would then be O(1) compared to O(n) now. I think it will reduce the time greatly – Anand Undavia Aug 23 '17 at 17:05
  • if you can *remove* them, you can store them in a new list, which would be the best solution. –  Aug 23 '17 at 18:00
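As a sketch of the counting idea from these comments (my own illustration, not code from the thread): groupingBy/counting builds the frequency map in one pass, though note the map holds every distinct string, not just the duplicates.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DuplicateCounter {
    // Count occurrences of each string, then keep those seen more than once.
    static Set<String> duplicates(List<String> allStrings) {
        Map<String, Long> counts = allStrings.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // "a" and "b" each appear twice; "c" appears once.
        System.out.println(duplicates(Arrays.asList("a", "b", "a", "c", "b")));
    }
}
```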

3 Answers


An easy way to find duplicates is to iterate over the list and use the add() method to add each item to a temporary Set. add() returns false if the item already exists in the set.

public synchronized Set<String> listMatching(List<String> allStrings) {
    Set<String> tempSet = new HashSet<>();
    Set<String> duplicates = new HashSet<>();

    allStrings.forEach(item -> {
        if (!tempSet.add(item)) duplicates.add(item);
    });

    return duplicates;
}
dspano

A good way to make this really scalable is to build a Map that contains the count of each string. To build the map, you will look up each string in your list. If the string is not yet in the map, put the string and a count of one in the map. If the string is found in the map, increment the count.

You probably want to use some type that allows you to increment the count in-place, rather than having to put() the updated count each time. For example, you can use an int[] with one element.

The other advantage of not re-putting counts is that it is easy to execute in parallel, because you can synchronize on the object that contains your count when you want to read/write the count.
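A parallel-friendly variant (my own illustration, not this answer's code) can skip explicit synchronization entirely by pairing ConcurrentHashMap with LongAdder counters:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ParallelCount {
    // computeIfAbsent on ConcurrentHashMap is atomic, and LongAdder
    // handles concurrent increments without a lock in the caller.
    static Map<String, LongAdder> countParallel(List<String> strings) {
        Map<String, LongAdder> counts = new ConcurrentHashMap<>();
        strings.parallelStream()
                .forEach(s -> counts.computeIfAbsent(s, k -> new LongAdder()).increment());
        return counts;
    }

    public static void main(String[] args) {
        Map<String, LongAdder> counts =
                countParallel(Arrays.asList("a", "b", "a", "a", "c"));
        System.out.println(counts.get("a").sum()); // prints 3
    }
}
```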

The non-parallel code might look something like this:

Map<String, int[]> map = new HashMap<>(listOfStrings.size());
for (String s: listOfStrings) {
    int[] curCount = map.get(s);
    if (curCount == null) {
        curCount = new int[1];
        curCount[0] = 1;
        map.put(s, curCount);
    } else {
        curCount[0]++;
    }
}

Then you can iterate over the map entries and do the right thing based on the count of each string.
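Concretely, filtering the map for counts greater than one might look like this (a self-contained sketch reusing the int[]-counter idea above; the sample data is illustrative):

```java
import java.util.*;

public class CountFilter {
    // Build the int[]-counter map as described above, then keep
    // only the strings whose count exceeds 1.
    static List<String> duplicates(List<String> listOfStrings) {
        Map<String, int[]> map = new HashMap<>(listOfStrings.size());
        for (String s : listOfStrings) {
            int[] curCount = map.get(s);
            if (curCount == null) {
                map.put(s, new int[] { 1 });
            } else {
                curCount[0]++;
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, int[]> e : map.entrySet()) {
            if (e.getValue()[0] > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(duplicates(Arrays.asList("x", "y", "x", "z", "x"))); // prints [x]
    }
}
```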

Rob

The best data structure here is a Set<String>:

Add all the elements from the list to the set.

Then traverse the list again, deleting each element from the set as you go.

If an element is not found in the set, it is a duplicate in the list (because it was already deleted on an earlier pass).

This takes O(n) + O(n) time.

Code:

    List<String> list = new ArrayList<>();
    List<String> duplicates = new ArrayList<>();

    list.add("luna");
    list.add("mirana");
    list.add("mirana");
    list.add("mirana");

    Set<String> set = new HashSet<>();
    set.addAll(list);
    for(String a:list){
        if(set.contains(a)){
            set.remove(a);
        }else{
            duplicates.add(a);
        }
    }
    System.out.println(duplicates);

Output

[mirana, mirana]
nagendra547
  • With duplicates collection being a List, you will end up with multiple copies of a string if it appears more than twice in the original list. Making duplicates a Set fixes that issue, but you then need to turn the set into a list (if that really is a requirement). – Rob Aug 23 '17 at 17:23
  • Here we don't want to change the original list, that's why I am making a duplicate here. If we can change the original list, then simply don't store in a duplicate, but remove from the original list while looping over the set. Hope it's clear. – nagendra547 Aug 23 '17 at 17:26
  • I prefer the Set approach (vs. Map in another answer) if you know that duplicates will be rare, as you do not end up with a collection that contains an entry for all of the non-duplicates. – Rob Aug 23 '17 at 17:27
  • My comment was based on the assumption that the OP wants your example to produce [mirana] rather than outputting it twice. I re-read the question and it is not clear to me exactly what output is desired in this case. Either way, that detail is easy to code correctly either way. – Rob Aug 23 '17 at 17:33
  • If the user wants to display each duplicate only once, then instead of a duplicates List, he/she can go with a duplicates Set, which will contain unique items. – nagendra547 Aug 23 '17 at 17:36
  • Why the downvote? – nagendra547 Aug 24 '17 at 04:19