How to find duplicate values in a list and merge them

Question

For example, if you have a list like:

l = ['a','b','a','b','c','c']

The output should be:

[['a','a'],['b','b'],['c','c']]

So basically, put the duplicated values together into sublists.

I tried:

l = ['a','b','a','b','c','c']
it=iter(sorted(l))
next(it)
new_l=[]
for i in sorted(l):
   new_l.append([])
   if next(it,None)==i:
      new_l[-1].append(i)
   else:
      new_l.append([])

But it doesn't work, and even if it did, it would not be efficient.

Chris_Rands · Accepted Answer · 2018-10-12T09:03:00.910

4

Sort the list then use itertools.groupby:

>>> from itertools import groupby
>>> l = ['a','b','a','b','c','c']
>>> [list(g) for _, g in groupby(sorted(l))]
[['a', 'a'], ['b', 'b'], ['c', 'c']]

EDIT: this is probably not the fastest approach. Sorting has O(n log n) average-case time complexity and is not required by all solutions (see the comments).
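To illustrate why the sort comes first, here is a small sketch (not part of the original answer) comparing `groupby` on the unsorted and the sorted list. `groupby` only merges runs of adjacent equal items, so the variable names below are just for illustration:

```python
from itertools import groupby

l = ['a', 'b', 'a', 'b', 'c', 'c']

# Without sorting, groupby only merges adjacent equal items.
unsorted_groups = [list(g) for _, g in groupby(l)]
# [['a'], ['b'], ['a'], ['b'], ['c', 'c']]

# After sorting, equal items are adjacent, so each value forms one group.
sorted_groups = [list(g) for _, g in groupby(sorted(l))]
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
```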

edited Oct 12 '18 at 09:03

answered Oct 12 '18 at 08:52

Chris_Rands


1

This requires an average time complexity of O(n log n), however. – blhsing Oct 12 '18 at 08:54
1

@blhsing Yes, I know. I'm not actually sure this is the best solution; it was just my first thought (one needs to be quick on SO). I will defer judgement to a `timeit` benchmark – Chris_Rands Oct 12 '18 at 08:56
1

@Chris_Rands It's known that Python's `sorted` function has an average time complexity of O(n log n). – blhsing Oct 12 '18 at 08:57
1

@blhsing yes you just said that, I agree :) – Chris_Rands Oct 12 '18 at 09:01
Accepted. I didn't realize `itertools.groupby` could do this much :-) – U13-Forward Oct 12 '18 at 09:06
2

@U9-Forward Thanks but I'm not convinced this is the best way, Austin or Blhsing's solutions might be faster, and will retain the order if the `OrderedCounter` recipe is added – Chris_Rands Oct 12 '18 at 09:08
@Chris_Rands or if the Python version remembers dict insertion order, i.e. 3.6 and above. – timgeb Oct 12 '18 at 09:10
@timgeb indeed or 3.7 and above for the guarantee across all python implementations – Chris_Rands Oct 12 '18 at 09:14
@timgeb BTW I think the reason I thought of this first, and why it seems so intuitive, is that it follows a typical command-line (unix) pattern of `sort | uniq -c` – Chris_Rands Oct 12 '18 at 09:22

score 4 · Answer 2 · edited Oct 14 '21 at 16:58

You can use collections.Counter:

from collections import Counter
[[k] * c for k, c in Counter(l).items()]

This returns:

[['a', 'a'], ['b', 'b'], ['c', 'c']]

`%%timeit` comparison

Given a sample dataset of 100000 values, this answer is the fastest approach.
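The original timing output is not reproduced here. As a rough sketch of how such a comparison could be rerun with the standard `timeit` module, the following uses assumed test data (100000 random lowercase letters, not the dataset from the answer) and hypothetical function names; actual timings will vary by machine and input:

```python
import random
import timeit
from collections import Counter
from itertools import groupby

# Assumed test data: 100000 random lowercase letters (not the original dataset).
random.seed(0)
data = [random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(100_000)]

def with_groupby(values):
    # Sort first so equal values are adjacent, then group them.
    return [list(g) for _, g in groupby(sorted(values))]

def with_counter(values):
    # Count each value once, then rebuild the groups from the counts.
    return [[k] * c for k, c in Counter(values).items()]

# Both approaches build the same groups, possibly in a different order.
assert sorted(with_groupby(data)) == sorted(with_counter(data))

for func in (with_groupby, with_counter):
    seconds = timeit.timeit(lambda: func(data), number=10)
    print(f"{func.__name__}: {seconds / 10:.4f} s per call")
```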

edited Oct 14 '21 at 16:58

Trenton McKinney


answered Oct 12 '18 at 08:54

blhsing


1

Works too, nice – U13-Forward Oct 12 '18 at 08:56
3

Note that `Counter()` has an average time complexity of O(n). – blhsing Oct 12 '18 at 09:00

Austin · Answer 3 · 2018-10-12T08:56:57.350

4

Use collections.Counter:

from collections import Counter

l = ['a','b','a','b','c','c']
c = Counter(l)

print([[x] * y for x, y in c.items()])
# [['a', 'a'], ['b', 'b'], ['c', 'c']]

edited Oct 12 '18 at 08:56

answered Oct 12 '18 at 08:54

Austin


1

Works too, nice – U13-Forward Oct 12 '18 at 08:56
3

This is the best solution. Easy to read and does not require sorting (if you use a Python version where dicts remember insertion order). – timgeb Oct 12 '18 at 09:02
@timgeb Agreed! Although of course sorting and retaining the insertion order are not always going to produce the same output (although they do for this data); I don't know for sure what the OP actually wants – Chris_Rands Oct 12 '18 at 09:25
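The ordering point raised in these comments can be seen with a different input. This is a small sketch with made-up data, assuming Python 3.7+ where `Counter` (a dict subclass) keeps keys in first-seen order:

```python
from collections import Counter
from itertools import groupby

l = ['b', 'a', 'c', 'b', 'a', 'c']

# Counter keeps keys in first-seen order (dict insertion order, Python 3.7+).
by_first_seen = [[k] * c for k, c in Counter(l).items()]
# [['b', 'b'], ['a', 'a'], ['c', 'c']]

# sorted + groupby returns the groups in sorted order instead.
by_sorted = [list(g) for _, g in groupby(sorted(l))]
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
```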

jpp · Answer 4 · 2018-10-12T09:04:44.100

1

Here's a functional solution via itertools.groupby. As it requires sorting, this will have time complexity O(n log n).

from itertools import groupby
from operator import itemgetter

L = ['a','b','a','b','c','c']

res = list(map(list, map(itemgetter(1), groupby(sorted(L)))))

print(res)
# [['a', 'a'], ['b', 'b'], ['c', 'c']]

The syntax is cumbersome since Python does not offer native function composition. Composition is supported by the third-party library toolz:

from toolz import compose

foo = compose(list, itemgetter(1))
res = list(map(foo, groupby(sorted(L))))
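If installing toolz is not an option, a similar helper can be sketched with the standard library. The `compose` function below is a hypothetical stand-in written for this example, not the toolz implementation; it applies its arguments right to left, as in the snippet above:

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def compose(*funcs):
    """Hypothetical helper: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

L = ['a', 'b', 'a', 'b', 'c', 'c']

# Take the group iterator (item 1 of each groupby pair) and turn it into a list.
group_to_list = compose(list, itemgetter(1))
res = list(map(group_to_list, groupby(sorted(L))))
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
```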

edited Oct 12 '18 at 09:04

answered Oct 12 '18 at 08:54

jpp


1

Works too, nice – U13-Forward Oct 12 '18 at 08:56

score 1 · Answer 5 · answered Oct 12 '18 at 09:02

Another approach is to use the built-in zip function.

l = ['a','b','a','b','c','c','b','c', 'a']
l = sorted(l)
grouped = [list(item) for item in list(zip(*[iter(l)] * l.count(l[0])))]

Output

[['a', 'a', 'a'], ['b', 'b', 'b'], ['c', 'c', 'c']]
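One caveat worth checking before relying on this: the chunk size is taken from the count of the first value only, so the approach appears to assume that every value occurs the same number of times. A small sketch with made-up data where that assumption does not hold:

```python
l = sorted(['a', 'b', 'a', 'b', 'c', 'c', 'c'])  # 'c' appears three times

# Chunk size is the count of the first value, here l.count('a') == 2.
chunk_size = l.count(l[0])
grouped = [list(item) for item in zip(*[iter(l)] * chunk_size)]
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
# The third 'c' is silently dropped because zip stops at the last full chunk.
```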

answered Oct 12 '18 at 09:02

Mihai Alexandru-Ionut


Works too, nice – U13-Forward Oct 12 '18 at 09:02

nikeros · Answer 6 · 2021-10-13T15:49:46.727

My solution using a list comprehension would be (l is a list):

[l.count(x) * [x] for x in set(l)]

set(l) retrieves all the elements that appear in l, without duplicates
l.count(x) returns the number of times a specific element x appears in the list l
the * operator creates a new list with the elements of a list (in this case, [x]) repeated the specified number of times (in this case, l.count(x) times)
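Because a set has no guaranteed order, the groups from the comprehension above can come back in any order. If first-seen order matters, one possible variation (an assumption about what is wanted, not part of the original answer) is to deduplicate with `dict.fromkeys`, which keeps insertion order on Python 3.7+:

```python
l = ['b', 'a', 'b', 'a', 'c', 'c']

# set(l) has no guaranteed order; dict.fromkeys(l) keeps first-seen order
# (Python 3.7+) while still dropping duplicates.
grouped = [l.count(x) * [x] for x in dict.fromkeys(l)]
# [['b', 'b'], ['a', 'a'], ['c', 'c']]
```

Note that `l.count(x)` still scans the whole list once per distinct value, so this stays slower than a single counting pass on large inputs.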

score 0 · Answer 7 · answered Oct 12 '18 at 09:07

l = ['a','b','a','b','c','c']

want = []
for i in set(l):
    want.append(list(filter(lambda x: x == i, l)))
print(want)

answered Oct 12 '18 at 09:07

r.user.05apr


1

time complexity O(n**2) – timgeb Oct 12 '18 at 09:08
Works too, nice – U13-Forward Oct 12 '18 at 09:21
Timgeb, you are right, but maybe size/speed does not matter. – r.user.05apr Oct 12 '18 at 12:48
1

While this might answer the author's question, it lacks some explaining words and links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find [how to write a good answer](https://stackoverflow.com/help/how-to-answer) very helpful. Please edit your answer. – hellow Oct 16 '18 at 07:36
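For readers who do care about the O(n**2) cost mentioned in the comments, the same grouping can be done in a single pass. This is a sketch of one alternative (not the answerer's code), using `collections.defaultdict`; the name `buckets` is just illustrative:

```python
from collections import defaultdict

l = ['a', 'b', 'a', 'b', 'c', 'c']

# One pass over the list: append each value to the bucket for its key.
buckets = defaultdict(list)
for value in l:
    buckets[value].append(value)

want = list(buckets.values())
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
```

Each element is visited once, so the work grows linearly with the list length on average, and the groups come out in first-seen order on Python 3.7+.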

score -1 · Answer 8 · edited Oct 12 '18 at 09:01

Probably not the most efficient, but this is understandable:

l = ['a','b','a','b','c','c']
dict = {}
for i in l:
    if dict[i]:
        dict[i] += 1
    else:
         dict[i] = 1

new = []
for key in list(dict.keys()):
    new.append([key] * dict[key])

edited Oct 12 '18 at 09:01

timgeb


answered Oct 12 '18 at 08:55

DanDeg


This results in a `KeyError`. Also, do not use the name of a Python built-in (`dict`) as a variable name. – Trenton McKinney Oct 14 '21 at 16:50
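Following the comment above, here is one way the answer's code could be repaired. It is a sketch, not the answerer's own revision: the dictionary is renamed to `counts` (a name chosen here) and `dict.get` supplies a default of 0 so that a missing key no longer raises `KeyError`:

```python
l = ['a', 'b', 'a', 'b', 'c', 'c']

counts = {}  # renamed so the built-in dict is not shadowed
for i in l:
    # dict.get returns 0 for a missing key instead of raising KeyError
    counts[i] = counts.get(i, 0) + 1

new = [[key] * count for key, count in counts.items()]
# [['a', 'a'], ['b', 'b'], ['c', 'c']]
```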
