
My question is similar to this previous SO question. I have two very large lists of data (almost 20 million data points) that contain numerous consecutive duplicates. I would like to remove the consecutive duplicates as follows:

list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]  # This is 20M long!
list2 = ...  # another list of size len(list1), also 20M long!
i = 0
while i < len(list1)-1:
    if list1[i] == list1[i+1]:
        del list1[i]
        del list2[i]
    else:
        i = i+1

And the output should be [1, 2, 3, 4, 5, 1, 2] for the first list. Unfortunately, this is very slow since deleting an element from a list is a slow operation by itself. Is there any way I can speed this process up? Please note that, as shown in the above code snippet, I also need to keep track of the index i so that I can remove the corresponding element in list2.

Georgy
Dillion Ecmark
  • Maybe you can try to write the performance-critical part in C or C++ and invoke the method in Python. This should be faster compared to pure Python code. – jdhao Jul 15 '19 at 09:19

2 Answers


Python's standard library has groupby (in itertools) for exactly this:

>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [k for k,_ in groupby(list1)]
[1, 2, 3, 4, 5, 1, 2]

You can tweak it using the key argument to process the second list at the same time: zip the two lists together and group on the value from list1.

>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> list2 = [9,9,9,8,8,8,7,7,7,6,6,6,5]
>>> from operator import itemgetter
>>> keyfunc = itemgetter(0)
>>> [next(g) for k,g in groupby(zip(list1, list2), keyfunc)]
[(1, 9), (2, 7), (3, 7), (4, 7), (5, 6), (1, 6), (2, 5)]

If you want to split those pairs back into separate sequences again:

>>> list(zip(*_))  # "unzip" them (in Python 3, zip returns an iterator)
[(1, 2, 3, 4, 5, 1, 2), (9, 7, 7, 7, 6, 6, 5)]
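If memory is a concern at 20M elements, the same idea also works lazily, so the paired output never has to be materialized until you need it. A minimal sketch (the function name dedup_pairs is mine, not part of the standard library):

```python
from itertools import groupby
from operator import itemgetter

def dedup_pairs(seq1, seq2):
    """Yield (a, b) pairs, keeping only the first pair of each run
    of consecutive equal values in seq1."""
    # zip() and groupby() are both lazy, so this streams the input.
    for _, group in groupby(zip(seq1, seq2), key=itemgetter(0)):
        yield next(group)

list1 = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5, 1, 2]
list2 = [9, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5]
out1, out2 = zip(*dedup_pairs(list1, list2))
print(list(out1))  # [1, 2, 3, 4, 5, 1, 2]
print(list(out2))  # [9, 7, 7, 7, 6, 6, 5]
```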
wim
    A few wins here. First, you're using the standard library rather than rewriting code. Itertools is implemented in C for speed. Finally, you're not trying to repeatedly modify a list in place. Depending on where the data comes from or what you're doing, starting with generators and avoiding a 20MB list in the first place or keeping the results as a generator might be more efficient. – Sean McSomething Jan 06 '17 at 18:18
  • 1
    Whoah! I can't believe I spent almost an entire day on this. Your solution is fast. And when I say fast, I mean it reduces the execution time from two hours to just 1 minute! Thanks a lot. Just a quick question: when I run your code in the Python interpreter it works. However, running it in PyCharm I get this nasty error that '_' in zip(*_) is not defined. Any ideas? – Dillion Ecmark Jan 06 '17 at 18:35
  • 1
    Assign the result of the list comprehension to a variable, and then use `zip(*result)`. I used a shortcut in the interpreter (`_` refers to "last evaluated result"). – wim Jan 06 '17 at 18:39
  • Thanks a lot! It works as expected! Although somehow, when applied to my long list of data, I get some unexpected values here and there, but I guess it has to do with my old computer running out of memory! – Dillion Ecmark Jan 06 '17 at 19:05

You can use collections.deque with its maxlen argument to keep a sliding window of size 2. Then just compare the two entries in the window and append to the results when they differ.

from collections import deque

def remove_adj_dups(x):
    """
    :param x: an iterable such as a string, list, or generator,
        e.g. [1, 1, 2, 3, 3]
    :return: a list with consecutive duplicates removed, e.g. [1, 2, 3]
    """
    result = []
    # Seed the window with object(), which compares equal only to itself,
    # so the first real element is always kept. (Kudos to Trey Hunner
    # for the object() sentinel trick.)
    d = deque([object()], maxlen=2)
    for i in x:
        d.append(i)
        a, b = d
        if a != b:
            result.append(b)
    return result
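The question also needs the matching elements of a second list dropped. A hedged sketch of how the same deque-window idea could be extended to keep both lists in step (the function name remove_adj_dups_pairs is my own, and the comparison is on the first list's values only):

```python
from collections import deque

def remove_adj_dups_pairs(xs, ys):
    """Drop consecutive duplicates from xs, removing the
    corresponding elements of ys as well."""
    result1, result2 = [], []
    # object() sentinel: equal only to itself, so the first pair is kept.
    window = deque([object()], maxlen=2)
    for x, y in zip(xs, ys):
        window.append(x)
        a, b = window
        if a != b:          # new run in xs starts here
            result1.append(x)
            result2.append(y)
    return result1, result2

l1 = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5, 1, 2]
l2 = [9, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5]
print(remove_adj_dups_pairs(l1, l2))
# ([1, 2, 3, 4, 5, 1, 2], [9, 7, 7, 7, 6, 6, 5])
```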

I generated a random list of 2 million numbers between 0 and 9, so it has plenty of duplicates.

import random

def random_nums_with_dups(number_range=None, range_len=None):
    """
    :param number_range: draw numbers from range(number_range);
        the smaller this is, the more duplicates
    :param range_len: how many numbers to generate
    :return: a generator

    Note: if number_range == 2, a random binary sequence is returned.
    """
    return (random.choice(range(number_range)) for i in range(range_len))

I then tested with

range_len = 2000000

def mytest():
    return remove_adj_dups(random_nums_with_dups(number_range=10, range_len=range_len))

big_result = mytest()
print(len(big_result))

The len was 1800197 (i.e. roughly 200k consecutive duplicates were removed), in under 5 seconds, which includes the random list generator spinning up. I lack the experience/know-how to say whether it is memory efficient as well. Could someone comment, please?

DaftVader