Delete duplicated of an object within a list

Question

I got a waveform object, define as following:

class wfm:
    """Class defining a waveform characterized by:
        - A name
        - An electrode configuration
        - An amplitude (mA)
        - A pulse width (microseconds)"""

    def __init__(self, name, config, amp, width=300):
        self.name = name
        self.config = config
        self.amp = amp
        self.width = width

    def __eq__(self, other):
        return type(other) is self.__class__ and other.name == self.name and other.config == self.config and other.amp == self.amp and other.width == self.width

    def __ne__(self, other):
        return not self.__eq__(other)

Through parsing, I get a list called waveforms with 770 instance of wfm in it. There is a lot of duplicate, and I need to delete them.

My idea was to get the ID of equivalent object, store the largest ID in a list, and then loop on all the waveforms from the end while popping out each duplicate.

Code:

duplicate_ID = []
for i in range(len(waveforms)):
    for j in range(i+1, len(waveforms)):
        if waveforms[i] == waveforms[j]:
            duplicate_ID.append(waveforms.index(waveforms[j]))
            print ('{} eq {}'.format(i, j))

duplicate_ID = list(set(duplicate_ID)) # If I don't do that; 17k IDs

Turns out (thx to the print) that I have duplicates that don't appear into the ID list, for instance 750 is a duplicate of 763 (print says it; test too) and yet none of this 2 IDs appears in my duplicate list.

I'm quite sure there is a better solution that this method (which doesn't yet work), and I would be glad to hear it. Thanks for the help!

EDIT: More complicated scenario

I've got a more complicated scenario. I got 2 classes, wfm (see above) and stim:

class stim:
    """Class defining the waveform used for a stimultion by:
        - Duration (milliseconds)
        - Frequence Hz)
        - Pattern of waveforms"""

    def __init__(self, dur, f, pattern):
        self.duration = dur
        self.fq = f
        self.pattern = pattern

    def __eq__(self, other):
        return type(other) is self.__class__ and other.duration == self.duration and other.fq == self.fq and other.pattern == self.pattern

    def __ne__(self, other):
        return not self.__eq__(other)

I parse my files to fill a dict: paradigm. It looks like that:

paradigm[file name STR] = (list of waveforms, list of stimulations)

# example:
paradigm('myfile.xml') = ([wfm1, ..., wfm10], [stim1, ..., stim5])

Once again, I want to delete the duplicates, i.e. I want to keep only the data where:

Waveforms are the same
And stim is the same

Example:

file1 has 10 waveforms and file2 has the same 10 waveforms.
file1 has stim1 and stim2 ; file2 has stim3, sitm 4 and stim 5.

stim1 and stim3 are the same; so since the waveforms are also the same, I want to keep:
file1: 10 waveforms and stim1 and stim2
file2: 10 waveforms and stim 4 and stim5

The correlation is kinda messy in my head, so I got a few difficulties finding the right storage solution for wave forms and stimulation in order to compare them easly. If you got any idea, I'd be glad to hear it. Thanks!

This is irrelevant to your question but I think you wanted to make `width` a default argument but placed the value `300` in a wrong place, right? — Georgy, Feb 08 '18 at 15:18
Possible duplicate of [Remove duplicates in list of object with Python](https://stackoverflow.com/questions/4169252/remove-duplicates-in-list-of-object-with-python) — Georgy, Feb 08 '18 at 15:19
It will be much easier if you use [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) instead of a `class`. `WFM = namedtuple('WFM', ['name', 'config', 'amp', 'width'])`. Then calling `set()` on a list of instances of `WFM` will remove all the duplicates. — Georgy, Feb 08 '18 at 15:26
Yep thanks for the remark on width... Small mistake thou since it's constant most of the time. I did find some post looking quite similar, and yet I haven't solve my problem, that's why I posted this. Thanks for the answer Alex, I'll try it ! — Mathieu, Feb 08 '18 at 15:45
Works perfectly. If I get it right, the immutable solution doesn't need to store indexes. Though I don't completely understand what it does and how that key works. Anyway, thanks! — Mathieu, Feb 09 '18 at 08:13

Alex · Accepted Answer · 2018-02-08T15:48:18.950

Problem

The .index method uses the .__eq__ method you overloaded. So

waveforms.index(waveforms[j])

will always find the first instance of the waveform in the list containing the same attributes as waveforms[j].

w1 = wfm('a', {'test_param': 4}, 3, 2.0)
w2 = wfm('b', {'test_param': 4}, 3, 2.0)
w3 = wfm('a', {'test_param': 4}, 3, 2.0)

w1 == w3  # True
w2 == w3  # False

waveforms = [w1, w2, w3]
waveforms.index(waveforms[2]) == waveforms.index(waveforms[0]) == 0  # True

Solution

Immutable

You don't need to store list indices if you do this immutably:

key = lambda w: hash(str(vars(w)))
dupes = set()
unique = [dupes.add(key(w)) or w for w in waveforms if key(w) not in dupes]

unique == [w1, w2]  # True

Mutable

key = lambda w: hash(str(vars(w)))
seen = set()
idxs = [i if key(w) in seen else seen.add(key(w)) for i, w in enumerate(waveforms)]

for idx in filter(None, idxs[::-1]):
    waveforms.pop(idx)

waveforms == [w1, w2]  # True

Big O Analysis

It's a good habit to think about big O complexity when writing algorithms (though optimization should come at the cost of readability only when needed). In this case, these solutions are a bit more readable and also big O optimal.

Your initial solution was O(n^2) due to the double for loop.

Both solutions provided are O(n).

From what you did, I got a quite similar problem that appeared; and I can't manage to solve it... Would you mind taking a look at the edit ? — Mathieu, Feb 09 '18 at 10:57