3

Assume we have two lists, always of the same length and always containing strings.

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'gg', 'sot']

We need to find:

How many items of list2 have to change for it to be equal to list1.

So for the previous example it should return 2.

For this example:

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'sot', 'sot']

it should return 1

and finally for this example:

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['ts', 'ts', 'ts', 'ts', 'ts', 'ts']

it should return 5.

We do not care which elements should change to what. Nor do we care about the order, which means that

['gg', 'gg', 'gg', 'gg', 'gg', 'sot'] 
and
['gg', 'gg', 'sot', 'gg', 'gg', 'gg']

are equal and the result for them should be 0.

The lists could have length 6, 8, 20 or anything else, and sometimes they contain more distinct values.

I tried a lot of things, like set(list1) - set(list2), list(set(list1).difference(list2)) and set(list1).symmetric_difference(set(list2)), but without any success.

Martijn Pieters
GeorgeGeorgitsis

4 Answers

3

You could leverage the many possibilities Counter offers:

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'gg', 'sot']

from collections import Counter

sum((Counter(list1) - Counter(list2)).values())
# 2

Let's check with the other examples:

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'sot', 'sot']

sum((Counter(list1) - Counter(list2)).values())
# 1

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['ts', 'ts', 'ts', 'ts', 'ts', 'ts']

sum((Counter(list1) - Counter(list2)).values())
# 5

list1 = ['gg', 'gg', 'gg', 'gg', 'gg', 'sot'] 
list2 = ['gg', 'gg', 'sot', 'gg', 'gg', 'gg']

sum((Counter(list1) - Counter(list2)).values())
# 0

Details

By using Counter, you get a count of every element in each list, in the form of a dictionary. Let's go back to the first example:

c1 = Counter(list1)
# Counter({'sot': 2, 'ts': 1, 'gg': 3})

c2 = Counter(list2)
# Counter({'gg': 5, 'sot': 1})

Now we somehow would like to get an understanding of:

  • Which items are present in list1 but not in list2

  • For every item, whether it is already in list2 or not, how many more of it list2 needs so that both lists contain the same count

We can take advantage of the fact that counters support mathematical operations whose results are multisets, i.e. counters that only keep counts greater than zero. Since we're looking for the difference between both counters, we can subtract them and see which elements, and how many of each, are needed in list2.

So how does subtraction between Counters work? Let's check with a simple example:

Counter({1:4, 2: 1}) - Counter({1:1, 3:1})  
# Counter({1: 3, 2: 1})

What this does is subtract the counts of corresponding elements and keep only the elements of the first counter whose resulting count is positive, so the order of the operands matters. Going back to the proposed example, subtracting the counters of both lists yields:

sub = Counter(list1) - Counter(list2)
# Counter({'sot': 1, 'ts': 1})

Now we simply need to add up the values of all the keys, which can be done with:

sum(sub.values())
# 2
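As a small sketch of the point about order and non-positive counts, the snippet below contrasts the binary `-` operator with the in-place `subtract()` method on the first example. The variable names `diff` and `signed` are only illustrative.

```python
from collections import Counter

c1 = Counter(['sot', 'sot', 'ts', 'gg', 'gg', 'gg'])
c2 = Counter(['gg', 'gg', 'gg', 'gg', 'gg', 'sot'])

# Binary "-" keeps only strictly positive counts:
# 'gg' would be 3 - 5 = -2, so it is dropped from the result.
diff = c1 - c2
print(diff == Counter({'sot': 1, 'ts': 1}))

# The subtract() method works in place and keeps zero and negative counts.
signed = Counter(c1)      # copy, so c1 stays untouched
signed.subtract(c2)
print(signed['gg'])       # the deficit of 'gg' in list1 is kept as -2
```

Because `-` discards the negative side, summing its values counts only what list1 has in surplus.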
yatu
2

You can use collections.Counter for this: count how many times each item occurs in each list, and take the difference between the two counts.

from collections import Counter

def func(list1, list2):
    # Convert both lists to counters and subtract them
    c = Counter(list1) - Counter(list2)

    # Sum up all values in the new counter
    return sum(c.values())

The outputs are

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'gg', 'sot']
print(func(list1, list2))
#2

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'sot', 'sot']
print(func(list1, list2))
#1

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['ts', 'ts', 'ts', 'ts', 'ts', 'ts']
print(func(list1, list2))
#5
Devesh Kumar Singh
2

Using set will cause problems if the difference is in how many of a certain item are present. Instead, use collections.Counter. As explained in other answers, you can create a Counter for both lists and then use - to get the difference of those and get the sum of the values. Note, however, that this will only work if the lists have the same size. If the lists do not have the same number of elements, you will get a different number of diverging elements depending on which list is subtracted from which.
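A minimal sketch of both caveats, assuming nothing beyond the standard library: the first part uses the question's first example to show that a set difference loses the counts, the second part uses two lists of unequal length to show that `-` depends on the direction. All variable names are illustrative.

```python
from collections import Counter

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
list2 = ['gg', 'gg', 'gg', 'gg', 'gg', 'sot']

# Sets collapse duplicates, so the missing second 'sot' goes unnoticed.
set_diff = set(list1) - set(list2)
print(len(set_diff))      # only 'ts' is reported

# Counters keep the multiplicities.
counter_diff = sum((Counter(list1) - Counter(list2)).values())
print(counter_diff)       # 'ts' and one 'sot'

# With lists of different lengths the direction of "-" matters.
a, b = [1, 1, 1, 1, 2], [2]
forward = sum((Counter(a) - Counter(b)).values())
backward = sum((Counter(b) - Counter(a)).values())
print(forward, backward)
```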

With subtract, on the other hand, you get the difference in both directions, with positive numbers for items that are "too many" and negative numbers for items that are "too few". This means that you may have to divide the result by 2, i.e. sum(...) / 2, but it should work better for differently sized lists.

>>> from collections import Counter
>>> list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
>>> list2 = ['gg', 'gg', 'gg', 'gg', 'sot', 'sot']
>>> c = Counter(list1)
>>> c.subtract(Counter(list2))  # in place, returns None
>>> c
Counter({'ts': 1, 'sot': 0, 'gg': -1})
>>> sum(map(abs, c.values()))  # counts each mismatch twice; halve it to get 1
2

Another possibility that also works reliably with differently sized lists is using & to get the common elements and then comparing those to the total number of elements in the larger list:

>>> list1 = [1,1,1,1,2]
>>> list2 = [2]
>>> Counter(list1) & Counter(list2)
Counter({2: 1})
>>> max(len(list1), len(list2)) - sum((Counter(list1) & Counter(list2)).values())
4
tobias_k
  • You don't want to use `subtract()`, because then you'd end up with 0, always. The two multisets have the same number of elements (N), so both counters have the same total value. If you use `subtract`, you end up with zero, always, as you get positive counts for the surplus of specific elements in one, and negative counts for the surplus of specific elements in the other, which will *always be the same number*. – Martijn Pieters May 14 '19 at 10:57
  • The order in which you subtract the multisets *doesn't matter*, because they have the same total. The output is always the surplus of the first multiset, which must be balanced by the surplus of the other multiset. – Martijn Pieters May 14 '19 at 10:58
  • @MartijnPieters Yeah, I think I mentioned that somewhere in that paragraph. I did not see the "always the same number of elements" note, but still, now this works whether they have the same number of elements or not, and as far as I've seen, none of the other answers mentions that they will _only_ work for the same number. – tobias_k May 14 '19 at 11:01
  • It is much easier to use `-` subtraction in both directions and so keep the counts separate, rather than have to turn the counts in to absolutes and then divide by two. – Martijn Pieters May 14 '19 at 11:15
  • @MartijnPieters Yes, I guess you could also add the sums for `c1-c2` and `c2-c1` in the different-sizes-case. Anyway, the point of my answer is that `-` only works if the lists have the same length, which I now understand is what OP was asking, but I think it's still worth providing an alternative that also works in the case of lists with unequal length, wouldn't you agree? – tobias_k May 14 '19 at 11:28
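The suggestion in the comments, subtracting with `-` in both directions and keeping the two sums separate, could be sketched as follows. `surplus_both_ways` is a hypothetical helper name, not something from the standard library.

```python
from collections import Counter

def surplus_both_ways(iterable1, iterable2):
    """Hypothetical helper: (items extra in the first, items extra in the second)."""
    c1, c2 = Counter(iterable1), Counter(iterable2)
    return sum((c1 - c2).values()), sum((c2 - c1).values())

list1 = [1, 1, 1, 1, 2]
list2 = [2]
extra1, extra2 = surplus_both_ways(list1, list2)
print(extra1, extra2)

# The larger of the two surpluses equals the "&"-based expression from the answer.
common = sum((Counter(list1) & Counter(list2)).values())
print(max(extra1, extra2) == max(len(list1), len(list2)) - common)
```

For lists of equal length both numbers coincide, so either one answers the original question; for unequal lengths the caller can decide whether the maximum, the sum, or the pair itself is the useful result.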
2

You are not talking about lists here. Your problem is a multiset problem, because order doesn't matter, but you do need to know how many values you have of each type. Multisets are sometimes called bags or msets.

The Python standard library has a multiset implementation: collections.Counter(), which maps unique elements to a count. Use it here:

from collections import Counter

mset1 = Counter(list1)
mset2 = Counter(list2)

# sum the total number of elements that are different between
# the two multisets
sum((mset1 - mset2).values())

Subtracting one counter from another gives you a multiset of all elements that were in the first multiset but not in the other, and sum(mset.values()) adds up to the total number of elements.

Because the inputs are always the same length and you only need to know how many elements are different, it doesn't matter in which order you subtract the multisets. You will always get the right answer: sum((mset1 - mset2).values()) and sum((mset2 - mset1).values()) always produce the exact same number.

That's because both multisets have N elements, of which K are different. So each multiset has exactly K extra elements that are not in the other multiset, and is missing K elements that are present in the other. Subtraction with - gives you the K extra elements in the first multiset that are missing from the other.
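A quick check of this symmetry on the three inputs from the question, written as a sketch; the names `forward`, `backward` and `results` are illustrative.

```python
from collections import Counter

list1 = ['sot', 'sot', 'ts', 'gg', 'gg', 'gg']
examples = [
    ['gg', 'gg', 'gg', 'gg', 'gg', 'sot'],
    ['gg', 'gg', 'gg', 'gg', 'sot', 'sot'],
    ['ts', 'ts', 'ts', 'ts', 'ts', 'ts'],
]

results = []
for list2 in examples:
    forward = sum((Counter(list1) - Counter(list2)).values())
    backward = sum((Counter(list2) - Counter(list1)).values())
    # Same length N on both sides, so the K surplus items must match.
    assert forward == backward
    results.append(forward)

print(results)
```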

Putting this into a function:

def mset_diff(iterable1, iterable2):
    return sum((Counter(iterable1) - Counter(iterable2)).values())

and applied to your inputs:

>>> mset_diff(['sot', 'sot', 'ts', 'gg', 'gg', 'gg'], ['gg', 'gg', 'gg', 'gg', 'gg', 'sot'])
2
>>> mset_diff(['sot', 'sot', 'ts', 'gg', 'gg', 'gg'], ['gg', 'gg', 'gg', 'gg', 'sot', 'sot'])
1
>>> mset_diff(['sot', 'sot', 'ts', 'gg', 'gg', 'gg'], ['ts', 'ts', 'ts', 'ts', 'ts', 'ts'])
5

The Counter() class is a subclass of dict, so counting elements is fast and efficient, and calculating the difference between two counters takes O(N) linear time.
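To make the linear-time claim concrete, here is a hand-rolled sketch that does the same bookkeeping with a plain dict: one pass over each input, then one pass over the distinct keys. `mset_diff_manual` is a hypothetical name used only for this illustration; in practice the Counter version above is the simpler choice.

```python
def mset_diff_manual(iterable1, iterable2):
    """Illustrative O(N) equivalent of sum((Counter(a) - Counter(b)).values())."""
    counts = {}
    for item in iterable1:            # +1 for every item in the first input
        counts[item] = counts.get(item, 0) + 1
    for item in iterable2:            # -1 for every item in the second input
        counts[item] = counts.get(item, 0) - 1
    # Only the surplus of the first input (positive counts) is summed.
    return sum(v for v in counts.values() if v > 0)

print(mset_diff_manual(['sot', 'sot', 'ts', 'gg', 'gg', 'gg'],
                       ['gg', 'gg', 'gg', 'gg', 'gg', 'sot']))
```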

Martijn Pieters