unexpected results getting differences between python list A and B

Question

I have two lists of string names:

>>> len(list_a)
14740
>>> len(list_b)
14277

I need to get the 463 names in list_a that are not in list_b. Based on other articles and answers, I have tried:

>>> a_set = set(list_a)
>>> b_set = set(list_b)
>>> diff1 = a_set - b_set
>>> len(diff1)
1457
>>> diff2 = b_set - a_set
>>> len(diff2)
994

Interestingly, 1457 - 994 = 463, so it feels like I am close but missing or misunderstanding something. What am I missing?

Python 3

Thanks.

It means the two sets don't have uniquely overlapping elements. — Alex Huszagh, Jun 30 '17 at 21:16
Have you checked capitalization? Maybe you have 994 values which are eg "Ann" in one list and "ANN" in the other? — Hugh Bothwell, Jun 30 '17 at 21:17
How do you know there are in fact 463 names in list_a that are not in list_b? You are assuming that because list_a is longer, it contains all the elements in list_b, plus more. But that's not necessarily the case. It may be that there are names in each list that are not in the other. — BrenBarn, Jun 30 '17 at 21:18
You can try something like `diff = [i for i in list_a if i not in set(list_b)]` — Vinícius Figueiredo, Jun 30 '17 at 21:19
This isn't a Python problem, it's a set theory problem. Your results indicate that list_b is not a subset of list_a. len(diff1) - len(diff2) must be the same as len(a_set) - len(b_set). Mathematically, it couldn't possibly be otherwise. What you're missing: each list contains some names that are also in the other list and some that are not. — Paul Cornelius, Jun 30 '17 at 21:30
Can you create an example with roughly 5 elements? See how to create a [mcve]. — Peter Wood, Jun 30 '17 at 21:36

foslock · Accepted Answer · 2017-06-30T21:38:37.873

list_b is not necessarily a subset of list_a: it can contain names that do not appear in list_a at all, even though it is the shorter list. Consider two much smaller lists with a similar makeup.

list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]

As you can see, list_b contains 3, which is not in list_a, yet list_a is still the longer list.
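To make the arithmetic from the question concrete, here is a minimal sketch on the same small lists. The names only_in_a, only_in_b and in_both are illustrative and not from the question. It shows that the elements common to both sets cancel out, which is why the two difference counts subtract to the same number as the set sizes do.

```python
# Small stand-ins for the real lists (same values as above).
list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]

set_a = set(list_a)
set_b = set(list_b)

only_in_a = set_a - set_b   # elements of set_a missing from set_b: {2, 6}
only_in_b = set_b - set_a   # elements of set_b missing from set_a: {3}
in_both = set_a & set_b     # elements present in both sets: {1, 4}

# The shared elements cancel, so these two differences always match.
assert len(only_in_a) - len(only_in_b) == len(set_a) - len(set_b)

print(len(only_in_a), len(only_in_b), len(in_both))
# 2 1 2
```

In the question, the equivalent of only_in_b has 994 entries, so list_b is not contained in list_a.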

Simple Loop

If you want the values that are in list_a and not in list_b, the following is a fairly direct translation into Python. Converting list_b to a set first gives average-case constant-time membership tests.

list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]
set_b = set(list_b)
list_result = []
for a_ele in list_a:
    if a_ele not in set_b:
        list_result.append(a_ele)

print(list_result)
# [2, 6]

Note: If you do not want duplicate values in your result list, you can iterate over set(list_a) instead of list_a in the for loop, although the result will then no longer follow the order of list_a.
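If you want to drop duplicates but keep the order in which values first appear in list_a, one possible variation (not part of the original answer) is to de-duplicate with dict.fromkeys, which preserves insertion order in Python 3.7+. The sketch below uses a list_a with a repeated 2 to show the difference. The names with_duplicates and unique_in_order are illustrative.

```python
list_a = [1, 2, 2, 4, 6]
list_b = [1, 3, 4]
set_b = set(list_b)  # build the set once, outside the loop

# Same logic as the simple loop: repeated values in list_a are kept.
with_duplicates = [x for x in list_a if x not in set_b]
print(with_duplicates)
# [2, 2, 6]

# dict.fromkeys keeps only the first occurrence of each value, in order.
unique_in_order = [x for x in dict.fromkeys(list_a) if x not in set_b]
print(unique_in_order)
# [2, 6]
```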

Set Logic

You were right to reach for set logic here, and the code is essentially what you have already written.

set_a = set(list_a)
set_b = set(list_b)

list_result = list(set_a - set_b)
print(list_result)
# [2, 6]  (element order is not guaranteed)

This creates a list of the unique elements of list_a that do not appear in list_b. Duplicates are dropped and the original order of list_a is not preserved.
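Because sets are unordered, the list built from a set difference can come out in any order. For a list of names, a small sketch like the following sorts the result so the output is reproducible. The names used here are made up for illustration and are not from the question's data.

```python
# Hypothetical sample data standing in for the real name lists.
list_a = ["Ann", "Bob", "Cid", "Ann", "Eve"]
list_b = ["Ann", "Dan", "Cid"]

# Unique names in list_a that are absent from list_b, in sorted order.
result = sorted(set(list_a) - set(list_b))
print(result)
# ['Bob', 'Eve']
```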

Note that the simple loop version can produce duplicates in the result, e.g. `list_a = [1, 2, 2, 4, 6]` will produce `[2, 2, 6]` — Barmar, Jun 30 '17 at 21:36
@Barmar Thanks for the note, edited to include notice about iterating over the set instead. — foslock, Jun 30 '17 at 21:39
Thanks! This and BrenBarn's comment helped me get perspective. Adding the 994 names to list_b then subtracting 1457 resulted in both lists being identical. — screwed, Jun 30 '17 at 22:56
