I have two lists of string names:

>>> len(list_a)
14740
>>> len(list_b)
14277

I need to get the 463 names in list_a that are not in list_b. Based on other articles and answers, I have tried:

a_set = set(list_a)
b_set = set(list_b)

diff1 = a_set - b_set
>>> len(diff1)
1457

diff2 = b_set - a_set
>>> len(diff2)
994

Interestingly, 1457 - 994 = 463, so it feels like I am close but misunderstanding something. What am I missing?

Python 3

Thanks.

screwed
  • It means the two sets don't have uniquely overlapping elements. – Alex Huszagh Jun 30 '17 at 21:16
  • Have you checked capitalization? Maybe you have 994 values which are eg "Ann" in one list and "ANN" in the other? – Hugh Bothwell Jun 30 '17 at 21:17
  • How do you know there are in fact 463 names in list_a that are not in list_b? You are assuming that because list_a is longer, it contains all the elements in list_b, plus more. But that's not necessarily the case. It may be that there are names in each list that are not in the other. – BrenBarn Jun 30 '17 at 21:18
  • You can try something like `diff = [i for i in list_a if i not in set(list_b)]` – Vinícius Figueiredo Jun 30 '17 at 21:19
  • check len(list_a) == len(a_set) – PRMoureu Jun 30 '17 at 21:20
  • This isn't a Python problem, it's a set theory problem. Your results indicate that list_b is not a subset of list_a. len(diff1) - len(diff2) must be the same as len(a_set) - len(b_set). Mathematically, it couldn't possibly be otherwise. What you're missing: both lists contain some names in common and some that aren't. – Paul Cornelius Jun 30 '17 at 21:30
  • Can you create an example with roughly 5 elements? See how to create a [mcve]. – Peter Wood Jun 30 '17 at 21:36

1 Answer

list_b is not necessarily a subset of list_a. Consider two much smaller lists with a similar makeup.

list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]

As you can see, list_b contains 3, which is not in list_a, yet list_a is still the longer list.
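
The same effect can be checked directly on these small lists. The sketch below uses the names `only_a` and `only_b`, which are mine and not from the question. It shows that the two differences are not the same size, and that their sizes differ by exactly the difference in set sizes:

```python
list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]

set_a = set(list_a)  # {1, 2, 4, 6}: the duplicate 4 collapses
set_b = set(list_b)  # {1, 3, 4}

only_a = set_a - set_b  # values in list_a but not in list_b
only_b = set_b - set_a  # values in list_b but not in list_a

print(sorted(only_a))  # [2, 6]
print(sorted(only_b))  # [3]

# The gap between the two differences equals the gap between the set sizes.
assert len(only_a) - len(only_b) == len(set_a) - len(set_b)
```

In the question, 1457 - 994 = 463 is this same identity: 1457 names appear only in list_a and 994 appear only in list_b.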

Simple Loop

If you want the values that are in list_a and not in list_b, the following is a fairly direct translation into Python. Converting list_b to a set first gives (on average) constant-time membership checks.

list_a = [1, 2, 4, 4, 6]
list_b = [1, 3, 4]
set_b = set(list_b)
list_result = []
for a_ele in list_a:
    if a_ele not in set_b:
        list_result.append(a_ele)

print(list_result)
# [2, 6]

Note: If you do not want duplicate values in your result list, you could iterate over set(list_a) instead of list_a in the for loop. Be aware that a set does not preserve the original order of list_a.
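
If the original order matters, one possible sketch is to de-duplicate with dict.fromkeys, which keeps values in first-seen order (dicts preserve insertion order in Python 3.7+). The name `unique_result` is only illustrative, and the input is the example with a repeated 2 from the comments:

```python
list_a = [1, 2, 2, 4, 6]
list_b = [1, 3, 4]
set_b = set(list_b)

# dict.fromkeys keeps one entry per value, in first-seen order
unique_result = [a_ele for a_ele in dict.fromkeys(list_a) if a_ele not in set_b]

print(unique_result)
# [2, 6]
```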

Set Logic

You were right to reach for set logic; it works essentially as you've written it.

set_a = set(list_a)
set_b = set(list_b)

list_result = list(set_a - set_b)
print(list_result)
# [2, 6]

This creates a list of the unique elements of list_a that are not in list_b. Because sets are unordered, the order of the result is not guaranteed.
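
To see why a count is not what you expected, it can help to look at both directions at once, along with the number of duplicates in each list. A rough sketch, assuming the names are plain strings; the helper name `compare_names` and the sample names are hypothetical:

```python
def compare_names(list_a, list_b):
    """Summarize how two lists of names differ."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
        # entries lost when each list is converted to a set
        "duplicates_in_a": len(list_a) - len(set_a),
        "duplicates_in_b": len(list_b) - len(set_b),
    }

report = compare_names(["Ann", "Bob", "Cid", "Cid"], ["ANN", "Bob"])
print(report["only_in_a"])        # ['Ann', 'Cid']
print(report["only_in_b"])        # ['ANN']
print(report["duplicates_in_a"])  # 1
```

Here "Ann" and "ANN" count as different names. If they should be treated as equal, normalizing each name with str.casefold() before comparing would be one option.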

foslock
  • Note that the simple loop version can produce duplicates in the result, e.g. `list_a = [1, 2, 2, 4, 6]` will produce `[2, 2, 6]` – Barmar Jun 30 '17 at 21:36
  • @Barmar Thanks for the note, edited to include notice about iterating over the set instead. – foslock Jun 30 '17 at 21:39
  • Thanks! This and BrenBarn's comment helped me get perspective. Adding the 994 names to list_b then subtracting 1457 resulted in both lists being identical. – screwed Jun 30 '17 at 22:56