3

I have a specific problem to solve: I need to delete the elements of one pandas Series (ser1) that also appear in another pandas Series (ser2).

I have tried several approaches that do not work. The closest thing I found uses the np.intersect1d() function on arrays. It finds the common values, but when I try to drop the indexes that are equal to these values, I get a bunch of errors.

I've tried several other approaches that did not work either and have been at it for 3 hours now, so I am about to give up.

Here are the two series:

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

The result should be:

print(ser1)
0   1
1   2
2   3

I am sure there is a simple solution.

E_net4
JacobMarlo

3 Answers

7

Use .isin:

>>> ser1[~ser1.isin(ser2)]
0    1
1    2
2    3
dtype: int64

The NumPy equivalent is np.setdiff1d (not np.intersect1d):

>>> np.setdiff1d(ser1, ser2)
array([1, 2, 3])
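
The two approaches above agree on the question's data, but they are not interchangeable in general. The sketch below uses hypothetical data (unsorted, with a duplicate; not from the question) to show the difference: the `isin` mask keeps order, duplicates, and index labels, while `np.setdiff1d` returns a sorted array of unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen to be unsorted and to contain a duplicate.
ser1 = pd.Series([5, 3, 3, 1, 4])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Boolean mask: keeps the original order, duplicates, and index labels.
masked = ser1[~ser1.isin(ser2)]

# Set difference: a sorted NumPy array of unique values, no index.
unique_sorted = np.setdiff1d(ser1, ser2)

print(masked.tolist())         # [3, 3, 1]
print(masked.index.tolist())   # [1, 2, 3]
print(unique_sorted.tolist())  # [1, 3]
```

If the goal is "the same Series minus the shared values", the mask is probably the closer fit; `np.setdiff1d` suits cases where only the set of remaining values matters.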
Corralien
5

A NumPy alternative, np.isin:

import pandas as pd
import numpy as np

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

res = ser1[~np.isin(ser1, ser2)]
print(res)

Micro-Benchmark

import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)
%timeit res = ser1[~np.isin(ser1, ser2)]
136 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit res = ser1[~ser1.isin(ser2)]
209 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Index(ser1).difference(ser2).to_series()
277 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
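
`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The names `candidates` and `timings` and the repeat counts are illustrative choices, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)

# The three expressions from the benchmark, wrapped as callables.
candidates = {
    "np.isin": lambda: ser1[~np.isin(ser1, ser2)],
    "Series.isin": lambda: ser1[~ser1.isin(ser2)],
    "Index.difference": lambda: pd.Index(ser1).difference(ser2).to_series(),
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats, 200 calls each, converted to seconds per call.
    best = min(timeit.repeat(func, number=200, repeat=5)) / 200
    timings[name] = best
    print(f"{name}: {best * 1e6:.1f} µs per call")
```

One caveat when reading the timings: on this repeated data the two `isin` variants return all 300 matching rows, while `Index.difference` returns only the 3 unique values, so the third candidate is not computing quite the same result.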
Dani Mesejo
2

You can use set operations on a pandas Index. I am not sure how the speed compares to isin, though:

pd.Index(ser1).difference(ser2).to_series()
Out[35]: 
1    1
2    2
3    3
dtype: int64
sammywemmy
  • This works well, thank you. Is there a reason why the index of the new series doesn't start at 0, though? – JacobMarlo Oct 29 '21 at 20:05
  • Ah, the values are repeated: `to_series` uses them both as the index and as the data. You can pass a new index to `to_series`, or just use `reset_index`. – sammywemmy Oct 29 '21 at 20:06
  • To reset the index I used `ser1 = pd.Index(ser1).difference(ser2).to_series(); ser1 = ser1.reset_index(); print(ser1)` and it gave me this output: `Name: alphabets, dtype: object index 0 0 1 1 1 2 2 2 3 3` – JacobMarlo Oct 29 '21 at 20:16
  • Use `reset_index(drop=True)`. – sammywemmy Oct 29 '21 at 20:17
  • Works, thank you for your answer, much appreciated! – JacobMarlo Oct 29 '21 at 20:19
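
To make the comment thread above concrete, here is a small sketch of the three options discussed: plain `reset_index()`, `reset_index(drop=True)`, and passing a new index to `to_series`. The variable names (`diff`, `as_frame`, `as_series`, `reindexed`) are illustrative only.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

idx = pd.Index(ser1).difference(ser2)

# to_series() reuses the values as index labels, so the index is 1, 2, 3.
diff = idx.to_series()
print(diff.index.tolist())       # [1, 2, 3]

# Plain reset_index() moves the old index into a column and returns a DataFrame.
as_frame = diff.reset_index()
print(type(as_frame).__name__)   # DataFrame

# drop=True discards the old index and keeps a Series indexed 0..n-1.
as_series = diff.reset_index(drop=True)
print(as_series.index.tolist())  # [0, 1, 2]

# Alternative: hand to_series a fresh index up front.
reindexed = idx.to_series(index=pd.RangeIndex(len(idx)))
print(reindexed.tolist())        # [1, 2, 3]
```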