
I have a big DataFrame (~1 million rows) and I need to delete some rows based on the unique identifier Trade_Id. I have the content of these rows (45,000 in my test database) in another DataFrame variable called tib. My approach is this:

lentib = len(tib)
for i in range(lentib):  # VERY SLOW: one full scan and copy of dat per id
    dat = dat[dat.Trade_Id != tib.Trade_Id[i]]

But the problem is that it is very slow, and doing dat[dat.Trade_Id != tib.Trade_Id] does not work.

Does someone have a better idea to make this more computationally efficient? I have other databases like this one to work with, and I would not like to spend two days computing this.
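
For reference, a minimal sketch (with small illustrative frames, not the real data) of why the element-wise comparison fails: comparing two Series aligns them on the index and compares them pairwise, which is not a membership test and raises an error when the labels differ.

import pandas as pd

# Illustrative stand-ins for dat and tib (not the real data).
dat = pd.DataFrame({'Trade_Id': [1, 2, 3, 4, 5]})
tib = pd.DataFrame({'Trade_Id': [2, 4]})

# != compares the two Series element by element after index alignment,
# so differently-labeled Series cannot be compared at all.
try:
    dat[dat.Trade_Id != tib.Trade_Id]
except ValueError as exc:
    print(exc)  # e.g. "Can only compare identically-labeled Series objects"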

VCLL

1 Answer


Filter them out using isin and the negation operator ~:

dat = dat[~dat['Trade_Id'].isin(tib['Trade_Id'])]

isin will produce a boolean mask, and you can invert it using ~; this removes the rows whose ids appear in tib.

Example:

In [127]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(10)})
df

Out[127]:
   col1
0     0
1     1
2     2
3     3
4     4
5     5
6     6
7     7
8     8
9     9

In [128]:    
df[~df['col1'].isin([2,5,8])]

Out[128]:
   col1
0     0
1     1
3     3
4     4
6     6
7     7
9     9

The mask looks like this:

In [129]:
df['col1'].isin([2,5,8])

Out[129]:
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8     True
9    False
Name: col1, dtype: bool
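
For a sense of scale, a rough timing sketch under assumed sizes matching the question (~1 million rows, 45,000 ids to drop; the data here is synthetic, not from the original post):

import time
import numpy as np
import pandas as pd

# Synthetic data at roughly the question's scale.
dat = pd.DataFrame({'Trade_Id': np.arange(1_000_000)})
drop_ids = np.random.choice(dat['Trade_Id'].values, size=45_000, replace=False)

start = time.perf_counter()
filtered = dat[~dat['Trade_Id'].isin(drop_ids)]
print(f"isin filter: {time.perf_counter() - start:.3f}s")

# isin hashes the 45,000 ids once and filters in a single O(n + m) pass;
# the loop in the question instead re-scans and copies the frame 45,000 times.
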
EdChum
  • Oh, thank you! I was sure there was a more efficient way, but I wasn't able to find it. It works very fast. – VCLL Sep 25 '15 at 09:38