1

I have this Python Pandas DataFrame DF :

DICT = {  'letter': ['A','B','C','A','B','C','A','B','C'],
          'number': [1,1,1,2,2,2,3,3,3],
          'word'  : ['one','two','three','three','two','one','two','one','three']}

DF = pd.DataFrame(DICT)

Which looks like :

  letter  number   word
0      A       1    one
1      B       1    two
2      C       1  three
3      A       2  three
4      B       2    two
5      C       2    one
6      A       3    two
7      B       3    one
8      C       3  three

And I want to extract the lines

  letter  number   word
       A       1    one
       B       2    two
       C       3  three

First I tired :

DF[(DF['letter'].isin(("A","B","C"))) & 
    DF['number'].isin((1,2,3))        &
    DF['word'].isin(('one','two','three'))]

Of course it didn't work, and everything has been selected

Then I tested :

Bool = DF[['letter','number','word']].isin(("A",1,"one"))
DF[np.all(Bool,axis=1)]

Good, it works ! but only for one line ... If we take the next step and give an iterable to .isin() :

Bool = DF[['letter','number','word']].isin((("A",1,"one"),
                                            ("B",2,"two"),
                                            ("C",3,"three")))

Then it fails, the Boolean array is full of False ...

What I'm doing wrong ? Is there a more elegant way to do this selection based on several columns ?

(Anyway, I want to avoid a for loop, because the real DataFrames I'm using are really big, so I'm looking for the fastest optimal way to do the job)

Covich
  • 2,544
  • 4
  • 26
  • 37

1 Answers1

1

Idea is create new DataFrame with all triple values and then merge with original DataFrame:

L = [("A",1,"one"),
     ("B",2,"two"),
     ("C",3,"three")]

df1 = pd.DataFrame(L, columns=['letter','number','word'])
print (df1)
  letter  number   word
0      A       1    one
1      B       2    two
2      C       3  three

df = DF.merge(df1)
print (df)
  letter  number   word
0      A       1    one
1      B       2    two
2      C       3  three

Another idea is create list of tuples, convert to Series and then compare by isin:

s = pd.Series(list(map(tuple, DF[['letter','number','word']].values.tolist())),index=DF.index)
df1 = DF[s.isin(L)]
print (df1)
  letter  number   word
0      A       1    one
4      B       2    two
8      C       3  three
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Alternative solution with a conversion to tuples worked ... but it is not really intuitive, a better solution for this common problem should be implemented by pandas' developers ;) – Covich May 06 '19 at 14:51