1

I am struggling to get this code that I wrote to work. I know it is probably an easy fix but I can't seem to get it to work correctly. In essence, what I want is to create a boolean mask on a pandas dataframe that returns only rows where the values in "Actual Manufacturer" or "Actual Collection" exist in "PqaQuestion". It works well with one set of criteria but adding in multiple criteria messes things up a bit. I can't seem to get the "or" operator in there without triggering the userwarning: boolean series key will be reindexed to match dataframe index, which does mess up the output. If anyone could help fix this issue but also help me understand why this happens I would greatly appreciate it. I've seen other posts about the subject but none that explain it and I can't seem to tailor other posts to my own situation.

names= ['PqaPrSKU', 'PrName', 'White Label Manufacturer', 'White Label Collection', 'Actual Manufacturer', 'Actual MaID', 'Actual Collection', 'PqaID', 'PqaQuestion', 'UpdatedQuestion', 'PanID', 'PanAnswer', 'UpdatedAnswer', 'DateAdded', 'PrBclgID']


def match_function(column1_value, column2_value, column3_value):
     return (column2_value is not None) and (column1_value is not None) and (column3_value is not None) and (str(column2_value).lower() in str(column1_value).lower()) or (str(column3_value).lower() in str(column1_value).lower())


import pandas as pd
df = pd.read_csv('Bucket61(8.22).csv', names= names, skipinitialspace=True, skiprows=1)
#print(df.from_records(data))

indexer = df.apply(lambda row: match_function(row["PqaQuestion"], row["Actual Collection"], row["Actual Manufacturer"]), axis=1)


filtered_df = df[indexer]

print(filtered_df[indexer])
#print(df[indexer])
from pandas import ExcelWriter

writer = ExcelWriter('ScrubbedQATemplate.xlsx')
filtered_df.to_excel(writer, 'Sheet1')
writer.save()
jpp
  • 159,742
  • 34
  • 281
  • 339
brisenburg
  • 11
  • 3
  • You should provide some sample data and your expected output in the form of a [mcve]. Right now you are using `apply(axis=1)` and this can and should be avoided for this kind of operation. – ALollz Oct 01 '18 at 18:03
  • As per ALollz' comment, it's hard to see what exactly you're trying to do. In any case, `or` won't work element-by-element. If you just use `|` (the element-wise OR operator) instead of `or`, you might get it to work. – Joooeey Oct 01 '18 at 18:06
  • Possible duplicate of [how do you filter pandas dataframes by multiple columns](https://stackoverflow.com/questions/22086116/how-do-you-filter-pandas-dataframes-by-multiple-columns) – Joooeey Oct 01 '18 at 18:08
  • Did an answer below help? If so, feel free to [accept](https://stackoverflow.com/help/accepted-answer), or ask for clarification. – jpp Oct 03 '18 at 22:51

1 Answers1

0

You haven't explained precisely what match_function is trying to achieve, but pd.DataFrame.apply should be avoided wherever possible. It's nothing more than a thinly veiled loop.

Let's instead try and convert criteria in match_function as a Boolean series mask:

def match_function(column1_value, column2_value, column3_value):
     return (column2_value is not None) and (column1_value is not None) and \
            (column3_value is not None) and \
            (str(column2_value).lower() in str(column1_value).lower()) or \
            (str(column3_value).lower() in str(column1_value).lower())

# match_function(row["PqaQuestion"], row["Actual Collection"], row["Actual Manufacturer"])

Here's one attempt:

cols = ['PqaQuestion', 'Actual Collection', 'Actual Manufacturer']

A = df[cols].astype(str).values

m1 = df[cols].notnull().all(1)
m2 = np.array([j.lower() in i.lower() for i, j, k in A])
m3 = np.array([k.lower() in i.lower() for i, j, k in A])

filtered_df = df[m1 & (m2 | m3)]

A few points to note:

  1. We use vectorised operations via & / | rather than scalar operations and / or which would be applied with row-wise operations.
  2. m1 combines your first 3 conditions into one via pd.DataFrame.all.
  3. Boolean conditions in Pandas series and NumPy arrays can be combined via & / | operators.
jpp
  • 159,742
  • 34
  • 281
  • 339
  • this works great. Thank you so much for the help! your right, df.apply seem to be giving me too much of a headache for its worth. This seems much more solid – brisenburg Oct 02 '18 at 18:45