Pandas: set column value based on other columns: Filtering Data and changing column values

Question

New python programmer here. Sample data is shown below. I have a dataframe that has < 1,000 rows. That df came from reading an Excel workbook. The "Store" column has multiple kinds of data: 1) a store number, 2) Bank charges, 3) Other kinds of transactions and 4) Exceptions. How should I approach this?

Specifically,do I try to filter the data into separate df's based on type?

OR: do I pass the initial df one time and set a category flag to indicate the category and then use that column to segregate the data. I've gotten the 1): data is ambiguous error: a.any() (etc) 2): error : can't do 'str' & 'str', etc, 3) using if constructs,. passing the dataset inside a for loop, checking for 2 conditions (for stores for example -store len <=8 & Bank Acct isna) based on that test set values in a new column - error is that is universally creates a CATEGORY column and then populates 100% of the entries as 'Store'. I'm doing something wrong and I've looked at it so much now I can't see the forest for the trees kind of thing.

Sample data: Code that creates the sample data

df = pd.DataFrame.from_dict(
    {
        'Store': ['Bank Fees', 'Bt12600', 'Bt12300', 'Something Else', 'AZ1001', 'TX2002','GA5009'],
        'Bank Acct': ['B12343', 'B12344', '', 'B12345','', '', 'B1238'],
        'Amount': [1000.00, 2000.00, 1500.00, 2500.00, 55.00, 3000.00, 3500.00],
    }
) 

df['Store Length'] = df['Store'].apply(len)
df['Store Length'] = df['Store Length'].apply(str)  # for filtering on STORES only.  They ALL have < 8 characters
df = df.replace('', np.nan)                         #  Set all blank values as NaN

df['Included'] = "No"         # default value - change to "YES" as rows are categorized ( STORE | BANK | OTHER | XCPT)
df['Category'] = 'Exception'  # default value - set for all then change by row as pciked off  XCPT means Exception
print(df)

Sample Data:

            Store Bank Acct  Amount Store Length Included   Category
0       Bank Fees    B12343  1000.0            9       No  Exception
1         Bt12600    B12344  2000.0            7       No  Exception
2         Bt12300       NaN  1500.0            7       No  Exception
3  Something Else    B12345  2500.0           14       No  Exception
4          AZ1001       NaN    55.0            6       No  Exception
5          TX2002       NaN  3000.0            6       No  Exception
6          GA5009     B1238  3500.0            6       No  Exception

Here is some of what I've tried:

cond1 = (df.loc[(df['Store Length'].isin(['6','7']))])  
cond2 = (df.loc[(df['Bank Acct'].isna())])

#  Also tried:
df.loc[(df['Store Length'] <= 8) & (df['Bank Acct'].isna())]

print(df)

throws this error:

TypeError: '<=' not supported between instances of 'str' and 'int

I've gotten several variations on this: str & str; str & bool; etc.

I've got a sneaking suspicion I'm not using loc right above? Yes/No - why?

I'm probably doing something wrong that will be simple to fix, but I've watched videos, read Google articles, etc. I really have tried to be self sufficient.

if ((cond1) and (cond2)):     #  Store length & Bank Acct:  Len <8 chars & NaN  for the Bank Acct
    df['Category'] = 'STORE'  # change the Category from Exception to STORE
    df['Included'] = 'YES'    # change the included flag to YES
    pass

I can then come back and "pick off" the STORES separate and distinct from the BANK Fees, OTHER , etc. either as separate dataframes or otherwise.

Throws this error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for sharing your knowledge and helping a noob! It is very much appreicated.

what is it that you are trying to do with this dataset? why are you concerned with the store column? — Shirin Yavari, Mar 24 '23 at 16:53
The end result of this is to produce a Journal Entry for uploading to an accounting system. The store column holds a value thatis different from the same store in the accoutning system. So basically what I'm trying to do is to take a hodgepodege of data and put it into buckets of which one is the store. Then I use those buckets to prepare and process a csv file with a certain set of columns, output it and the accounting dapartment imports it into their system. I take a day or two's work, each month, and do it in less than a minute. Big productivity improvement. Make sense? — Fred, Mar 24 '23 at 19:02

Michael Cao · Answer 1 · 2023-03-26T04:17:53.967

There is a lot to unpack here. I will try to address some of the issues and will update the answer based on your feedback.

First, some typing issues. You are currently storing the Store Length column as a string. This causes issues when trying to do numerical comparisons which is why you are getting the TypeError: '<=' not supported between instances of 'str' and 'int. I recommend using this to calculate the Store Length column.

df['Store Length'] = df['Store'].str.len()

The other sorta typing issue is the df['Bank Acct'].isna(). That method does not treat empty strings as null or n/a. Whether an empty string pulled from excel gets stored as an empty string or None depends on how you read the excel. In your current case, I would recommend

df['Bank Acct'].str.len() > 0

Second, your conditions. The reason you're running into ValueError:The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() is that if statements expect the condition to be simply true or false. However, if you are checking a list of possible inequalities, you essentially end up some something like

if [True, True, False, True, ...]:

and the code doesn't know whether to proceed as the condition is True or False.

There are a number of ways to address this. The simplest, but not as efficient way is to go through each row and do the comparison.

for index, row in df.iterrows():
   if row['Store Length'] <= 8 and row['Bank Acct'].str.len() > 0:
       # Do something

Thanks for the quick response. I really wasn't expecting that. In my "production" (ie test) code I have this:df['Store Length'] = df['Store'].apply(len) df['Store Length'] = df['Store Length'].apply(str) df = df.replace('', np.nan) — Fred, Mar 24 '23 at 17:09
I made the changes and now am getting the "ambiguous truth value error". I also created a Jupyter notebook and placed in at: https://github.com/fboothe/py-pub/blob/main/FLAP%20Category.ipynb. I've got to figure out how to get the stack loads formatted like everyone else does - like in columns, etc. I should be a little easier to read from there until I can get this fixed. Again, my thanks and if you do such things, how about letting me get you a cup at starbucks to express my appriciation? — Fred, Mar 24 '23 at 17:40
Sorry Michael - had a brain cramp apparently. I'm still trying to get this to work. I'll get back to you soon. Thanks for helping. Fred — Fred, Mar 27 '23 at 20:41

Pandas: set column value based on other columns: Filtering Data and changing column values

1 Answers1