New python programmer here. Sample data is shown below. I have a dataframe that has < 1,000 rows. That df came from reading an Excel workbook. The "Store" column has multiple kinds of data: 1) a store number, 2) Bank charges, 3) Other kinds of transactions and 4) Exceptions. How should I approach this?
Specifically,do I try to filter the data into separate df's based on type?
OR: do I pass the initial df one time and set a category flag to indicate the category and then use that column to segregate the data. I've gotten the 1): data is ambiguous error: a.any() (etc) 2): error : can't do 'str' & 'str', etc, 3) using if constructs,. passing the dataset inside a for loop, checking for 2 conditions (for stores for example -store len <=8 & Bank Acct isna) based on that test set values in a new column - error is that is universally creates a CATEGORY column and then populates 100% of the entries as 'Store'. I'm doing something wrong and I've looked at it so much now I can't see the forest for the trees kind of thing.
Sample data: Code that creates the sample data
df = pd.DataFrame.from_dict(
{
'Store': ['Bank Fees', 'Bt12600', 'Bt12300', 'Something Else', 'AZ1001', 'TX2002','GA5009'],
'Bank Acct': ['B12343', 'B12344', '', 'B12345','', '', 'B1238'],
'Amount': [1000.00, 2000.00, 1500.00, 2500.00, 55.00, 3000.00, 3500.00],
}
)
df['Store Length'] = df['Store'].apply(len)
df['Store Length'] = df['Store Length'].apply(str) # for filtering on STORES only. They ALL have < 8 characters
df = df.replace('', np.nan) # Set all blank values as NaN
df['Included'] = "No" # default value - change to "YES" as rows are categorized ( STORE | BANK | OTHER | XCPT)
df['Category'] = 'Exception' # default value - set for all then change by row as pciked off XCPT means Exception
print(df)
Sample Data:
Store Bank Acct Amount Store Length Included Category
0 Bank Fees B12343 1000.0 9 No Exception
1 Bt12600 B12344 2000.0 7 No Exception
2 Bt12300 NaN 1500.0 7 No Exception
3 Something Else B12345 2500.0 14 No Exception
4 AZ1001 NaN 55.0 6 No Exception
5 TX2002 NaN 3000.0 6 No Exception
6 GA5009 B1238 3500.0 6 No Exception
Here is some of what I've tried:
cond1 = (df.loc[(df['Store Length'].isin(['6','7']))])
cond2 = (df.loc[(df['Bank Acct'].isna())])
# Also tried:
df.loc[(df['Store Length'] <= 8) & (df['Bank Acct'].isna())]
print(df)
throws this error:
TypeError: '<=' not supported between instances of 'str' and 'int
I've gotten several variations on this: str & str; str & bool; etc.
I've got a sneaking suspicion I'm not using loc right above? Yes/No - why?
I'm probably doing something wrong that will be simple to fix, but I've watched videos, read Google articles, etc. I really have tried to be self sufficient.
if ((cond1) and (cond2)): # Store length & Bank Acct: Len <8 chars & NaN for the Bank Acct
df['Category'] = 'STORE' # change the Category from Exception to STORE
df['Included'] = 'YES' # change the included flag to YES
pass
I can then come back and "pick off" the STORES separate and distinct from the BANK Fees, OTHER , etc. either as separate dataframes or otherwise.
Throws this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks for sharing your knowledge and helping a noob! It is very much appreicated.