I have a CSV with over 600 columns and thousands of rows. The original file contains more customers and departments, but this example includes the critical pieces.
Note: I derived the Site column from the A_Loc1 and B_Loc1 columns to make it easier to compare and group the rows, but this is not a requirement. If the groupby can be performed without it, I am open to other approaches.
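For reference, the derivation is just a per-row coalesce of the location columns. A minimal sketch of that step (assuming the first non-null *_Loc1 value on each row identifies the site, including C_Loc1 for the Dep C rows):

# Sketch: Site = first non-null *_Loc1 value on each row
# (Rec rows have no location columns filled, so their Site stays NaN)
df['Site'] = df['A_Loc1'].fillna(df['B_Loc1']).fillna(df['C_Loc1'])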
I need to compare dates from different rows and columns, based on Cust_ID and Site. For example, confirm that A_Date1 is less than B_Date1, but only for rows with the same Cust_ID and Site value.
So for Cust_ID 100 and Site CA2.2, A_Date1 is 8/1/2015 and B_Date1 is 6/15/2018:
# Pseudocode for the per-group check
if A_Date1 > B_Date1:
    df['Result'] = "Fail"
else:
    df['Result'] = ""
In the above case, no action is needed because A_Date1 is less than B_Date1.
However, for Cust_ID 100 and Site CA2.0, A_Date1 is 7/1/2019 and B_Date1 is 12/15/2018, so the Result column should be Fail for the Dep B rows where Site is CA2.0.
I am open to any efficient, flexible approach. There are other comparisons I will need to perform with different rows and columns, but this should get me started.
Expected Results:
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+
| | Result | Cust_ID | Dep | Order_Num | Site | Rec_Date1 | Rec_DateX | A_Date1 | A_Loc1 | A_DateX | B_Loc1 | B_Date1 | B_Date2 | B_DateX | C_Date1 | C_Loc1 | C_DateX |
|----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------|
| 0 | | 100 | A | 1 | CA2.2 | | | 8/1/2015 | CA2.2 | | | | | | | | |
| 1 | | 100 | A | 2 | CA2.0 | | | 7/1/2019 | CA2.0 | 8/21/2019 | | | | | | | |
| 2 | | 100 | B | 1 | CA2.2 | | | | | | CA2.2 | 6/15/2018 | 6/15/2016 | 8/1/2019 | | | |
| 3 | Fail | 100 | B | 2 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | | | | |
| 4 | Fail | 100 | B | 3 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | 8/21/2019 | | | |
| 5 | | 100 | C | 1 | CA2.2 | | | | | | | | | | 6/15/2016 | CA2.2 | |
| 6 | | 100 | C | 2 | CA2.0 | | | | | | | | | | 12/15/2017 | CA2.0 | 8/21/2019 |
| 7 | | 100 | Rec | | | 6/12/2019 | 8/1/2019 | | | | | | | | | | |
| 8 | | 200 | A | 1 | CA2.2 | | | 8/1/2015 | CA2.2 | | | | | | | | |
| 9 | | 200 | A | 2 | CA2.0 | | | 7/1/2015 | CA2.0 | 8/21/2019 | | | | | | | |
| 10 | | 200 | B | 1 | CA2.2 | | | | | | CA2.2 | 6/15/2018 | 6/15/2016 | 8/1/2019 | | | |
| 11 | | 200 | B | 2 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | | | | |
| 12 | | 200 | B | 3 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | 8/21/2019 | | | |
| 13 | | 200 | C | 1 | CA2.2 | | | | | | | | | | 6/15/2016 | CA2.2 | |
| 14 | | 200 | C | 2 | CA2.0 | | | | | | | | | | 12/15/2017 | CA2.0 | 8/21/2019 |
| 15 | | 200 | Rec | | | 6/12/2019 | 8/1/2019 | | | | | | | | | | |
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+
What I've tried:
# Returns: ValueError: Length of values does not match length of index
df['Result'] = df.loc[df.A_Date1 < df.B_Date1].groupby(['Cust_ID','Site'],as_index=False)
# Returns: ValueError: Length of values does not match length of index
df["Result"] = df.loc[(((df["A_Date1"] != "N/A")
& (df["B_Date1"] != "N/A"))
& (df.A_Date1 < df.B_Date1))].groupby([
'Cust_ID','Site'],as_index=False)
# Returns: ValueError: unknown type str224
conditions = "(x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1'])"
df["Result"] = df.groupby(['Cust_ID','Site']).apply(lambda x: pd.eval(conditions))
# TypeError: incompatible index of inserted column with frame index
df = df[df.Dep != 'Rec']
df['Result'] = df.groupby(['Cust_ID','Site'],as_index = False).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))
# This produces FALSE for all rows
grouped_df = df.groupby(['Cust_ID','Site']).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))
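As far as I can tell, the first two attempts fail because the filtered groupby result is shorter than the frame and not aligned with its index, and the apply-based attempts return a result indexed by the group keys rather than by the original rows. A sketch of a version that keeps the result aligned with the original index (it assumes the date columns have already been parsed with pd.to_datetime and that each Cust_ID/Site group holds a single A_Date1, as in the sample):

import numpy as np

# transform keeps the original row index, so the mask lines up with df
a_date = df.groupby(['Cust_ID','Site'])['A_Date1'].transform('first')  # first non-null per group
df['Result'] = np.where(a_date.notna() & df['B_Date1'].notna()
                        & (a_date > df['B_Date1']), 'Fail', '')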
Update:
I've figured out the solution for these two specific columns (A_Date1 and B_Date1): first convert the columns to datetime, add the Result column, then group and perform the comparison.
However, I have approximately 50 columns in my original file that I need to compare. It would be ideal to iterate over a list of columns (or a dictionary) to perform these steps; see the sketch after the solution below.
## Solution for A_Date1 and B_Date1
import numpy as np
import pandas as pd

## Convert both date columns to datetime, replacing unparsable values with NaT
df['A_Date1'] = pd.to_datetime(df['A_Date1'], errors="coerce")
df['B_Date1'] = pd.to_datetime(df['B_Date1'], errors="coerce")
# Add Result column
df.insert(loc=0, column="Result", value=np.nan)
# groupby Cust_ID and Site, then fill A_Date1 forward and back within each group
df['A_Date1'] = df.groupby(['Cust_ID','Site'], sort=False)['A_Date1'].apply(lambda x: x.ffill().bfill())
# Perform comparison
df.loc[(df["A_Date1"].notna()) & (df["B_Date1"].notna())
       & (df["A_Date1"] > df["B_Date1"]),
       "Result"] = "Fail"