I have the following df
,
cluster_id amount inv_id inv_date
1 309.9 07121830990 2018-07-12
1 309.9 07121830990 2018-07-12
2 3130.0 20180501313000B 2018-05-01
2 3130.0 20180501313000B 2018-05-01
3 3330.50 201804253330.50 2018-04-25
3 3330.50 201804253330.50 2018-04-25
4 70.0 61518 2018-06-15
4 70.0 61518 2018-06-15
5 100.0 011318 2018-01-13
5 100.0 011318 2018-01-13
6 50.0 12202017 2017-12-20
6 50.0 12202017 2017-12-20
7 101.0 0000014482 2017-10-01
7 101.0 0000014482 2017-10-01
I want to create a boolean column dummy_inv_id
by groupby
cluster_id
, and set dummy_invoice_id
to True
if for each group,
1. inv_id (stripped non-numerics) ends with amount and the remaining part of inv_id can be coerced into a valid date which is +/- 180 days of the inv_date
or
2. inv_id (stripped non-numerics) can be coerced into a date which is +/- 180 days of the inv_date
First, I will remove any non-numerics chars from inv_id
and groupby
cluster_id
df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '')
grouped = df.groupby('cluster_id')
then convert amount
* 100 to string to facilitate matching
df['amount'] = df['amount']*100
df['amt_str'] = df['amount'].apply(str)
e.g. 309.9
to '30990'
, 3130.0
to '313000'
, here I am wondering how to check the inv_id
ends with amount
here, and then how to check if the remaining part of inv_id
can be converted into datetime
and within +/-180 days of inv_date
, or if inv_id
can be directly converted to date. especially there are a few of date formats, i.e.
071218 - 2018-07-12
20180501 - 2018-05-01
61518 - 2018-06-15
12202017 - 2017-12-20
0000014482 - cannot be converted to date
the result df
will look like,
cluster_id amount inv_id inv_date dummy_inv_id
1 309.9 07121830990 2018-07-12 True
1 309.9 07121830990 2018-07-12 True
2 3130.0 20180501313000B 2018-05-01 True
2 3130.0 20180501313000B 2018-05-01 True
3 3330.50 201804253330.50 2018-04-25 True
3 3330.50 201804253330.50 2018-04-25 True
4 70.0 61518 2018-06-15 True
4 70.0 61518 2018-06-15 True
5 100.0 011318 2018-01-13 True
5 100.0 011318 2018-01-13 True
6 50.0 12202017 2017-12-20 True
6 50.0 12202017 2017-12-20 True
7 101.0 0000014482 2017-10-01 False
7 101.0 0000014482 2017-10-01 False