
I have the following df,

cluster_id    amount    inv_id            inv_date
1             309.9     07121830990       2018-07-12
1             309.9     07121830990       2018-07-12
2             3130.0    20180501313000B   2018-05-01
2             3130.0    20180501313000B   2018-05-01
3             3330.50   201804253330.50   2018-04-25             
3             3330.50   201804253330.50   2018-04-25
4             70.0      61518             2018-06-15
4             70.0      61518             2018-06-15
5             100.0     011318            2018-01-13
5             100.0     011318            2018-01-13
6             50.0      12202017          2017-12-20
6             50.0      12202017          2017-12-20
7             101.0     0000014482        2017-10-01
7             101.0     0000014482        2017-10-01

I want to create a boolean column dummy_inv_id by grouping on cluster_id, setting dummy_inv_id to True if, for each group,

1. inv_id (stripped of non-numerics) ends with amount, and the remaining part of inv_id can be coerced into a valid date that is within +/- 180 days of inv_date

or

2. inv_id (stripped of non-numerics) can be coerced into a date that is within +/- 180 days of inv_date
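In plain Python, the check I am after would look something like the sketch below, applied row by row (just a sketch; the list of candidate date formats is an assumption based on the sample ids above):

```python
import re
from datetime import datetime

# candidate date formats; an assumption based on the sample ids above
FORMATS = ['%m%d%y', '%y%m%d', '%Y%m%d', '%m%d%Y']

def parse_any(s):
    """Return the first format that parses s completely, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None

def dummy_inv_id(inv_id, amount, inv_date):
    digits = re.sub(r'\D+', '', inv_id)            # strip non-numerics
    amt_str = str(int(round(amount * 100)))        # 309.9 -> '30990'
    candidates = [digits]                          # rule 2: whole id as a date
    if digits.endswith(amt_str):                   # rule 1: id ends with amount
        candidates.append(digits[:-len(amt_str)])  # remaining part as a date
    return any(
        d is not None and abs((d - inv_date).days) <= 180
        for d in map(parse_any, candidates)
    )
```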

First, I will remove any non-numeric characters from inv_id and group by cluster_id:

df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '', regex=True)
grouped = df.groupby('cluster_id')

then convert amount * 100 to a string to facilitate matching:

df['amt_str'] = (df['amount'] * 100).round().astype(int).astype(str)
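Note that amount * 100 is still a float, so calling str() on it directly keeps a decimal part (and can expose floating-point noise); rounding to an integer first gives clean digit strings. A small sketch:

```python
amounts = [309.9, 3130.0, 3330.50]

# str() on the raw product keeps the decimal part, e.g. '313000.0'
raw = [str(a * 100) for a in amounts]

# round to the nearest integer first, then stringify
amt_str = [str(int(round(a * 100))) for a in amounts]
print(amt_str)  # ['30990', '313000', '333050']
```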

e.g. 309.9 to '30990', 3130.0 to '313000'. Here I am wondering how to check whether inv_id ends with amount, and then whether the remaining part of inv_id can be converted to a datetime within +/- 180 days of inv_date, or whether inv_id itself can be directly converted to a date. This is complicated by the several date formats involved, i.e.

071218 - 2018-07-12
20180501 - 2018-05-01
61518 - 2018-06-15
12202017 - 2017-12-20
0000014482 - cannot be converted to a date

the result df will look like,

cluster_id    amount    inv_id            inv_date      dummy_inv_id
1             309.9     07121830990       2018-07-12    True
1             309.9     07121830990       2018-07-12    True
2             3130.0    20180501313000B   2018-05-01    True
2             3130.0    20180501313000B   2018-05-01    True
3             3330.50   201804253330.50   2018-04-25    True          
3             3330.50   201804253330.50   2018-04-25    True
4             70.0      61518             2018-06-15    True
4             70.0      61518             2018-06-15    True
5             100.0     011318            2018-01-13    True
5             100.0     011318            2018-01-13    True
6             50.0      12202017          2017-12-20    True
6             50.0      12202017          2017-12-20    True
7             101.0     0000014482        2017-10-01    False
7             101.0     0000014482        2017-10-01    False
daiyue

1 Answer


The idea is to create a helper dictionary mapping possible datetime formats to the number of characters to slice, then convert in a list comprehension; errors='coerce' creates NaT for values that don't match:

import pandas as pd
from functools import reduce

#pad with zeros to length 6
s = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)

formats = {'%m%d%y':6, 
           '%y%m%d':6,
           '%Y%m%d':8, 
           '%m%d%Y':8}
L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]

But some formats can match incorrectly, so datetimes outside a plausible range are converted to NaT:

L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
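The where/between combination keeps in-range timestamps and masks everything else to NaT; a minimal illustration:

```python
import pandas as pd

x = pd.Series(pd.to_datetime(['1930-01-01', '2018-07-12']))

# values outside the window become NaT, in-range values pass through
masked = x.where(x.between('2000-01-01', pd.Timestamp.now()))
print(masked.isna().tolist())  # [True, False]
```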

And combine all non-NaT values together with Series.combine_first:

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
print (s2)
0    2018-07-12
1    2018-07-12
2    2018-05-01
3    2018-05-01
4    2018-04-25
5    2018-04-25
6    2018-06-15
7    2018-06-15
8    2018-01-13
9    2018-01-13
10   2017-12-20
11   2017-12-20
12          NaT
13          NaT
Name: inv_id, dtype: datetime64[ns]
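Series.combine_first fills the left Series' missing values (NaT here) from the right, so the reduce keeps, for each row, the first format that parsed successfully. A minimal illustration:

```python
import pandas as pd

a = pd.Series(pd.to_datetime(['2018-07-12', None]))  # second row failed to parse
b = pd.Series(pd.to_datetime([None, '2018-05-01']))  # first row failed to parse

combined = a.combine_first(b)
print(combined.tolist())
```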

Last, check whether the parsed date is within +/- 180 days of inv_date (NaT rows fail the check and become False):

df['new'] = s2.between(df['inv_date'] - pd.Timedelta(180, unit='d'), df['inv_date'] + pd.Timedelta(180, unit='d'))

print (df)
    cluster_id  amount           inv_id    inv_date    new
0            1   309.9      07121830990  2018-07-12   True
1            1   309.9      07121830990  2018-07-12   True
2            2  3130.0  20180501313000B  2018-05-01   True
3            2  3130.0  20180501313000B  2018-05-01   True
4            3  3330.5  201804253330.50  2018-04-25   True
5            3  3330.5  201804253330.50  2018-04-25   True
6            4    70.0            61518  2018-06-15   True
7            4    70.0            61518  2018-06-15   True
8            5   100.0           011318  2018-01-13   True
9            5   100.0           011318  2018-01-13   True
10           6    50.0         12202017  2017-12-20   True
11           6    50.0         12202017  2017-12-20   True
12           7   101.0       0000014482  2017-10-01  False
13           7   101.0       0000014482  2017-10-01  False

EDIT:

Added a solution that removes the amount substring from the end of inv_id:

import re
import pandas as pd
from functools import reduce

df['amt_str'] = (df['amount']*100).round().astype(int).astype(str)
df['inv_str'] = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)

#https://stackoverflow.com/a/1038845/2901002
df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
print (df)
    cluster_id  amount           inv_id    inv_date amt_str     inv_str
0            1   309.9      07121830990  2018-07-12   30990      071218
1            1   309.9      07121830990  2018-07-12   30990      071218
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425
6            4    70.0            61518  2018-06-15    7000      061518
7            4    70.0            61518  2018-06-15    7000      061518
8            5   100.0           011318  2018-01-13   10000      011318
9            5   100.0           011318  2018-01-13   10000      011318
10           6    50.0         12202017  2017-12-20    5000    12202017
11           6    50.0         12202017  2017-12-20    5000    12202017
12           7   101.0       0000014482  2017-10-01   10100  0000014482
13           7   101.0       0000014482  2017-10-01   10100  0000014482
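The '$' anchor in the re.sub call matters: the amount is removed only when it sits at the very end of the id (since amt_str is all digits there is nothing to escape, though re.escape would be the safer general choice). For example:

```python
import re

print(re.sub('30990$', '', '07121830990'))  # 071218   (trailing match removed)
print(re.sub('5000$', '', '12202017'))      # 12202017 (no trailing match, unchanged)
```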

formats = {'%m%d%y':6, 
           '%y%m%d':6,
           '%Y%m%d':8, 
           '%m%d%Y':8}
L = [pd.to_datetime(df['inv_str'].str[:v], format=k, errors='coerce') for k, v in formats.items()]
L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
df['new'] = s2.between(df['inv_date'] - pd.Timedelta(180, unit='d'), df['inv_date'] + pd.Timedelta(180, unit='d'))
print (df)
    cluster_id  amount           inv_id    inv_date amt_str     inv_str    new
0            1   309.9      07121830990  2018-07-12   30990      071218   True
1            1   309.9      07121830990  2018-07-12   30990      071218   True
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
6            4    70.0            61518  2018-06-15    7000      061518   True
7            4    70.0            61518  2018-06-15    7000      061518   True
8            5   100.0           011318  2018-01-13   10000      011318   True
9            5   100.0           011318  2018-01-13   10000      011318   True
10           6    50.0         12202017  2017-12-20    5000    12202017   True
11           6    50.0         12202017  2017-12-20    5000    12202017   True
12           7   101.0       0000014482  2017-10-01   10100  0000014482  False
13           7   101.0       0000014482  2017-10-01   10100  0000014482  False
jezrael
  • the solution is almost perfect, but it has not considered `amount` in `df`. and within +/-180 days of `inv_date`, i.e. `df['dummy_invoice_id'] = s2.between(df['inv_date'] - pd.Timedelta(180, unit='d'), df['inv_date'] + pd.Timedelta(180, unit='d'))` – daiyue Aug 31 '18 at 14:17
  • since an `inv_id` value like `100117` can be inferred as `2017-10-01` or `2017-01-10`, and `combine_first` will only preserve `2017-01-10` given the order of the `formats` dict defined, I am wondering whether it is possible to keep all inferred non-NaT dates and then check if any of them is within +/- 180 days of `inv_date`; if so set `new` to `True`, and `False` otherwise – daiyue Sep 04 '18 at 16:39
  • @daiyue - Hmmm, not really easy. `combine_first` keeps the first non-NaT value that matched, so the ordering of the formats matters. In my solution it is a dictionary, and under Python < 3.7 dict ordering is not guaranteed, so first change it to a list of tuples like `formats = [('%m%d%y',6), ('%d%m%y',6), ('%y%m%d',6),('%Y%m%d',8),('%m%d%Y',8)]` and `L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats]`. – jezrael Sep 05 '18 at 06:00
  • @daiyue - So for `100117` the first value in formats, `('%m%d%y',6)`, is matched. If you change the ordering to `formats = [('%d%m%y',6), ('%m%d%y',6), ('%y%m%d',6),('%Y%m%d',8),('%m%d%Y',8)]`, then `('%d%m%y',6)` is matched. Unfortunately it is not possible to say which datetime is correct; you need to run the solution twice with the orderings swapped and check both. If there is still a problem, let me know. – jezrael Sep 05 '18 at 06:04
  • @daiyue - The worst general scenario is many ambiguous dates. A possible solution would be to filter them into a separate DataFrame, use apply with many if-elif-else branches to find the dates, and then append them back to the main df. – jezrael Sep 05 '18 at 06:13
  • created a new post at https://stackoverflow.com/questions/52183021/pandas-how-to-create-a-boolean-column-based-on-other-boolean-columns-in-the-df – daiyue Sep 05 '18 at 10:55