3

I have the following dataframe:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2

I have the following process:

start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx = np.argmax(end_date_range.values > tra, axis=1)

df['window_start_dt'] = np.take(start_date_range, idx)
df['window_end_dt'] = end_date_range[idx]

However, I need to use np.where to fix an issue with df['window_start_dt'] with the following condition:

If 'transaction_dt' <= 'window_start_dt' then select the previous datetime value in start_date_range.

Pylander
  • 1,531
  • 1
  • 17
  • 36
  • seems like you know how to solve the problem. what's the hang up? what's not clear about the `numpy.where` documentation? is your question more about accessing previous rows of the dataframe? – Paul H Dec 12 '17 at 04:59
  • thanks, i am close now i think. i'm just not clear on how to use np.where to conditionally replace values in 'window_start_dt' from an array or list like start_date_range – Pylander Dec 12 '17 at 05:04
  • Luckily there's [online documentation of `numpy`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html) where you can look it up. – Daniel F Dec 12 '17 at 06:48

3 Answers3

2

I think you can use:

tra = df['transaction_dt'].values[:, None]
idx = np.argmax(end_date_range.values > tra, axis=1)

sdr = start_date_range[idx]
m = df['transaction_dt'] < sdr
#change value by condition with previous
df["window_start_dt"] = np.where(m, start_date_range[idx - 1], sdr)

df['window_end_dt'] = end_date_range[idx]
print (df)
    customer_id transaction_dt product  price  units window_start_dt  \
0             1     2004-01-02  thing1     25     47      2004-01-01   
1             1     2004-01-17  thing2    150      8      2004-01-01   
2             2     2004-01-29  thing2    150     25      2004-01-01   
3             3     2017-07-15  thing3     55     17      2017-06-21   
4             3     2016-05-12  thing3     55     47      2016-04-27   
5             4     2012-02-23  thing2    150     22      2012-02-18   
6             4     2009-10-10  thing1     25     12      2009-10-01   
7             4     2014-04-04  thing2    150      2      2014-03-09   
8             5     2008-07-09  thing2    150     43      2008-07-08   
9             5     2004-01-30  thing1     25     40      2004-01-01   
10            5     2004-01-31  thing1     25     22      2004-01-01   
11            5     2004-02-01  thing1     25      2      2004-01-31  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This is great now I see how to set up np.where in the future! There is just one last tweak. The comparison operator on line 5 should just '<' not '<='. Row 10 in the results ends up with a window of 2004-01-01 - 2004-02-29 otherwise for a transaction date of 2004-01-31. I'll accept after the tweak and reference from the other question as well. – Pylander Dec 12 '17 at 17:44
  • I am on phone only, so canniy change output. But `=` was removed. Thanks for comment. – jezrael Dec 12 '17 at 17:49
1

You can use numpy.where() like :

numpy.where(df['transaction_dt'] <= df['window_start_dt'], *operation when True*, *operation when False*)
Tanu
  • 1,503
  • 12
  • 21
0

What about something like this?

# get argmax indices
idx = df.transaction_dt.apply(lambda x: np.argmax(end_date_range > x)).values
# define window_start_dt
df = df.assign(window_start_dt = start_date_range[idx])

# identify exceptions
mask = df.transaction_dt.le(df.window_start_dt)
# replace with shifted start_date_rage
df.loc[mask, "window_start_dt"] = start_date_range[idx - 1][mask]

Output:

    customer_id transaction_dt product  price  units window_start_dt
0             1     2004-01-02  thing1     25     47      2004-01-01
1             1     2004-01-17  thing2    150      8      2004-01-01
2             2     2004-01-29  thing2    150     25      2004-01-01
3             3     2017-07-15  thing3     55     17      2017-06-21
4             3     2016-05-12  thing3     55     47      2016-04-27
5             4     2012-02-23  thing2    150     22      2012-02-18
6             4     2009-10-10  thing1     25     12      2009-10-01
7             4     2014-04-04  thing2    150      2      2014-03-09
8             5     2008-07-09  thing2    150     43      2008-07-08
9             5     2004-01-30  thing1     25     40      2004-01-01
10            5     2004-01-31  thing1     25     22      2004-01-01
11            5     2004-02-01  thing1     25      2      2004-01-31
andrew_reece
  • 20,390
  • 3
  • 33
  • 58