1

I have a problem. I want to predict when the customer will place another order in how many days if an order comes in. I have already created my target variable next_purchase_in_days. This specifies in how many days the customer will place an order again. And I would like to predict this.

Since I have too few features, I want to do feature engineering. I would like to specify how many orders the customer has placed in the last 90 days. For example, I have calculated back from today's date how many orders the customer has placed in the last 90 days.

Is it better to say per row how many orders the customer has placed? Please see below for the example.

So does it make more sense to calculate this from today's date and include it as a feature or should it be recalculated for each row?

    customerId    fromDate next_purchase_in_days
0            1  2021-02-22         24
1            1  2021-03-18         4
2            1  2021-03-22         109
3            1  2021-02-10         12
4            1  2021-09-07         133
8            3  2022-05-17         61
10           3  2021-02-22         133
11           3  2021-02-22         133

Example

# What I have
    customerId    fromDate next_purchase_in_days purchase_in_last_90_days
0            1  2021-02-22         24            0
1            1  2021-03-18         4             0
2            1  2021-03-22         109           0
3            1  2021-02-10         12            0
4            1  2021-09-07         133           0
8            3  2022-05-17         61            1
10           3  2021-02-22         133           1
11           3  2021-02-22         133           1

# Or does this make more sense?
    customerId    fromDate next_purchase_in_days purchase_in_last_90_days
0            1  2021-02-22         24            1
1            1  2021-03-18         4             2
2            1  2021-03-22         109           3
3            1  2021-02-10         12            0
4            1  2021-09-07         133           0
8            3  2022-05-17         61            1
10           3  2021-02-22         133           0
11           3  2021-02-22         133           0
Test
  • 571
  • 13
  • 32
  • 1
    Yes you are totally correct. (the first thing you're doing). You should consider many variations of this type of metric. #1 thing people forget, is that for that all rows where the window you choose (90 days) goes beyond the first available date, you'll want to exclude altogether. The biggest challenge you'll have is more on the modeling side - how to choose your partition strategy. You can choose between group partitioning on customerID (because this is repeated measures) or you can choose date partitioning (so validation is always future) - this is not a clearcut decision, depends on data – Josh May 31 '22 at 21:30
  • 1
    The other challenge you have here, is censoring on the right side. You seem to assume everyone orders again, which might not be correct. – Josh May 31 '22 at 21:32
  • You can try to take sections for example the last 5, then 10, then 15 and so on days and check on the test data which step is better. Another option is to submit several options with different steps. Perhaps you have an age, gender, profession, insurance, income level and other things that you can use. If there is too much data and there is little useful information in them, then you can filter out the importance of features, that is, reduce the dimension. Since on noisy data, the model can find patterns where there are none. – inquirer Jun 03 '22 at 09:49

1 Answers1

1

You can address this in a number of ways, but something interesting to consider is the interaction between Date & Customer ID.

Dates have meaning to humans beyond just time keeping. They are associated with emotional, and culturally importance. Holidays, weekends, seasons, anniversaries etc. So there is a conditional relationship between the probability of a purchase and Events: P(x|E)

Customer Ids theoretically represent a single person, or at the very least a single business with a limited number of people responsible for purchasing. Certain people/corporations are just more likely to spend.

So here are a number of ways to address this:

  1. Find a list of holidays relevant to the users. For instance if they are US based find a list of US recognized holidays. Then create a feature based on each date: Date_Till_Next_Holiday or (DTNH for short).

  2. Dates also have cyclical aspects that can encode probability. Day of the > year (1-365), Days of the week (1-7), week numbers (1-52), Months (1-12), Quarters (1-4). I would create additional columns encoding each of these.

  3. To address the customer interaction, have a running total of past purchases. You could call it Purchases_to_date, and would be an integer (0...n) where n is the number of previous purchases. I made a notebook to show you how to do running totals.

  4. Humans tend to share purchasing patterns with other humans. You could run a k-means cluster algorithm that splits customers into 3-4 groups based on all the previous info, and then use their cluster-number as a feature. Sklearn-Kmeans

So based on all that you could engineer 8 different columns. I would then run Principle Component Analysis (PCA) to reduce that to 3-4 features. You can use Sklearn-PCA to do PCA.

J.Kent
  • 163
  • 5
  • 1
    Running PCA here might be an overkill. I think more better would to understand the impact of each of the features playing to the target using variable importance or through the regression analysis. – coldy Jun 04 '22 at 23:08