I am trying to predict the probability that a customer will reorder a given order cart, based on their order history, using the Accelerated Failure Time (AFT) model in PySpark. My input data contains:
- various features of the customer and the respective order cart as predictors (assembled into a single feature vector, as sketched after this list),
- the number of days between two consecutive orders as the label, and
- previously observed orders marked as uncensored and future orders as censored.
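
In the real pipeline the predictor columns are combined into a single vector column before training. A minimal sketch of that step, assuming hypothetical raw columns named `avg_basket_size` and `days_since_signup` on a `raw_orders` DataFrame:

```python
from pyspark.ml.feature import VectorAssembler

# Combine hypothetical raw predictor columns into the "features" vector
# expected by AFTSurvivalRegression; the real data has many more columns.
assembler = VectorAssembler(
    inputCols=["avg_basket_size", "days_since_signup"],
    outputCol="features")
training = assembler.transform(raw_orders)
```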
PySpark is the choice here because of restrictions on the environment; I have no other way to process the huge volume of order history (~40 GB). Here's my sample implementation:
```python
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

# `spark` is an existing SparkSession.
# label  = days between two consecutive orders
# censor = 1.0 for observed (uncensored) orders, 0.0 for censored ones
training = spark.createDataFrame([
    (1, 1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (1, 2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (2, 3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (2, 0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (3, 4.199, 0.0, Vectors.dense(0.795, -0.226))],
    ["customer_id", "label", "censor", "features"])

aft = AFTSurvivalRegression()
model = aft.fit(training)
```
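
For reference, transforming the training data adds a `prediction` column (the expected time to the next order, in the label's units), and I believe survival-time quantiles can also be requested through the `quantileProbabilities` and `quantilesCol` parameters, although I am not sure this is the right route to reorder probabilities:

```python
# Request survival-time quantiles alongside the point prediction.
aft = AFTSurvivalRegression(quantileProbabilities=[0.25, 0.5, 0.75],
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).select("customer_id", "prediction", "quantiles").show()
```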
Questions:
- Does the AFTSurvivalRegression class in pyspark.ml.regression have the ability to group (cluster) the records in my dataset based on customer_id, so that each customer's order history is modelled separately? If so, how can I implement this?
- The desired output should contain the probabilities of a particular customer reordering different order carts. How can I obtain these values by extending the implementation above? My current understanding of how they might be derived is sketched below.
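
For context, here is how I currently think the probability could be computed by hand. Since Spark's AFTSurvivalRegression fits a Weibull AFT model, the survival function for a cart with feature vector x should be S(t | x) = exp(-exp((ln t - (intercept + coefficients·x)) / scale)), so the probability of a reorder within t days would be 1 - S(t | x). A minimal sketch under that assumption, with an arbitrary 7-day horizon; I am not certain this is the intended way to use the fitted model:

```python
import math

beta = model.coefficients   # fitted feature coefficients
mu0 = model.intercept       # intercept on the log(survival time) scale
sigma = model.scale         # fitted Weibull scale parameter

def reorder_probability(features, t_days):
    """P(next order within t_days) = 1 - S(t_days) under the Weibull AFT model."""
    mu = mu0 + float(beta.dot(features))  # location on the log-time scale
    z = (math.log(t_days) - mu) / sigma   # standardized log time
    return 1.0 - math.exp(-math.exp(z))   # CDF = 1 - survival probability

# e.g. probability that the first sample cart is reordered within a week
p = reorder_probability(Vectors.dense(1.560, -0.605), t_days=7.0)
```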