I am trying to predict the probability that a customer will reorder a given order cart, based on their order history, using the Accelerated Failure Time (AFT) model in PySpark. My input data contains:
- various features of the customer and the respective order cart as predictors (assembled into a single feature vector, as sketched after this list),
- the number of days between two consecutive orders as the label, and
- previously observed orders marked as uncensored and future orders as censored.
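
In the real pipeline the predictor columns are combined into a single vector column before training. A minimal sketch of that step, assuming hypothetical raw columns named `avg_basket_size` and `days_since_signup` on a `raw_orders` DataFrame:

```python
from pyspark.ml.feature import VectorAssembler

# Combine hypothetical raw predictor columns into the "features" vector
# expected by AFTSurvivalRegression; the real data has many more columns.
assembler = VectorAssembler(
    inputCols=["avg_basket_size", "days_since_signup"],
    outputCol="features")
training = assembler.transform(raw_orders)
```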
PySpark is the choice here because of restrictions on the environment; I have no other way to process the huge volume of order history (~40 GB). Here's my sample implementation:
```python
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

# `spark` is an existing SparkSession.
# label  = days between two consecutive orders
# censor = 1.0 for observed (uncensored) orders, 0.0 for censored ones
training = spark.createDataFrame([
    (1, 1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (1, 2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (2, 3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (2, 0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (3, 4.199, 0.0, Vectors.dense(0.795, -0.226))],
    ["customer_id", "label", "censor", "features"])

aft = AFTSurvivalRegression()
model = aft.fit(training)
```
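
For reference, transforming the training data adds a `prediction` column (the expected time to the next order, in the label's units), and I believe survival-time quantiles can also be requested through the `quantileProbabilities` and `quantilesCol` parameters, although I am not sure this is the right route to reorder probabilities:

```python
# Request survival-time quantiles alongside the point prediction.
aft = AFTSurvivalRegression(quantileProbabilities=[0.25, 0.5, 0.75],
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).select("customer_id", "prediction", "quantiles").show()
```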
Questions:
- Does the AFTSurvivalRegression class in pyspark.ml.regression have the ability to group (cluster) the records in my dataset based on customer_id, so that each customer's order history is modelled separately? If so, how can I implement this?
- The desired output should contain the probabilities of a particular customer reordering different order carts. How can I obtain these values by extending the implementation above? My current understanding of how they might be derived is sketched below.
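
For context, here is how I currently think the probability could be computed by hand. Since Spark's AFTSurvivalRegression fits a Weibull AFT model, the survival function for a cart with feature vector x should be S(t | x) = exp(-exp((ln t - (intercept + coefficients·x)) / scale)), so the probability of a reorder within t days would be 1 - S(t | x). A minimal sketch under that assumption, with an arbitrary 7-day horizon; I am not certain this is the intended way to use the fitted model:

```python
import math

beta = model.coefficients   # fitted feature coefficients
mu0 = model.intercept       # intercept on the log(survival time) scale
sigma = model.scale         # fitted Weibull scale parameter

def reorder_probability(features, t_days):
    """P(next order within t_days) = 1 - S(t_days) under the Weibull AFT model."""
    mu = mu0 + float(beta.dot(features))  # location on the log-time scale
    z = (math.log(t_days) - mu) / sigma   # standardized log time
    return 1.0 - math.exp(-math.exp(z))   # CDF = 1 - survival probability

# e.g. probability that the first sample cart is reordered within a week
p = reorder_probability(Vectors.dense(1.560, -0.605), t_days=7.0)
```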