
I am trying to predict the probability that a customer reorders an order cart, given their order history, using an Accelerated Failure Time (AFT) model in PySpark. My input data contains:

  • various features of the customer and the respective order cart as predictors,
  • the number of days between two consecutive orders as the label, and
  • previously observed orders marked as uncensored and future orders as censored.

PySpark is the choice here because of restrictions in my environment; I have no other way to process the large volume of order history (~40 GB). Here's my sample implementation:

> from pyspark.ml.regression import AFTSurvivalRegression
> from pyspark.ml.linalg import Vectors
>
> # Assumes an active SparkSession named `spark`
> training = spark.createDataFrame([
>     (1, 1.218, 1.0, Vectors.dense(1.560, -0.605)),
>     (1, 2.949, 0.0, Vectors.dense(0.346, 2.158)),
>     (2, 3.627, 0.0, Vectors.dense(1.380, 0.231)),
>     (2, 0.273, 1.0, Vectors.dense(0.520, 1.151)),
>     (3, 4.199, 0.0, Vectors.dense(0.795, -0.226))],
>     ["customer_id", "label", "censor", "features"])
>
> # The defaults already match these column names:
> # labelCol="label", censorCol="censor", featuresCol="features"
> aft = AFTSurvivalRegression()
> model = aft.fit(training)

Questions:

  1. Does the AFTSurvivalRegression class in pyspark.ml.regression have the ability to group the records in my dataset by customer id? If so, how can this be implemented? (One possible workaround is sketched after this list.)
  2. The desired output would contain the probabilities of a particular customer reordering different order carts. How can I obtain these values by extending my implementation? (A sketch of one approach also follows below.)
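
Regarding question 1: as far as I can tell, AFTSurvivalRegression has no grouping or clustering parameter; it fits a single global model. One workaround I have considered is fitting a separate model per customer. A minimal sketch, assuming the `training` DataFrame from above (the name `per_customer_models` is just illustrative):

> # Fit one AFT model per customer; AFTSurvivalRegression itself
> # has no grouping parameter, so we partition the data manually.
> per_customer_models = {}
> customer_ids = [row.customer_id
>                 for row in training.select("customer_id").distinct().collect()]
> for cid in customer_ids:
>     subset = training.filter(training.customer_id == cid)
>     per_customer_models[cid] = AFTSurvivalRegression().fit(subset)

With only a handful of orders per customer this would overfit badly (or fail to converge), so encoding customer id into the feature vector and fitting one global model may be the more practical route.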
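Regarding question 2: Spark's AFT model is a Weibull AFT, and `model.transform()` adds a `prediction` column. Assuming `prediction` is the Weibull scale parameter λ = exp(w·x + intercept) and `model.scale` is σ, giving shape k = 1/σ (this matches how the quantiles are documented), the probability of a reorder within t days would be the Weibull CDF F(t) = 1 − exp(−(t/λ)^k). A sketch, where the 7-day horizon and the column name `prob_reorder_7d` are arbitrary choices of mine:

> from pyspark.sql import functions as F
>
> t_days = 7.0           # hypothetical horizon: P(reorder within 7 days)
> k = 1.0 / model.scale  # Weibull shape, assuming sigma = model.scale
>
> # Weibull CDF: 1 - exp(-(t / lambda)^k), with lambda = prediction
> scored = model.transform(training).withColumn(
>     "prob_reorder_7d",
>     1.0 - F.exp(-F.pow(F.lit(t_days) / F.col("prediction"), F.lit(k))))

Alternatively, the built-in `setQuantileProbabilities`/`setQuantilesCol` params give the time by which a reorder happens with a given probability, which is the inverse of the same curve.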