I have a PySpark DataFrame with a huge number of rows (80-100 million). I am running inference with a model on it to obtain the model score (probability) for each row, as in the code below:
import tensorflow as tf
from tensorflow import keras
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(*features):
        # Reassemble the feature columns into a single pandas DataFrame
        X = pd.concat(features, axis=1)
        X.columns = col_names
        # Create a batched tf.data dataset
        ds = tf.data.Dataset.from_tensor_slices(dict(X))
        ds = ds.batch(1024)
        # Load the model and score the batch
        nn = keras.models.load_model('all_models/model_1')
        prob = nn.predict(ds)
        return pd.Series(prob[:, 0])

    df = df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )
    return df

df = spark.table('p13n_features_explore.ss_ltr_low_confidence_pairs_with_features_amalgam')
scores = inference(df)
scores.write.mode('overwrite').parquet('gs://p13n-storage2/data/features/smart_subs/lcp_pairs_amalgam')
The entire execution of this code, up to and including the write to the output path, takes around 22 minutes. I now want to ensemble ten models, i.e. run inference with ten models and take the mean of their scores as the final probability, as in the code below:
import tensorflow as tf
from tensorflow import keras
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(*features):
        # Reassemble the feature columns into a single pandas DataFrame
        X = pd.concat(features, axis=1)
        X.columns = col_names
        # Create a batched tf.data dataset
        ds = tf.data.Dataset.from_tensor_slices(dict(X))
        ds = ds.batch(1024)
        # Load each of the ten models and score the batch with it
        prob_scores_list = []
        for i in range(1, 11):
            nn = keras.models.load_model(f'all_models/model_{i}')
            prob = nn.predict(ds)
            prob_scores_list.append(prob[:, 0])
        # Average the ten score vectors to get the ensemble probability
        prob_scores = np.array(prob_scores_list)
        return pd.Series(prob_scores.mean(axis=0))

    df = df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )
    return df

df = spark.table('p13n_features_explore.ss_ltr_low_confidence_pairs_with_features_amalgam')
scores = inference(df)
scores.write.mode('overwrite').parquet('gs://p13n-storage2/data/features/smart_subs/lcp_pairs_amalgam')
But this version takes a massive amount of time, around 5 hours end to end. How can I bring it down to under an hour? Can the loop be removed and the implementation vectorized somehow?
What are the various methods for making multi-model ensemble inference on enormous data faster in PySpark?
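For reference, one direction I have been sketching (rough and not benchmarked, so I may be off) is switching to the iterator-style pandas UDF so the ten models are loaded once per task instead of on every Arrow batch, and stacking the per-model score vectors with NumPy so the averaging is vectorized:

from typing import Iterator, Tuple

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(batches: Iterator[Tuple[pd.Series, ...]]) -> Iterator[pd.Series]:
        # Load all ten models once per task, not once per Arrow batch
        models = [keras.models.load_model(f'all_models/model_{i}') for i in range(1, 11)]
        for features in batches:
            X = pd.concat(features, axis=1)
            X.columns = col_names
            ds = tf.data.Dataset.from_tensor_slices(dict(X)).batch(1024)
            # Stack the ten score vectors column-wise and average in one step
            probs = np.column_stack([m.predict(ds, verbose=0)[:, 0] for m in models])
            yield pd.Series(probs.mean(axis=1))

    return df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )

Even with this, the ten predict calls per batch still run sequentially, so I am not sure it alone gets the job under an hour, hence the question.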
Thanks.