I have a PySpark DataFrame with a huge number of rows (80-100 million). I am running inference with a model on it to obtain the model score (probability) for each row, as in the code below:
import tensorflow as tf
from tensorflow import keras
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(*features):
        # Reassemble the feature columns into a single pandas DataFrame
        X = pd.concat(features, axis=1)
        X.columns = col_names
        # Create a batched tf.data dataset
        ds = tf.data.Dataset.from_tensor_slices(dict(X))
        ds = ds.batch(1024)
        # Load the model and score the batch
        nn = keras.models.load_model('all_models/model_1')
        prob = nn.predict(ds)
        return pd.Series(prob[:, 0])

    df = df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )
    return df

df = spark.table('p13n_features_explore.ss_ltr_low_confidence_pairs_with_features_amalgam')
scores = inference(df)
scores.write.mode('overwrite').parquet('gs://p13n-storage2/data/features/smart_subs/lcp_pairs_amalgam')
The entire execution of this code, up to and including the write to the output path, takes around 22 minutes. I now want to ensemble ten models, i.e. run inference with ten models and take the mean of their scores as the final probability, as in the code below:
import tensorflow as tf
from tensorflow import keras
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(*features):
        # Reassemble the feature columns into a single pandas DataFrame
        X = pd.concat(features, axis=1)
        X.columns = col_names
        # Create a batched tf.data dataset
        ds = tf.data.Dataset.from_tensor_slices(dict(X))
        ds = ds.batch(1024)
        # Load each of the ten models and score the batch with it
        prob_scores_list = []
        for i in range(1, 11):
            nn = keras.models.load_model(f'all_models/model_{i}')
            prob = nn.predict(ds)
            prob_scores_list.append(prob[:, 0])
        # Average the ten score vectors to get the ensemble probability
        prob_scores = np.array(prob_scores_list)
        return pd.Series(prob_scores.mean(axis=0))

    df = df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )
    return df

df = spark.table('p13n_features_explore.ss_ltr_low_confidence_pairs_with_features_amalgam')
scores = inference(df)
scores.write.mode('overwrite').parquet('gs://p13n-storage2/data/features/smart_subs/lcp_pairs_amalgam')
But this version takes a massive amount of time, around 5 hours end to end. How can I bring it down to under an hour? Can the loop be removed and the implementation vectorized somehow?
What are the various methods for making multi-model ensemble inference on enormous data faster in PySpark?
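For reference, one direction I have been sketching (rough and not benchmarked, so I may be off) is switching to the iterator-style pandas UDF so the ten models are loaded once per task instead of on every Arrow batch, and stacking the per-model score vectors with NumPy so the averaging is vectorized:

from typing import Iterator, Tuple

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf

def inference(df):
    col_names = df.columns

    @pandas_udf(returnType=T.DoubleType())
    def predict_pandas_udf(batches: Iterator[Tuple[pd.Series, ...]]) -> Iterator[pd.Series]:
        # Load all ten models once per task, not once per Arrow batch
        models = [keras.models.load_model(f'all_models/model_{i}') for i in range(1, 11)]
        for features in batches:
            X = pd.concat(features, axis=1)
            X.columns = col_names
            ds = tf.data.Dataset.from_tensor_slices(dict(X)).batch(1024)
            # Stack the ten score vectors column-wise and average in one step
            probs = np.column_stack([m.predict(ds, verbose=0)[:, 0] for m in models])
            yield pd.Series(probs.mean(axis=1))

    return df.withColumn(
        'probability',
        predict_pandas_udf(*(F.col(c) for c in df.columns))
    )

Even with this, the ten predict calls per batch still run sequentially, so I am not sure it alone gets the job under an hour, hence the question.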
Thanks.