I have a function `GiniLib` with 3 input arguments. I'd like to apply this function to many columns of my PySpark DataFrame. Since it's very slow, I'd like to parallelize it with either `Pool` from multiprocessing or `Parallel` from joblib.
import pyspark.pandas as ps
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def GiniLib(data: ps.DataFrame, target_col, obs_col):
    evaluator = BinaryClassificationEvaluator()
    evaluator.setRawPredictionCol(obs_col)
    evaluator.setLabelCol(target_col)
    auc = evaluator.evaluate(data, {evaluator.metricName: "areaUnderROC"})
    gini = 2 * auc - 1.0
    return auc, gini
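For reference, the Gini coefficient here is just a rescaling of AUC (`gini = 2 * auc - 1`). A pure-Python sketch of the same computation (no Spark, AUC via the Mann-Whitney rank statistic) is handy for sanity-checking the formula on small samples; the function name `auc_and_gini` is mine, not from any library:

```python
def auc_and_gini(labels, scores):
    """Return (auc, gini) for binary labels and raw prediction scores.

    AUC is the fraction of (positive, negative) pairs where the
    positive example gets the higher score; ties count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return auc, 2 * auc - 1.0

# A perfectly separating score column gives AUC = 1.0, Gini = 1.0:
print(auc_and_gini([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # -> (1.0, 1.0)
```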
col_names = df.columns
for i in col_names:
    print(GiniLib(df.select(i, target_name), target_name, i))
This loop is very slow. I tried the following instead, but I get an error.
from multiprocessing.pool import Pool

if __name__ == '__main__':
    with Pool() as pool:
        args = [(df.select(i, target_name), target_name, i) for i in col_names]
        for res in pool.starmap(GiniLib, args):
            print(res)
The error I get: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How can I make this calculation faster? Is there another way to compute the built-in AUC metric over many columns more quickly?
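One thing worth noting: `multiprocessing` pickles its arguments, and a Spark DataFrame carries a reference to the `SparkContext`, which is exactly what the SPARK-5063 error forbids. Driver-side threads avoid pickling entirely, and Spark allows concurrent job submission from multiple threads on the driver. A minimal sketch of that pattern (the helper `evaluate_columns` is my own name; `df`, `target_name`, `col_names`, and `GiniLib` are assumed from the code above):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_columns(eval_fn, cols, max_workers=8):
    """Fan an evaluation function out over column names using
    driver-side threads. Thread arguments are not pickled, so a
    closure over a Spark DataFrame is safe here, unlike with
    multiprocessing.Pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return dict(zip(cols, ex.map(eval_fn, cols)))

# Intended usage with the question's names (not run here):
# results = evaluate_columns(
#     lambda c: GiniLib(df.select(c, target_name), target_name, c),
#     col_names)
```

Whether this actually speeds things up depends on the cluster having idle capacity; the threads only overlap job scheduling and I/O, while the per-job work still runs on the executors.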