I am trying to run some code on a spark kubernetes cluster
"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"
The code I am trying to run is the following
def getMax(row, subtract):
'''
getMax takes two parameters -
row: array with parameters
subtract: normal value of the parameter
It outputs the value most distant from the normal
'''
try:
row = np.array(row)
out = row[np.argmax(row-subtract)]
except ValueError:
return None
return out.item()
from pyspark.sql.types import FloatType
udf_getMax = F.udf(getMax, FloatType())
The dataframe I am passing is as below
However I am getting the following error
ModuleNotFoundError: No module named 'numpy'
When I did a stackoverflow serach I could find similar issue of numpy import error in spark in yarn.
ImportError: No module named numpy on spark workers
And the crazy part is I am able to import numpy outside and
import numpy as np
command outside the function is not getting any errors.
Why is this happening? How to fix this or how to go forward. Any help is appreciated.
Thank you