I am trying to run some code on a Spark Kubernetes cluster, using the following container image setting:

"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"

The code I am trying to run is the following:

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters:
    row: array of parameter values
    subtract: normal value of the parameter
    It outputs the value most distant from the normal.
    '''
    try:
        row = np.array(row)
        # Use the absolute deviation so "most distant" covers values
        # below the normal as well as above it.
        out = row[np.argmax(np.abs(row - subtract))]
    except ValueError:
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())

The dataframe I am passing is shown below:

(screenshot of the dataframe, not reproduced here)
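
Since the screenshot is not reproduced here, a hypothetical application of the UDF is sketched below; the column name readings and the normal value 5.0 are assumptions, not taken from the original dataframe:

from pyspark.sql import functions as F

# "readings" is assumed to be an array<float> column; the normal value
# is passed as a literal. Both are placeholders for the real schema.
df_out = df.withColumn(
    "max_deviation",
    udf_getMax(F.col("readings"), F.lit(5.0)),
)
df_out.show()  # triggers execution on the workers, where the error appears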

However, I am getting the following error:

ModuleNotFoundError: No module named 'numpy'

When I searched Stack Overflow, I found a similar issue with a numpy import error in Spark on YARN:

ImportError: No module named numpy on spark workers

And the strange part is that I am able to import numpy outside the function: the

import numpy as np

statement does not raise any errors.

Why is this happening? How can I fix it, or how should I proceed? Any help is appreciated.

Thank you

  • It just means that your worker nodes do not have numpy installed; you can either go in and install it, or you can distribute the numpy library on invocation, as mentioned in your link. – Richard Nemeth Feb 18 '20 at 05:50
  • https://jcrist.github.io/venv-pack/spark.html will help you pack the virtualenv, and it has a link showing how to ship your virtualenv. – E.ZY. Feb 18 '20 at 16:34
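
A minimal sketch of the first comment's suggestion, baking numpy into the worker image, assuming the base image ships pip and that you can push to a registry of your own (the derived image name below is a placeholder, not a published image):

# Build and push a derived image that includes numpy, e.g. from a
# Dockerfile such as:
#
#   FROM kublr/spark-py:2.4.0-hadoop-2.6
#   RUN pip install numpy
#
# then point the session at that image instead of the stock one.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kubernetes.container.image",
            "my-registry/spark-py-numpy:2.4.0")  # placeholder name
    .getOrCreate()
)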

0 Answers