I have written the following PySpark code.

from pyspark.sql import SparkSession
import sys
import sklearn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print (sys.version_info)

When I run with:

spark-submit --master yarn --deploy-mode client test.py

it executes correctly. However, when I change --deploy-mode to cluster, i.e.:

spark-submit --master yarn --deploy-mode cluster test.py

I see the following error. I have no idea why this happens or how to resolve it.

ImportError: No module named sklearn

I have seen this post, but it did not help me.

1 Answer

--deploy-mode client runs the driver on the machine from which you submit your Spark application, and that machine evidently has the sklearn package installed. With --deploy-mode cluster, however, YARN launches the driver on one of the cluster nodes, so you don't know in advance which machine will host the driver, and the node that was picked does not have sklearn installed, hence the error you're seeing. The solution is to install the sklearn package on every node of your cluster, or to ship the Python environment together with your job.
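As a rough sketch of both approaches (the environment name sklearn_env, the #environment alias, and the assumption that pip, conda, and conda-pack are available on your machines are illustrative, not taken from your setup):

# Option 1: install scikit-learn on every node of the cluster
# (run this on each node, e.g. via your cluster's configuration management tooling)
pip install scikit-learn

# Option 2: pack a conda environment and ship it with the job
conda create -y -n sklearn_env python scikit-learn conda-pack
conda activate sklearn_env
conda pack -f -o sklearn_env.tar.gz

spark-submit --master yarn --deploy-mode cluster \
  --archives sklearn_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  test.py

With the second approach, YARN unpacks the archive into the container working directory under the environment alias on both the driver and the executors, so the job no longer depends on what happens to be installed on the individual nodes.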

pltc