We're using a bootstrap script to install Python libraries on the EMR cluster nodes for our Spark jobs. The script looks something like this:
sudo python3 -m pip install pandas==0.22.0 scikit-learn==0.21.0
Once the cluster is up, we use Airflow's SparkSubmitHook to submit jobs to EMR.
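The submission step boils down to something like this (a minimal sketch; the conn_id, job name, and script path are placeholders, not our real values):

from airflow.contrib.hooks.spark_submit_hook import SparkSubmitHook

# Placeholder values -- the real connection and paths differ
hook = SparkSubmitHook(
    conn_id="spark_default",  # Airflow connection pointing at the EMR master
    name="sklearn_job",       # hypothetical job name
)
hook.submit(application="s3://my-bucket/jobs/train_model.py")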
We also set a cluster configuration to bind PySpark to python3.
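It's the usual spark-env classification, along these lines (the interpreter path below is the EMR default, so treat the exact values as an approximation of ours):

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]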
The problem is that once in a while, when a job starts running, we get a ModuleNotFoundError: No module named 'sklearn' error. One such stack trace is shown below:
return self.loads(obj)
File "/mnt1/yarn/usercache/root/appcache/application_1565624418111_0001/container_1565624418111_0001_01_000033/pyspark.zip/pyspark/serializers.py", line 577, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'sklearn'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
The issue is sporadic: out of 10 job submissions, it happens maybe 2-3 times. We're on EMR 5.23.0; I've tried upgrading to 5.26.0 as well, but the same issue persists.
If I go to the cluster nodes and check for the 'missing' package, I can see it's already installed, so it's clearly not an issue with the bootstrap script. That leaves me quite confused, because I have no clue what's going on here. My best guess is that PySpark binds to a different Python version when the job is triggered from Airflow, but that's a shot in the dark.
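A quick way to check that guess from inside a job would be something like this (a diagnostic sketch, not part of our actual code; the app name and partition count are arbitrary):

import importlib.util
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-probe").getOrCreate()
sc = spark.sparkContext

# Interpreter used on the driver
print("driver:", sys.executable, sys.version_info[:3])

def probe(_):
    # Runs on an executor: report its interpreter and whether
    # sklearn is importable from that interpreter
    import importlib.util
    import sys
    yield (sys.executable, tuple(sys.version_info[:3]),
           importlib.util.find_spec("sklearn") is not None)

# 40 partitions is arbitrary -- just enough to hit several executors
print(sorted(set(sc.parallelize(range(1000), 40).mapPartitions(probe).collect())))

Any help is appreciated.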