
I am using Dataproc image version 2.0.x in Google Cloud, since Delta Lake 0.7.0 is available in this image version. However, this image comes with PySpark 3.1.1 by default, and Apache Spark 3.1.1 has not been officially released yet, so there is no version of Delta Lake compatible with Spark 3.1; hence the suggestion to downgrade.

I have tried the following:

pip install --force-reinstall pyspark==3.0.1 

I executed the above command as the root user on the master node of the Dataproc cluster; however, when I check pyspark --version, it still shows 3.1.1.

How do I set the default PySpark version to 3.0.1?


2 Answers


The simplest way to use Spark 3.0 with Dataproc 2.0 is to pin an older Dataproc 2.0 image version (2.0.0-RC22-debian10) that used Spark 3.0, before it was upgraded to Spark 3.1 in the newer Dataproc 2.0 image versions:

gcloud dataproc clusters create $CLUSTER_NAME --image-version=2.0.0-RC22-debian10
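To sanity-check that the pinned image actually runs Spark 3.0, you can SSH into the master node (which Dataproc names $CLUSTER_NAME-m) and print the version; this sketch assumes your default project and zone are already configured:

gcloud compute ssh $CLUSTER_NAME-m --command="spark-submit --version"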

To use Spark version 3.0.1 you need to make sure that the master and worker nodes in the Dataproc cluster have the spark-3.0.1 jars in /usr/lib/spark/jars instead of the 3.1.1 ones.

There are two ways you could do that:

  1. Move the 3.0.1 jars into /usr/lib/spark/jars manually on each node, and remove the 3.1.1 ones. After running pip install for the desired version of pyspark, you can find the Spark jars in ~/.local/lib/python3.8/site-packages/pyspark/jars. Make sure to restart Spark after this: sudo systemctl restart spark* (see the sketch after this list).

  2. You can use Dataproc initialization actions (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions?hl=en) to do the same, so you won't have to SSH into each node and change the jars manually.
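For the manual approach (option 1), a minimal sketch of the jar swap on one node could look like the following; the pip-installed jar path is an assumption and should be adjusted to wherever pip actually placed the 3.0.1 jars on your node:

     # Illustrative only: adjust PYSPARK_JARS to the actual pip install location.
     PYSPARK_JARS=~/.local/lib/python3.8/site-packages/pyspark/jars

     # Remove the 3.1.1 jars and replace them with the pip-installed 3.0.1 jars.
     sudo rm /usr/lib/spark/jars/*
     sudo cp "$PYSPARK_JARS"/* /usr/lib/spark/jars/

     # Restart Spark services so they pick up the new jars.
     sudo systemctl restart spark*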

Steps for the init actions approach:

  1. Upload the updated Spark jars to a GCS folder, e.g., gs://<bucket>/lib-updates, which has the same structure as the /usr/lib/ directory of the cluster nodes.

  2. Write an init actions script that syncs updates from GCS to the local /usr/lib/ and then restarts Spark services. Upload the script to GCS, e.g., gs://<bucket>/init-actions-update-libs.sh.

     #!/bin/bash

     set -o nounset
     set -o errexit
     set -o xtrace
     set -o pipefail

     # The GCS folder of lib updates, passed in via cluster metadata.
     LIB_UPDATES=$(/usr/share/google/get_metadata_value attributes/lib-updates)

     # Sync updated libraries from $LIB_UPDATES to /usr/lib/.
     gsutil rsync -r -e "$LIB_UPDATES" /usr/lib/

     # Restart Spark services (systemctl supports glob patterns on unit names).
     systemctl restart spark*
    
  3. Create a cluster with --initialization-actions $INIT_ACTIONS_UPDATE_LIBS and --metadata lib-updates=$LIB_UPDATES (a full example command is shown below).
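Put together, cluster creation might look like this sketch; the bucket, cluster name, and region are placeholders to replace with your own values:

     LIB_UPDATES=gs://<bucket>/lib-updates/
     INIT_ACTIONS_UPDATE_LIBS=gs://<bucket>/init-actions-update-libs.sh

     gcloud dataproc clusters create $CLUSTER_NAME \
         --region=$REGION \
         --image-version=2.0 \
         --initialization-actions=$INIT_ACTIONS_UPDATE_LIBS \
         --metadata=lib-updates=$LIB_UPDATES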
