
I am using Dataproc image version 2.0.x in Google Cloud, since Delta Lake 0.7.0 is available in this image version. However, this image comes with PySpark 3.1.1 by default, and Apache Spark 3.1.1 has not been officially released yet, so there is no version of Delta Lake compatible with Spark 3.1; hence the suggestion to downgrade.

I have tried the following:

pip install --force-reinstall pyspark==3.0.1 

I executed the above command as the root user on the master node of the Dataproc cluster; however, when I check pyspark --version, it still shows 3.1.1.

How do I set the default PySpark version to 3.0.1?


2 Answers


The simplest way to use Spark 3.0 with Dataproc 2.0 is to pin an older Dataproc 2.0 image version (2.0.0-RC22-debian10) that used Spark 3.0, before it was upgraded to Spark 3.1 in the newer Dataproc 2.0 image versions:

gcloud dataproc clusters create $CLUSTER_NAME --image-version=2.0.0-RC22-debian10
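To sanity-check that the pinned image actually runs Spark 3.0, you can SSH into the master node (which Dataproc names $CLUSTER_NAME-m) and print the version; this sketch assumes your default project and zone are already configured:

gcloud compute ssh $CLUSTER_NAME-m --command="spark-submit --version"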

To use Spark version 3.0.1 you need to make sure that the master and worker nodes in the Dataproc cluster have the spark-3.0.1 jars in /usr/lib/spark/jars instead of the 3.1.1 ones.

There are two ways you could do that:

  1. Move the 3.0.1 jars into /usr/lib/spark/jars manually on each node, and remove the 3.1.1 ones. After running pip install for the desired version of pyspark, you can find the Spark jars in ~/.local/lib/python3.8/site-packages/pyspark/jars. Make sure to restart Spark after this: sudo systemctl restart spark* (see the sketch after this list).

  2. You can use Dataproc initialization actions (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions?hl=en) to do the same, so you won't have to SSH into each node and change the jars manually.
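For the manual approach (option 1), a minimal sketch of the jar swap on one node could look like the following; the pip-installed jar path is an assumption and should be adjusted to wherever pip actually placed the 3.0.1 jars on your node:

     # Illustrative only: adjust PYSPARK_JARS to the actual pip install location.
     PYSPARK_JARS=~/.local/lib/python3.8/site-packages/pyspark/jars

     # Remove the 3.1.1 jars and replace them with the pip-installed 3.0.1 jars.
     sudo rm /usr/lib/spark/jars/*
     sudo cp "$PYSPARK_JARS"/* /usr/lib/spark/jars/

     # Restart Spark services so they pick up the new jars.
     sudo systemctl restart spark*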

Steps for the init actions approach:

  1. Upload the updated Spark jars to a GCS folder, e.g., gs://<bucket>/lib-updates, which has the same structure as the /usr/lib/ directory of the cluster nodes.

  2. Write an init actions script that syncs updates from GCS to the local /usr/lib/ and then restarts Spark services. Upload the script to GCS, e.g., gs://<bucket>/init-actions-update-libs.sh.

     #!/bin/bash

     set -o nounset
     set -o errexit
     set -o xtrace
     set -o pipefail

     # The GCS folder of lib updates, passed in via cluster metadata.
     LIB_UPDATES=$(/usr/share/google/get_metadata_value attributes/lib-updates)

     # Sync updated libraries from $LIB_UPDATES to /usr/lib/.
     gsutil rsync -r -e "$LIB_UPDATES" /usr/lib/

     # Restart Spark services (systemctl supports glob patterns on unit names).
     systemctl restart spark*
    
  3. Create a cluster with --initialization-actions $INIT_ACTIONS_UPDATE_LIBS and --metadata lib-updates=$LIB_UPDATES (a full example command is shown below).
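Put together, cluster creation might look like this sketch; the bucket, cluster name, and region are placeholders to replace with your own values:

     LIB_UPDATES=gs://<bucket>/lib-updates/
     INIT_ACTIONS_UPDATE_LIBS=gs://<bucket>/init-actions-update-libs.sh

     gcloud dataproc clusters create $CLUSTER_NAME \
         --region=$REGION \
         --image-version=2.0 \
         --initialization-actions=$INIT_ACTIONS_UPDATE_LIBS \
         --metadata=lib-updates=$LIB_UPDATES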
