I've just started using Dataproc to do machine learning on big data stored in BigQuery. When I try to run this code:

df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare') 

I get an error that includes this message:

java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
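
(For context, this error means Spark cannot resolve the 'bigquery' data source, i.e. the spark-bigquery connector jar is not on the classpath. A minimal sketch of one way to attach it, assuming you create the SparkSession yourself rather than using a notebook's pre-created session, with the Maven coordinates and the 0.21.0 version taken from the connector's README and the answer below:)

from pyspark.sql import SparkSession

# Pull the connector from Maven when the session starts. The _2.12 Scala
# suffix must match the cluster's Spark build, and the version (0.21.0)
# is an assumption taken from the answer below.
spark = (
    SparkSession.builder
    .appName('bigquery-example')
    .config('spark.jars.packages',
            'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.0')
    .getOrCreate()
)

df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare')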

I found a tutorial in this Git repo: https://github.com/GoogleCloudDataproc/spark-bigquery-connector

But I don't know where to write those scripts or how to run them. Could you help me understand?

Thanks in advance

  • Does this answer your question? [How to add jar dependency to dataproc cluster in GCP?](https://stackoverflow.com/questions/58769692/how-to-add-jar-dependency-to-dataproc-cluster-in-gcp) – Lamanus Nov 25 '21 at 12:14
  • Thanks for the reply. It helped me understand a bit, but I haven't found the solution yet. I need to work in a Jupyter notebook running on the Dataproc cluster. – Kerem Tatlıcı Nov 25 '21 at 13:41
  • The GitHub README contains the steps for enabling the BigQuery API and installing the jars (https://github.com/GoogleCloudDataproc/spark-bigquery-connector). Did you go through the installation steps? – ntr Dec 14 '21 at 17:27

1 Answer

To create the cluster, I opened the GCP console and ran this command:

gcloud dataproc clusters create clusterName \
    --bucket bucketName \
    --region europe-west3 \
    --zone europe-west3-a \
    --master-machine-type n1-standard-16 \
    --master-boot-disk-type pd-ssd \
    --master-boot-disk-size 200 \
    --num-workers 2 \
    --worker-machine-type n1-highmem-16 \
    --worker-boot-disk-size 200 \
    --image-version 2.0-debian10 \
    --max-idle 3600s \
    --optional-components JUPYTER \
    --initialization-actions 'gs://goog-dataproc-initialization-actions-europe-west3/python/pip-install.sh','gs://goog-dataproc-initialization-actions-europe-west3/connectors/connectors.sh' \
    --metadata 'PIP_PACKAGES=pyspark==3.1.2 tensorflow keras elephas==3.0.0',spark-bigquery-connector-version=0.21.0,bigquery-connector-version=1.2.0 \
    --project projectName \
    --enable-component-gateway

The --initialization-actions part of the command is what made it work for me:

--initialization-actions 'gs://goog-dataproc-initialization-actions-europe-west3/python/pip-install.sh','gs://goog-dataproc-initialization-actions-europe-west3/connectors/connectors.sh' \
--metadata 'PIP_PACKAGES=pyspark==3.1.2 tensorflow keras elephas==3.0.0',spark-bigquery-connector-version=0.21.0,bigquery-connector-version=1.2.0
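
The connectors.sh initialization action installs the spark-bigquery connector on every node, at the version given by the spark-bigquery-connector-version metadata key, so the 'bigquery' data source resolves inside the notebook. Once the cluster is up, open JupyterLab through the Component Gateway; assuming the PySpark kernel's pre-created spark session, the read from the question should now work (a quick check against the public Shakespeare sample table):

df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare')
df.show(5)  # print the first few rows to confirm the connector is on the classpath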