
I am using Spark on a Google Cloud Dataproc cluster and I would like to write to Bigtable in a PySpark job. Since no Google connector is available for this, I am simply using the Google Cloud Bigtable client to insert the data and relying on Spark for parallelism. However, I am not able to bundle the google-cloud-python package so that it is accessible on the Dataproc cluster. I downloaded the wheel (.whl) for google-cloud-bigtable and converted it to an egg, but it still does not work.

Is there any example of using the Google Python client in a PySpark job? It would also be really helpful to know how the client can be made available on the cluster.
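
For reference, this is roughly what I am attempting. It is only a minimal sketch: the project, instance, table, and column-family names are placeholders, and it assumes the table and column family already exist and that `google-cloud-bigtable` is importable on the workers (which is exactly the part that fails for me):

```python
from pyspark import SparkContext

PROJECT_ID = 'my-project'        # placeholder
INSTANCE_ID = 'my-instance'      # placeholder
TABLE_ID = 'my-table'            # placeholder; table and column family already exist
COLUMN_FAMILY = 'cf1'            # placeholder


def write_partition(records):
    # Import inside the function so it runs on the worker, which is where
    # the package has to be installed. The client is also created here,
    # because client objects are not serializable and cannot be shipped
    # from the driver.
    from google.cloud import bigtable

    client = bigtable.Client(project=PROJECT_ID)
    instance = client.instance(INSTANCE_ID)
    table = instance.table(TABLE_ID)

    rows = []
    for row_key, value in records:
        row = table.row(row_key.encode('utf-8'))
        row.set_cell(COLUMN_FAMILY, b'value', value.encode('utf-8'))
        rows.append(row)
    if rows:
        table.mutate_rows(rows)


sc = SparkContext()
data = sc.parallelize([('greeting0', 'Hello World!'),
                       ('greeting1', 'Hello Bigtable!')])
data.foreachPartition(write_partition)
```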

MANISH ZOPE
  • I was able to install the `google-cloud-bigtable` client in a dataproc cluster using an [initialization script](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions) with these commands: `sudo apt-get install python-pip python-dev -y` and `sudo pip install google-cloud`. After that, I could submit the bigtable python "hello world" [example](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/bigtable/hello) as a pyspark job like so: `gcloud dataproc jobs submit pyspark main.py --cluster=$CLUSTER -- $PROJECT $BIGTABLE_INSTANCE` – Lefteris S May 21 '18 at 16:07
  • Also, from [this](https://stackoverflow.com/a/47358499/9251751) answer by a googler working on BigTable to an older question, there are no good examples of integrating with PySpark yet but they are "on their radar". – Lefteris S May 21 '18 at 16:08

0 Answers