
I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this?

This is how I am loading the data:

df = spark.read \
  .format('bigquery') \
  .option('table', 'publicdata.samples.shakespeare') \
  .load()

Some transformations:

df_new = df.select("word")

And this is how I am trying to save the data as a new table in my project area:

df_new \
.write \
.mode('overwrite') \
.format('bigquery') \
.save('my_project.some_schema.df_new_table')

Is this even possible? Is there a way to save to BQ directly?

PS: I know this works, but it is not exactly what I am looking for:

df_new \
.write \
.mode('overwrite') \
.format('csv') \
.save('gs://my_bucket/df_new.csv')
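For context, the missing second step of that workaround would be loading the exported file from GCS into BigQuery. A rough sketch using the google-cloud-bigquery client library (the wildcard accounts for Spark writing a directory of part files; bucket and table names are the placeholders from above):

# Load the CSV part files written by Spark into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project='my_project')

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                      # infer the schema from the CSV
    write_disposition='WRITE_TRUNCATE',   # overwrite the table if it exists
)

load_job = client.load_table_from_uri(
    'gs://my_bucket/df_new.csv/*',        # Spark writes a directory of part files
    'my_project.some_schema.df_new_table',
    job_config=job_config,
)
load_job.result()                         # wait for the load job to finish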

Thanks!

Totor
  • Does this solve your problem? https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example – Dagang Aug 30 '19 at 15:55
  • Resources to consider: write a DataFrame to a BigQuery table using the [pandas_gbq](https://pandas-gbq.readthedocs.io/en/latest/) module (https://pandas-gbq.readthedocs.io/en/latest/writing.html); shell out to the bq command line (see the PySpark example); use the BigQuery connector with Spark (https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example); or Google BigQuery support for Spark, SQL, and DataFrames, contributed by Spotify (https://github.com/spotify/spark-bigquery). (A sketch of the pandas-gbq route follows after these comments.) – Stéphane Fréchette Aug 30 '19 at 16:19
  • Found a resource which might help: https://github.com/spotify/spark-bigquery – maogautam Aug 30 '19 at 21:20
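For completeness, a minimal sketch of the pandas-gbq route mentioned in the comments. Caveat: toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in driver memory (table and project names are the placeholders from the question):

# Convert the Spark DataFrame to pandas, then push it with pandas-gbq.
import pandas_gbq

pdf = df_new.toPandas()

pandas_gbq.to_gbq(
    pdf,
    'some_schema.df_new_table',   # dataset.table
    project_id='my_project',
    if_exists='replace',          # mirrors Spark's 'overwrite' mode
)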

2 Answers


Here is the documentation for the BigQuery connector with Spark: https://github.com/GoogleCloudDataproc/spark-bigquery-connector

This is the recommended way:

# Saving the data to BigQuery
word_count.write.format('bigquery') \
  .option('table', 'wordcount_dataset.wordcount_output') \
  .save()

You set the table in `option()` instead of in `save()`.
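Applied to the question's DataFrame, the full call looks roughly like this. Note that with the connector's indirect write method, a GCS bucket for staging must also be supplied via the `temporaryGcsBucket` option ('my_staging_bucket' is a placeholder):

# The connector stages data in GCS before loading it into BigQuery.
df_new.write.format('bigquery') \
  .option('table', 'some_schema.df_new_table') \
  .option('temporaryGcsBucket', 'my_staging_bucket') \
  .mode('overwrite') \
  .save()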

Cristian Ispan
Nathan Nasser
  • It seems they have added this recently! Thanks! – Totor Jan 10 '20 at 14:37
  • Specifying the table as an option is deprecated. Instead, the `path` param of `save` should be used (as is actually done in the question), but the `project` should not be specified in the table path; it is taken either from the service account used or set explicitly via the `parentProject` option. https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties – Mousa Dec 29 '22 at 12:35
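A sketch of the newer style described in the comment above (names remain the question's placeholders; the staging bucket is still needed for indirect writes):

# Newer style: dataset.table goes to save(), project via parentProject.
(df_new.write.format('bigquery')
    .option('parentProject', 'my_project')
    .option('temporaryGcsBucket', 'my_staging_bucket')
    .mode('overwrite')
    .save('some_schema.df_new_table'))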

The following syntax will create or overwrite a table:

df.write.format('bigquery') \
  .option('table', 'project.db.tablename') \
  .mode('overwrite') \
  .save()

For more information, you can refer to the following link: https://dbmstutorials.com/pyspark/spark-dataframe-write-modes.html
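For illustration, the other Spark write modes plug into the same call; a sketch ('append' adds rows, 'ignore' is a no-op if the table exists, and 'errorifexists', the default, fails):

# Same write, but appending to the table instead of overwriting it.
# A temporaryGcsBucket option (or cluster-level conf) may also be
# required for the connector to stage data in GCS.
df.write.format('bigquery') \
  .option('table', 'project.db.tablename') \
  .mode('append') \
  .save()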
