
I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this?

This is how I am loading the data:

df = spark.read \
  .format('bigquery') \
  .option('table', 'publicdata.samples.shakespeare') \
  .load()

Some transformations:

df_new = df.select("word")

And this is how I am trying to save the data as a new table in my project area:

df_new \
.write \
.mode('overwrite') \
.format('bigquery') \
.save('my_project.some_schema.df_new_table')

Is this even possible? Is there a way to save to BQ directly?

PS: I know this works, but it is not exactly what I am looking for:

df_new \
.write \
.mode('overwrite') \
.format('csv') \
.save('gs://my_bucket/df_new.csv')
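For context, the missing second step of that workaround would be loading the exported file from GCS into BigQuery. A rough sketch using the google-cloud-bigquery client library (the wildcard accounts for Spark writing a directory of part files; bucket and table names are the placeholders from above):

# Load the CSV part files written by Spark into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project='my_project')

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                      # infer the schema from the CSV
    write_disposition='WRITE_TRUNCATE',   # overwrite the table if it exists
)

load_job = client.load_table_from_uri(
    'gs://my_bucket/df_new.csv/*',        # Spark writes a directory of part files
    'my_project.some_schema.df_new_table',
    job_config=job_config,
)
load_job.result()                         # wait for the load job to finish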

Thanks!

Totor
  • Does this solve your problem? https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example – Dagang Aug 30 '19 at 15:55
  • Resources to consider: write a DataFrame to a BigQuery table using the [pandas_gbq](https://pandas-gbq.readthedocs.io/en/latest/) module (https://pandas-gbq.readthedocs.io/en/latest/writing.html); shell out to the bq command line (see the PySpark example); use the BigQuery connector with Spark (https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example); or Google BigQuery support for Spark, SQL, and DataFrames, contributed by Spotify (https://github.com/spotify/spark-bigquery). (A sketch of the pandas-gbq route follows after these comments.) – Stéphane Fréchette Aug 30 '19 at 16:19
  • Found a resource which might help: https://github.com/spotify/spark-bigquery – maogautam Aug 30 '19 at 21:20
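For completeness, a minimal sketch of the pandas-gbq route mentioned in the comments. Caveat: toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in driver memory (table and project names are the placeholders from the question):

# Convert the Spark DataFrame to pandas, then push it with pandas-gbq.
import pandas_gbq

pdf = df_new.toPandas()

pandas_gbq.to_gbq(
    pdf,
    'some_schema.df_new_table',   # dataset.table
    project_id='my_project',
    if_exists='replace',          # mirrors Spark's 'overwrite' mode
)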

2 Answers


Here is the documentation for the BigQuery connector with Spark: https://github.com/GoogleCloudDataproc/spark-bigquery-connector

This is the recommended way:

# Saving the data to BigQuery
word_count.write.format('bigquery') \
  .option('table', 'wordcount_dataset.wordcount_output') \
  .save()

You set the table in `option()` instead of in `save()`.
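Applied to the question's DataFrame, the full call looks roughly like this. Note that with the connector's indirect write method, a GCS bucket for staging must also be supplied via the `temporaryGcsBucket` option ('my_staging_bucket' is a placeholder):

# The connector stages data in GCS before loading it into BigQuery.
df_new.write.format('bigquery') \
  .option('table', 'some_schema.df_new_table') \
  .option('temporaryGcsBucket', 'my_staging_bucket') \
  .mode('overwrite') \
  .save()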

Cristian Ispan
Nathan Nasser
  • It seems they have added this recently! Thanks! – Totor Jan 10 '20 at 14:37
  • Specifying the table as an option is deprecated. Instead, the `path` param of `save` should be used (as is actually done in the question), but the `project` should not be specified in the table path; it is taken either from the service account used or set explicitly via the `parentProject` option. https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties – Mousa Dec 29 '22 at 12:35
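A sketch of the newer style described in the comment above (names remain the question's placeholders; the staging bucket is still needed for indirect writes):

# Newer style: dataset.table goes to save(), project via parentProject.
(df_new.write.format('bigquery')
    .option('parentProject', 'my_project')
    .option('temporaryGcsBucket', 'my_staging_bucket')
    .mode('overwrite')
    .save('some_schema.df_new_table'))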

The following syntax will create or overwrite a table:

df.write.format('bigquery') \
  .option('table', 'project.db.tablename') \
  .mode('overwrite') \
  .save()

For more information, you can refer to the following link: https://dbmstutorials.com/pyspark/spark-dataframe-write-modes.html
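For illustration, the other Spark write modes plug into the same call; a sketch ('append' adds rows, 'ignore' is a no-op if the table exists, and 'errorifexists', the default, fails):

# Same write, but appending to the table instead of overwriting it.
# A temporaryGcsBucket option (or cluster-level conf) may also be
# required for the connector to stage data in GCS.
df.write.format('bigquery') \
  .option('table', 'project.db.tablename') \
  .mode('append') \
  .save()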
