
I'm trying to load data from a GCS bucket and publish the content to Pub/Sub and BigQuery. These are my pipeline options:

options = PipelineOptions(
    project=project,
    temp_location="gs://dataflow-example-bucket6721/temp21/",
    region='us-east1',
    job_name="dataflow2-pubsub-09072021",
    machine_type='e2-standard-2',
)

And this is my pipeline:

data = p | 'CreateData' >> beam.Create(sum([fileName()], []))

jsonFile = data | "filterJson" >> beam.Filter(filterJsonfile)

JsonData = jsonFile | "JsonData" >> beam.Map(readFromJson)

split_data = JsonData | 'Split Data' >> ParDo(CheckForValidData()).with_outputs("ValidData", "InvalidData")

ValidData = split_data.ValidData
InvalidData = split_data.InvalidData
data_ = split_data[None]


publish_data = ValidData | "Publish msg" >> ParDo(publishMsg())

ToBQ = ValidData | "To BQ" >> beam.io.WriteToBigQuery(
            table_spec,
            #schema=table_schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

The data flows fine with the InteractiveRunner, but with the DataflowRunner it fails with an error like:

ValueError: Invalid GCS location: None. Writing to BigQuery with FILE_LOADS method requires a GCS location to be provided to write files to be loaded into BigQuery. Please provide a GCS bucket through custom_gcs_temp_location in the constructor of WriteToBigQuery or the fallback option --temp_location, or pass method="STREAMING_INSERTS" to WriteToBigQuery. [while running '[15]: To BQ/BigQueryBatchFileLoads/GenerateFilePrefix']

The error is about a missing GCS location and suggests adding temp_location, but I have already set temp_location in my pipeline options.
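For reference, the two workarounds the error message itself suggests would look roughly like this (sketches only; the extra staging path is a placeholder I picked):

ToBQ = ValidData | "To BQ" >> beam.io.WriteToBigQuery(
    table_spec,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    # explicit GCS staging location for the file-loads path (placeholder path)
    custom_gcs_temp_location="gs://dataflow-example-bucket6721/bq_temp/")

or, avoiding file loads entirely:

ToBQ = ValidData | "To BQ (streaming)" >> beam.io.WriteToBigQuery(
    table_spec,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    # streaming inserts do not need a GCS staging location
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS)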

  • error suggests `custom_gcs_temp_location`, not `temp_location` – furas Jul 10 '21 at 06:11
  • Even with `custom_gcs_temp_location` it is showing the same error @furas – Jigna Chandarana Jul 10 '21 at 06:17
  • is this the FULL error message? Maybe by putting something like `print(...)` between the lines you could localize which line causes the problem. – furas Jul 10 '21 at 06:28
  • Did you try using beam.io.Write with beam.io.BigQuerySink instead of WriteToBigQuery and see whether that works? Or, based on the error, can you please try adding temp_location specifically and check if it works? That is what the error suggests. I have used beam.io.Write and beam.io.BigQuerySink in my pipeline to read data from GCS into BigQuery and never faced this issue. Yours is batch processing anyway? – radhika sharma Jul 10 '21 at 07:13
  • Thanks for your help, furas and radhika sharma. I just did a factory reset of the runtime (in Colab) and tried again, and everything works fine now (previously I was only restarting the runtime). I don't know how, but it works now! – Jigna Chandarana Jul 12 '21 at 04:12
  • When running with Google Dataflow, you'll always need to specify a GCS temp location to store the Dataflow state, so the issue here isn't the bucket you're planning to write to :) – Alex Oct 27 '21 at 19:40

1 Answer


When running your Dataflow pipeline, pass the argument --temp_location gs://bucket/subfolder/ (exactly in this format; create a subfolder inside the bucket) and it should work.
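A minimal sketch of that, reusing the bucket from the question (the script name and exact paths are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# Equivalent to running:
#   python my_pipeline.py --runner DataflowRunner --temp_location gs://dataflow-example-bucket6721/temp21/ ...
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=" + project,
    "--region=us-east1",
    "--temp_location=gs://dataflow-example-bucket6721/temp21/",
])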

Vipul Mehra