
I am developing a Dataflow pipeline that reads a protobuf file from Google Cloud Storage, parses it, and writes the result to a BigQuery table. It works fine when the number of rows is around 20k, but it fails when the number of rows is around 200k. Below is sample code:

Pipeline pipeline = Pipeline.create(options);

// Match the input file pattern on GCS, read the matched files, and parse the protobuf records.
PCollection<PBClass> dataCol = pipeline
        .apply(FileIO.match().filepattern(options.getInputFile()))
        .apply(FileIO.readMatches())
        .apply("Read GPB File", ParDo.of(new ParseGpbFn()));

// Convert each parsed record into TableRows and write them to BigQuery.
dataCol.apply("Transform to Delta", ParDo.of(deltaSchema))
        .apply(Flatten.iterables())
        .apply(
                BigQueryIO
                        //.write()
                        .writeTableRows()
                        .to(deltaSchema.tableSpec)
                        .withMethod(Method.STORAGE_WRITE_API)
                        .withSchema(schema)
                        //.withFormatFunction(irParDeltaSchema)
                        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
                        .withExtendedErrorInfo()
        );

I have tried different combinations of the following methods (one such combination is sketched after the list below):

withMethod
write
withFormatFunction

I have also tried different numbers of workers and different Compute Engine machine types.
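For reference, this is roughly what one of those alternative combinations looks like, switching to write() with a format function and the batch load method; the identity format function and the FILE_LOADS choice below are only illustrative placeholders, not my exact code:

dataCol.apply("Transform to Delta", ParDo.of(deltaSchema))
        .apply(Flatten.iterables())
        .apply(
                BigQueryIO
                        .<TableRow>write()                      // instead of writeTableRows()
                        .to(deltaSchema.tableSpec)
                        .withMethod(Method.FILE_LOADS)          // instead of STORAGE_WRITE_API
                        .withFormatFunction(row -> row)         // placeholder format function
                        .withSchema(schema)
                        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
                        .withExtendedErrorInfo()
        );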

Every time, it gets stuck at the GroupByKey stage and gives the following error:

Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_testjobpackage_<...>, reached max retries: 3, last failed job: null.
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
    org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:322)

Job steps table view and graph view: (screenshots not included).

– ravi
1 Answer

Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_testjobpackage_<...>, reached max retries: 3, last failed job: null.
            org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
            org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
            org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:322)

The error you are receiving, described above, occurs because somewhere in your code the GCS file you specify for loading is malformed; the URI is expected to look something like gs://bucket/path/to/file.
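For example, the file pattern given to FileIO.match() should be a full GCS URI along these lines (the bucket and path below are placeholders):

pipeline.apply(FileIO.match()
        // A well-formed GCS URI: gs://<bucket>/<path>; the values below are placeholders.
        .filepattern("gs://my-bucket/path/to/file.pb"));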

– Eduardo Ortiz
  • Reading from GCS is not a problem; the job is able to read the file from GCS, since it writes successfully to one table with a smaller number of rows but fails to write to the other table where the number of rows is very high. – ravi Nov 03 '21 at 12:33
  • There's a limit to the number of rows that you can write; can you try writing the rows in parts? – Eduardo Ortiz Dec 07 '21 at 17:03