My workflow: Kafka -> Dataflow streaming -> BigQuery

Since low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with DynamicDestinations (one new table every hour, with the current hour as a suffix).
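For illustration, here is a minimal sketch of how such an hourly table ID could be derived. The `HourlyTableName` helper is hypothetical (it is not part of Beam), the `yyyyMMdd_HH` pattern is inferred from the table ID that appears in the error log further down, and the use of UTC is an assumption.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Beam: builds the per-hour table ID.
final class HourlyTableName {
    // Pattern inferred from "mytable_20180302_16"; the UTC zone is an assumption.
    private static final DateTimeFormatter HOUR_SUFFIX =
        DateTimeFormatter.ofPattern("yyyyMMdd_HH").withZone(ZoneOffset.UTC);

    private HourlyTableName() {}

    // Returns baseName followed by "_" and the date and hour of the timestamp.
    static String tableIdFor(String baseName, Instant timestamp) {
        return baseName + "_" + HOUR_SUFFIX.format(timestamp);
    }
}
```

In a DynamicDestinations implementation, a helper like this would typically be called with the element or window timestamp to pick the destination table.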

This BigQueryIO.Write is configured like this:

.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(triggeringFrequency)
.withNumFileShards(100)

The first table is successfully created and is written to. But then the following tables are never created and I get these exceptions:

(99e5cd8c66414e7a): java.lang.RuntimeException: Failed to create load job with id prefix 5047f71312a94bf3a42ee5d67feede75_5295fbf25e1a7534f85e25dcaa9f4986_00001_00023, reached max retries: 3, last failed load job: {
  "configuration" : {
    "load" : {
      "createDisposition" : "CREATE_NEVER",
      "destinationTable" : {
        "datasetId" : "dev_mydataset",
        "projectId" : "myproject-id",
        "tableId" : "mytable_20180302_16"
      },

The first table is created with CREATE_IF_NEEDED as specified, but for the following tables this setting is ignored and CREATE_NEVER is used instead.

I also opened an issue on JIRA.

benjben

2 Answers

According to the documentation of Apache Beam's BigQueryIO, when BigQueryIO.Write.CreateDisposition is set to CREATE_IF_NEEDED, a table schema must be provided, for example with .withSchema().

As stated also in the Dataflow documentation:

Note that if you specify CREATE_IF_NEEDED as the CreateDisposition and you don't supply a TableSchema, the transform may fail at runtime with a java.lang.IllegalArgumentException if the target table does not exist.

The error mentioned in the documentation is not the one you are getting (you get a java.lang.RuntimeException), but the BigQueryIO.Write configuration you shared does not specify any table schema, so the job is prone to fail whenever a table is missing.

So, as a first step, build a TableSchema that matches the data you will load into BigQuery, and pass it with .withSchema(schema):

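import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.List;

List<TableFieldSchema> fields = new ArrayList<>();
// Add one entry per column, for example:
fields.add(new TableFieldSchema().setName("<FIELD_NAME>").setType("<FIELD_TYPE>"));
TableSchema schema = new TableSchema().setFields(fields);

// BigQueryIO.Write configuration plus:
    .withSchema(schema)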
dsesto
    Sorry, I didn't mention it because I thought it wasn't relevant, but I do specify a schema using `.withJsonSchema()`, and it works fine for creating the first table. The problem seems to be that the `CreateDisposition` is ignored for tables other than those in the first pane. Please see [Jonas' comment](https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609) on the JIRA issue for more details. – benjben Mar 21 '18 at 12:47
    It does look like [how the class `WriteTables` handles panes](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L155) may be related to your issue. That's why I think this question can better be tracked in the Apache Beam forum you shared, as I do not know if there's a way to force that each new table is created in the first pane, so that it can use the defined `CreateDisposition`. Once the question is solved, feel free to post an answer in this post. Thanks! – dsesto Mar 24 '18 at 09:36
    I am experiencing the same issue with Beam 2.6.0 and 2.9.0. I am also able to create the first table, but none of the following ones. The error message is exactly the same as in the original post. I have commented on the [JIRA](https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16739264&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16739264) issue with sample code that fails. – mvelusce Jan 10 '19 at 10:12

This is a bug in Beam: https://issues.apache.org/jira/browse/BEAM-3772

It is still open as of version 2.27.

I developed a workaround: a custom PTransform that creates an empty table before the BigQueryIO.Write stage, using the BigQuery Java client. You can use it as inspiration: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
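The core idea of "make sure the table exists before writing" can be sketched independently of Beam and of the BigQuery client. This is not the code from the gist: `TableBackend` and `TableEnsurer` are hypothetical names, and in a real pipeline the backend would be implemented with the BigQuery Java client and called from a DoFn placed before the write.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over the BigQuery client; not a real Beam or BigQuery API.
interface TableBackend {
    boolean exists(String tableId);
    // Assumed to tolerate a table that was created concurrently by another worker.
    void createEmpty(String tableId);
}

// Ensures each destination table exists before rows reach the write stage.
final class TableEnsurer {
    private final TableBackend backend;
    // Tables already checked by this instance, to avoid one API call per element.
    private final Set<String> ensured = ConcurrentHashMap.newKeySet();

    TableEnsurer(TableBackend backend) {
        this.backend = backend;
    }

    void ensure(String tableId) {
        if (ensured.contains(tableId)) {
            return; // already verified, skip the remote call
        }
        if (!backend.exists(tableId)) {
            backend.createEmpty(tableId);
        }
        ensured.add(tableId);
    }
}
```

Caching the checked table IDs keeps the number of API calls proportional to the number of distinct tables rather than the number of elements, which matters when a new table only appears once per hour.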

matthieu.cham