According to "How do we set maximum_bad_records when loading a Bigquery table from dataflow?", there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.

If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?

There must be an easier way of doing this than iterating over the fields in the schema, looking at their types, and checking the class of the corresponding value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.

Update:

My use case is an ETL job that at first will run in batch on JSON logs (one object per line) in Cloud Storage and write to BigQuery, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery, and they also contain parts that aren't even possible to describe in a schema (basically free-form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.

In theory it's not a very difficult process: it takes an object, extracts a few properties (50-100), formats some of them, and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
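In code the transformation step looks roughly like the sketch below; PropertyConfig and the formatter names are simplified placeholders for the real, much larger configuration:

```java
import com.google.api.services.bigquery.model.TableRow;

import java.net.URI;
import java.util.Map;

class RowBuilder {
  // Hypothetical config entry: where to read the value, where to write it, how to format it.
  static class PropertyConfig {
    enum Format { NONE, LOWERCASE, MILLIS_TO_SECONDS, HOSTNAME_FROM_URL }
    String sourceName;
    String targetName;
    Format format = Format.NONE;
  }

  // Builds a TableRow from one parsed JSON object, applying per-property formatting.
  static TableRow toTableRow(Map<String, Object> source, Iterable<PropertyConfig> props) {
    TableRow row = new TableRow();
    for (PropertyConfig prop : props) {
      Object value = source.get(prop.sourceName);
      if (value == null) {
        continue;
      }
      switch (prop.format) {
        case LOWERCASE:
          value = value.toString().toLowerCase();
          break;
        case MILLIS_TO_SECONDS:
          value = ((Number) value).longValue() / 1000.0;  // millisecond epoch -> second epoch
          break;
        case HOSTNAME_FROM_URL:
          value = URI.create(value.toString()).getHost();
          break;
        default:
          break;
      }
      row.set(prop.targetName, value);
    }
    return row;
  }
}
```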

My problem is that the data is messy. With a couple of hundred million objects there are always some that don't look as expected; it's rare, but at these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice versa. Sometimes there's an array or an object where there should be a string.

Ideally I would like to take my TableRow, pass it along with the TableSchema to something, and ask "does this work?".

Since this isn't possible, what I do instead is look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING I run value.toString() before adding it to the TableRow. If it's an INTEGER I check that it's an Integer, Long or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
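The checks I do currently look more or less like this; the comments mark the places where I'm only guessing what BigQuery will accept:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;

import java.math.BigInteger;

class FieldValidator {
  // Best-effort coercion of a value to something BigQuery will hopefully accept
  // for the given field type; returns null when the value looks incompatible.
  static Object coerce(TableFieldSchema field, Object value) {
    if (value == null) {
      return null;
    }
    switch (field.getType()) {
      case "STRING":
        return value.toString();
      case "INTEGER":
        return (value instanceof Integer || value instanceof Long || value instanceof BigInteger)
            ? value : null;
      case "FLOAT":
        // Guess: any Number is fine and can be sent as a double.
        return (value instanceof Number) ? ((Number) value).doubleValue() : null;
      case "TIMESTAMP":
        // Guess: epoch seconds as a number, or an already formatted date string.
        return (value instanceof Number || value instanceof String) ? value : null;
      case "BOOLEAN":
        return (value instanceof Boolean) ? value : null;
      default:
        // RECORD and everything else is where the edge cases start.
        return value;
    }
  }
}
```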

In my experience, which is very limited, the whole pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with only superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere I can see a more detailed error message that would tell me which property it was and what the value was? Without that information it could just as well have said "bad data".

From what I can tell, at least when running in batch mode, Dataflow writes the TableRow objects to a staging area in Cloud Storage and then starts a load job once everything is there. This means that there is nowhere for me to catch any errors; my code is no longer running when BigQuery loads the data. I haven't run any job in streaming mode yet, and I'm not sure how it would be different there; from my (admittedly limited) understanding the basic principle is the same, it's just that the batch size is smaller.

People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?

Theo
  • At this time, I believe the filtering strategy that you are suggesting --- cross-referencing the schema with the row --- is the only option. – Tudor Marian Feb 04 '16 at 23:22
  • I think I'm missing something from this question... If you know the schema of the table you are inserting data to, then your Dataflow job can construct output in the appropriate schema in the first place. There is no need for an additional verification step. – Michael Sheldon Feb 05 '16 at 00:26
  • @MichaelSheldon just because I know the schema doesn't mean that my input data is well-formed. I have input data that needs to be transformed into a shape that fits the target table. I know the target table schema, and I have the transformation, but that's not enough. If a single field in the input is of a data type that does not work with the data type of the target column _the whole pipeline fails_. I can write pages of code that tries to verify that the data conforms to the schema, but if there is a single discrepancy between that verification and BQ's own my pipeline might still fail. – Theo Feb 05 '16 at 06:55
  • @TudorMarian I was afraid of that. How do people work around it? – Theo Feb 05 '16 at 06:57
  • Maybe it helps to know that there are three tables with around a 100 columns in total, which will be loaded with about 300 million rows per day. – Theo Feb 05 '16 at 19:50
  • @Theo Unfortunately we're not aware of a workaround. There is currently no standard way to check a TableRow against a TableSchema programmatically. – jkff Feb 10 '16 at 01:02
  • @jkff ok, thanks for the definitive answer. At least I haven't missed anything obvious in the docs. – Theo Feb 10 '16 at 05:53
  • @Theo, can you say more about how your TableRow objects are constructed? It is often easier to ensure that you are doing sane things when *producing* TableRows rather than when you're writing them. Also note that when running a Streaming Dataflow job, the entire pipeline does not fail when writing to BigQuery. Instead, the specific rows fail with a reasonable error message, and you can update the pipeline with a new version of the code that gracefully handles those errors. – Dan Halperin Feb 10 '16 at 15:44
  • @DanHalperin I've updated the question with a lot more information about my specific use case. Thank you for taking your time with my problem. – Theo Feb 10 '16 at 16:34
  • @Dan Halperin - Could you point me towards the reasonable error message? I just get "org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl insertAll\nINFO: Retrying x inserts into BigQuery". – nsandersen Oct 23 '19 at 10:00

1 Answer

I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
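For example, assuming Jackson's databind is on your classpath, parsing a line into such a map could look like:

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.Map;

class JsonLineParser {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Parse one JSON log line into a generic map that can be checked against the schema.
  static Map<String, Object> parse(String line) throws IOException {
    return MAPPER.readValue(line, new TypeReference<Map<String, Object>>() {});
  }
}
```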

I'd recommend an iterative approach to developing your schema validation, with the following two steps.

  1. Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- for instance, require that the parsed JSON value already be an Integer when the BigQuery schema says INTEGER -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling.

    Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.

    Note that the TableSchema object is not Serializable. We've worked around this by converting the TableSchema to a JSON String in a DoFn or PTransform constructor and converting it back when needed (see the sketch after this list). See the code in BigQueryIO.java that uses the jsonTableSchema variable.

  2. Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
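To make both steps concrete, here is a rough sketch of what such a DoFn could look like; the class name, the tag name and the conversion logic are placeholders you would replace with your own strict validation:

```java
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.TupleTag;

import java.io.IOException;
import java.util.Map;

// Strict converter with a dead-letter side output (names here are made up).
class JsonToTableRowFn extends DoFn<Map<String, Object>, TableRow> {
  // Rows that fail validation go to this tag instead of failing the pipeline.
  static final TupleTag<Map<String, Object>> BAD_RECORDS =
      new TupleTag<Map<String, Object>>() {};

  private final String jsonSchema;       // TableSchema is not Serializable, so ship it as JSON
  private transient TableSchema schema;  // rebuilt on the workers

  JsonToTableRowFn(TableSchema schema) throws IOException {
    this.jsonSchema = JacksonFactory.getDefaultInstance().toString(schema);
  }

  @Override
  public void startBundle(Context c) throws IOException {
    schema = JacksonFactory.getDefaultInstance().fromString(jsonSchema, TableSchema.class);
  }

  @Override
  public void processElement(ProcessContext c) {
    try {
      c.output(convertOrThrow(schema, c.element()));
    } catch (RuntimeException e) {
      // Dead-letter: keep the offending raw record for later inspection.
      c.sideOutput(BAD_RECORDS, c.element());
    }
  }

  // Deliberately strict conversion; only one check is shown here, extend it to
  // cover all the types (and nested RECORDs) in your schema.
  private static TableRow convertOrThrow(TableSchema schema, Map<String, Object> json) {
    TableRow row = new TableRow();
    for (TableFieldSchema field : schema.getFields()) {
      Object value = json.get(field.getName());
      if (value == null) {
        continue;
      }
      if ("STRING".equals(field.getType()) && !(value instanceof String)) {
        throw new IllegalArgumentException(field.getName() + " is not a STRING: " + value);
      }
      row.set(field.getName(), value);
    }
    return row;
  }
}
```

You would apply this with ParDo and withOutputTags so the result is a PCollectionTuple containing both the main TableRow output and the BAD_RECORDS side output, and then write the side output to GCS for later inspection.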

You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner executes the pipeline on your own machine rather than on the Google Cloud Dataflow service, and it uses BigQuery streaming writes. I believe you will get better error messages when those writes fail.
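For example, assuming your main() builds its options with PipelineOptionsFactory, a local debugging run is just a matter of the runner choice (you can also pass --runner=DirectPipelineRunner on the command line):

```java
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

public class LocalRunExample {
  public static void main(String[] args) {
    // Parse the usual command-line options, then force the local runner for debugging.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(DirectPipelineRunner.class);
    // ... build and run your pipeline with these options ...
  }
}
```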

(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)

Finally, in terms of logging information:

  • Definitely check Cloud Logging (by following the Worker Logs link on the logs panel).
  • You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.
Dan Halperin
  • Thanks for taking your time to write such a detailed answer. I started running locally and got everything working, and then switched to running in Dataflow proper. I was really surprised when it didn't work the same way. Then I tried to debug my files by running them locally again, but couldn't since the direct runner doesn't seem to handle large input files (no amount of memory seems to be enough for it to handle one gigabyte of input). I will look at the code you link to and the blog post; these pointers are much appreciated. – Theo Feb 10 '16 at 18:21
  • I'm also curious how you pass a `TableSchema` to a `PTransform` or `DoFn`. It's not serializable, so I've had to jump through some hoops to get it into where it needs to be. – Theo Feb 10 '16 at 18:25
  • It's annoying when objects aren't Serializable. I have solved this in the past by using the `JsonFactory` methods that come with the BigQuery API library: `TableSchema schema = JacksonFactory.getDefaultInstance().fromString(schemaStr, TableSchema.class);` Also updating answer. – Dan Halperin Feb 11 '16 at 19:09
  • I wish I could upvote your answer again when you add more goodness to it. – Theo Feb 11 '16 at 19:41
  • I can also report that I've successfully loaded 200 million rows of data into BigQuery with my Dataflow job after using some of your suggestions. – Theo Feb 11 '16 at 19:43
  • @DanHalperin is this still the best approach? Or have any features been added over past year that would make this easier? – g-eorge Jan 18 '17 at 12:06
  • @g-eorge, looks like here is one of ways to do it: https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix Unfortunatelly (for me) Python SDK doesn't have it yet :(( – KirylP Apr 24 '19 at 18:36