
I have a collection of homogeneous dicts; how do I write them to BigQuery without knowing the schema?

The BigQuerySink requires that I specify the schema when I construct it. But I don't know the schema up front: it's defined by the keys of the dicts I'm trying to write.

Is there a way to have my pipeline infer the schema, and then provide it back (as a side input?) to the sink?

For example:

# Create a PCollection of dicts, something like
# {'field1': 'myval', 'field2': 10}
data = (p | 'generate_data' >> beam.ParDo(CreateData()))

# Infer the schema from the data
# Generates a string for each element (ok to assume all dict keys equal)
# "field1:STRING, field2:INTEGER"
schema = (data
  | 'infer_schema' >> beam.ParDo(InferSchema())
  | 'sample_one'   >> beam.combiners.Sample.FixedSizeGlobally(1))
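
For concreteness, InferSchema could be a simple DoFn along these lines (a rough sketch on my part; it assumes every value is either an int or a string, so other types would need their own mapping):

import apache_beam as beam

class InferSchema(beam.DoFn):
  """Emits a schema string like 'field1:STRING,field2:INTEGER' for each dict."""
  def process(self, element):
    parts = []
    for key, value in sorted(element.items()):
      # Assumption: only int and str values appear; extend the mapping as needed.
      bq_type = 'INTEGER' if isinstance(value, int) else 'STRING'
      parts.append('%s:%s' % (key, bq_type))
    yield ','.join(parts)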

But then, how do I feed the schema as a parameter to the BigQuerySink, and use that in a beam.io.Write?

I know this isn't correct, but what I want to do is:

sink = BigQuerySink(tablename, dataset, project, schema=Materialize(schema))
p | 'write_bigquery' >> beam.io.Write(sink)

tl;dr Is there a way to create and write a BigQuery table from Apache Beam, programmatically inferring the schema from the data?

Greg
    The BigQuery API has an autodetect schema feature. If this is not supported in Beam, it's worth filing an issue against the Beam SDK. – Tim Swast Jun 30 '17 at 16:41
  • Docs for schema autodetection are at https://cloud.google.com/bigquery/docs/schema-detect – Tim Swast Jun 30 '17 at 16:42
  • Thanks Tim. Unfortunately, the API currently rejects a sink without a schema if the table doesn't exist yet. I'll see if I can figure out where to file a feature request to Beam. – Greg Jun 30 '17 at 17:50

2 Answers


Assuming that your schema can change frequently, it may work better for you to keep the data in a more generic form.

For example, your row may consist of a single repeated record (your dictionary entries).

The record schema looks like: key (STRING) | optional string_val (STRING) | optional int_val (INTEGER) | optional double_val (FLOAT) | optional boolean_val (BOOLEAN) | ...

Then you can write queries that scan your records by type. This is somewhat less efficient (because you'll have to scan rows that you might otherwise be able to skip if they were in different columns), but entirely avoids specifying your schema up-front.
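
As a rough sketch (not from the answer itself), that generic layout could be declared with the Beam Python SDK's TableSchema/TableFieldSchema classes; the field and function names here are illustrative. The resulting TableSchema can then be passed as the schema argument of BigQuerySink.

from apache_beam.io.gcp.internal.clients import bigquery

def build_generic_schema():
  """Builds a schema with a single repeated RECORD of typed key/value fields."""
  record = bigquery.TableFieldSchema()
  record.name = 'attributes'
  record.type = 'RECORD'
  record.mode = 'REPEATED'
  for name, bq_type in [('key', 'STRING'), ('string_val', 'STRING'),
                        ('int_val', 'INTEGER'), ('double_val', 'FLOAT'),
                        ('boolean_val', 'BOOLEAN')]:
    field = bigquery.TableFieldSchema()
    field.name = name
    field.type = bq_type
    field.mode = 'REQUIRED' if name == 'key' else 'NULLABLE'
    record.fields.append(field)
  table_schema = bigquery.TableSchema()
  table_schema.fields.append(record)
  return table_schema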

Adam Lydick

For now, the best solution I've come up with is explicitly hardcoding a mapping of dict keys to the BigQuery schema. Two benefits: it works around the must-specify-schema issue, and it lets me filter out of each dict the elements I don't want in the BigQuery table.

SCHEMA = {
  'field1': 'INTEGER',
  'field2': 'STRING',
  ...
}
schema_str = ','.join(['%s:%s' % (k, v) for k, v in SCHEMA.items()])

sink = BigQuerySink(tablename,
        dataset=dataset,
        project=project,
        schema=schema_str,
        write_disposition=BigQueryDisposition.WRITE_TRUNCATE)

(pipeline
  # filters just the keys of each dict to the keys of SCHEMA.
  | 'filter_fields' >> beam.ParDo(FilterFieldKeysDoFn(SCHEMA))
  | 'to_bigquery' >> beam.io.Write(sink))
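
For reference, FilterFieldKeysDoFn is just a DoFn that drops unknown keys; a minimal sketch (assuming each element is a flat dict keyed by field name):

import apache_beam as beam

class FilterFieldKeysDoFn(beam.DoFn):
  """Keeps only the dict entries whose keys appear in the schema mapping."""
  def __init__(self, schema):
    self._keys = set(schema)

  def process(self, element):
    yield {k: v for k, v in element.items() if k in self._keys}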
Greg