I have a collection of homogeneous dicts; how do I write them to BigQuery without knowing the schema in advance?
BigQuerySink requires that I specify the schema when I construct it. But I don't know the schema ahead of time: it's defined by the keys of the dicts I'm trying to write.
Is there a way to have my pipeline infer the schema, and then provide it back (as a side input?) to the sink?
For example:
# Create a PCollection of dicts, something like
# {'field1': 'myval', 'field2': 10}
data = (p
        | 'start' >> beam.Create([None])  # seed element so the generator DoFn has input
        | 'generate_data' >> beam.ParDo(CreateData()))

# Infer the schema from the data.
# Generates a schema string for each element (safe to assume all dicts share
# the same keys), e.g. "field1:STRING, field2:INTEGER"
schema = (data
          | 'infer_schema' >> beam.ParDo(InferSchema())
          | 'sample_one' >> beam.combiners.Sample.FixedSizeGlobally(1))
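For reference, CreateData and InferSchema are my own DoFns; roughly like the sketch below (the type mapping is simplified):

import apache_beam as beam

class CreateData(beam.DoFn):
    # Stand-in for my real source: emits homogeneous dicts.
    def process(self, element):
        yield {'field1': 'myval', 'field2': 10}
        yield {'field1': 'otherval', 'field2': 20}

class InferSchema(beam.DoFn):
    # Turns one dict into a BigQuery schema string,
    # e.g. "field1:STRING, field2:INTEGER".
    def process(self, element):
        bq_types = {str: 'STRING', int: 'INTEGER',
                    float: 'FLOAT', bool: 'BOOLEAN'}
        fields = ['%s:%s' % (name, bq_types.get(type(value), 'STRING'))
                  for name, value in sorted(element.items())]
        yield ', '.join(fields)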
But then, how do I feed the schema as a parameter to the BigQuerySink, and use that sink in a beam.io.Write?
I know this isn't correct, but what I want to do is:
sink = BigQuerySink(tablename, dataset, project, schema=Materialize(schema))
p | 'write_bigquery' >> beam.io.Write(sink)
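To clarify what I mean by "as a side input": for an ordinary ParDo I know I could pass the sampled schema in via beam.pvalue.AsSingleton, something like the sketch below (UseSchema is just a hypothetical name), but BigQuerySink wants the schema string at pipeline-construction time:

class UseSchema(beam.DoFn):
    # Hypothetical DoFn, only to illustrate the side-input mechanism.
    def process(self, element, schema_list):
        # Sample.FixedSizeGlobally(1) yields a one-element list,
        # e.g. ['field1:STRING, field2:INTEGER']
        yield (element, schema_list[0])

tagged = data | 'use_schema' >> beam.ParDo(
    UseSchema(), beam.pvalue.AsSingleton(schema))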
tl;dr Is there a way to create a BigQuery table and write to it from Apache Beam, programmatically inferring the schema from the data?