When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I believe this is what Beam calls a side input? I read the pages they call documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html), which mentions apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but Googling their usage turns up zero results and I can't find any Python example.
How can I read a file from storage from within a ParDo? Or do I load the side input into my pipeline before the ParDo? But then how do I use this second source within the ParDo?
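To make the question concrete, this is roughly what I want to do with the schema once I have it: parse the JSON and pull out the field names to validate against. This is a pure-Python sketch (the schema contents and field names here are made up); in the pipeline itself I assume the file would be read with something like beam.io.ReadFromText on the gs:// path.

```python
import json

# Made-up schema file contents (in reality this lives on GCS as a JSON file)
schema_text = '{"fields": [{"name": "name", "type": "STRING"}, {"name": "age", "type": "INTEGER"}]}'

# What I want to do inside (or before) the ParDo: parse the schema
# and extract the field names that every record must contain.
schema = json.loads(schema_text)
required = [f["name"] for f in schema["fields"]]
print(required)  # → ['name', 'age']
```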
[EDIT]
My main data comes from BQ: beam.io.Read(beam.io.BigQuerySource(...
The side input also comes from BQ, using the same BigQuerySource.
When I then add a step after the main data that takes the other data as a side input, I get some strange errors. I notice that when I apply beam.Map(lambda x: x) to the side input first, it works.
Side input:
schema_data = (p | "read schema data" >> beam.io.Read(beam.io.BigQuerySource(query=f"select * from `{schema_table}` limit 1", use_standard_sql=True, flatten_results=True))
| beam.Map(lambda x: x)
)
Main data:
source_data = (p | "read source data" >> beam.io.Read(beam.io.BigQuerySource(query=f"select {columns} from `{source_table}` limit 10", use_standard_sql=True, flatten_results=True)))
Combining:
validated_records = source_data | 'record validation' >> beam.ParDo(Validate(), pvalue.AsList(schema_data))
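For completeness, my understanding is that the materialized side input (here pvalue.AsList(schema_data)) arrives as an extra argument to the DoFn's process method. Below is a simplified pure-Python sketch of the Validate class (the Beam boilerplate is stripped so it runs standalone, and the field names are made up; in the real pipeline it would extend beam.DoFn and the runner would supply schema_rows):

```python
class Validate:
    # In Beam this would be `class Validate(beam.DoFn):`; the extra
    # `schema_rows` argument is filled in by pvalue.AsList(schema_data),
    # which materializes the side-input PCollection as a Python list.
    def process(self, element, schema_rows):
        # The single BQ schema row's keys tell us which fields are required.
        required = set(schema_rows[0].keys())
        missing = required - set(element.keys())
        if not missing:
            yield element  # record passes validation

# Simulating what the runner would do for one element:
schema_rows = [{"name": None, "age": None}]  # one row from the schema query
record = {"name": "Ada", "age": 36}
print(list(Validate().process(record, schema_rows)))  # → [{'name': 'Ada', 'age': 36}]
```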