
When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be what's called a side input? I read what they call documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and it mentions apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but Googling their usage gives zero results and I can't find any Python example.

How can I read a file from storage from within a ParDo? Or do I load it into my pipeline as a side input before the ParDo, and if so, how do I use that second source within the ParDo?

[EDIT]

My main data comes from BQ: beam.io.Read(beam.io.BigQuerySource(...
The side input also comes from BQ, using the same BigQuerySource.

When I then add a step after the main data that takes the other data as a side input, I get some strange errors. I notice that it works when I apply beam.Map(lambda x: x) to the side input.

side input

schema_data = (p
               | "read schema data" >> beam.io.Read(beam.io.BigQuerySource(
                     query=f"select * from `{schema_table}` limit 1",
                     use_standard_sql=True, flatten_results=True))
               | beam.Map(lambda x: x)
               )

main data

source_data = (p
               | "read source data" >> beam.io.Read(beam.io.BigQuerySource(
                     query=f"select {columns} from `{source_table}` limit 10",
                     use_standard_sql=True, flatten_results=True)))

combining

validated_records = source_data | 'record validation' >> beam.ParDo(Validate(), pvalue.AsList(schema_data))
Thijs
2 Answers


I would use the docs you mention as a library reference and go through the Beam programming guide for more detailed walkthroughs, in particular the side input section. I'll try to help with a couple of examples in which we'll download a BigQuery schema from a public table and upload it to GCS:

bq show --schema bigquery-public-data:usa_names.usa_1910_current > schema.json
gsutil cp schema.json gs://$BUCKET

Our data will be some CSV rows without headers, so we have to use the schema from GCS:

data = [('NC', 'F', 2020, 'Hello', 3200),
        ('NC', 'F', 2020, 'World', 3180)]

Using side inputs

We read the JSON file into a schema PCollection:

schema = (p 
  | 'Read Schema from GCS' >> ReadFromText('gs://{}/schema.json'.format(BUCKET)))

and then we pass it to the ParDo as a side input so that it's broadcast to every worker that executes the DoFn. In this case we can use AsSingleton, as we just want to supply the schema as a single value (bq show --schema writes the schema as one JSON line, so the ReadFromText output contains exactly one element):

(p
  | 'Create Events' >> beam.Create(data) \
  | 'Enrich with side input' >> beam.ParDo(EnrichElementsFn(), pvalue.AsSingleton(schema)) \
  | 'Log elements' >> beam.ParDo(LogElementsFn()))

Now we can access the schema in the process method of EnrichElementsFn:

class EnrichElementsFn(beam.DoFn):
  """Zips data with schema stored in GCS"""
  def process(self, element, schema):
    # schema arrives as the raw JSON string; each entry is a field
    # descriptor such as {"name": ..., "type": ..., "mode": ...}
    field_names = [x['name'] for x in json.loads(schema)]
    yield zip(field_names, element)

Note that it would be better to do the schema processing (to construct field_names) before saving it as a singleton, to avoid duplicated work, but this is just an illustrative example.
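A minimal sketch of that variant; the step names, parsed_schema and EnrichWithFieldNamesFn are made up for illustration:

parsed_schema = (schema
  | 'Parse field names' >> beam.Map(lambda s: [f['name'] for f in json.loads(s)]))

class EnrichWithFieldNamesFn(beam.DoFn):
  """Zips data with field names that were already parsed upstream"""
  def process(self, element, field_names):
    # field_names is already a plain list, so no json.loads per element
    yield zip(field_names, element)

(p
  | 'Create Events' >> beam.Create(data)
  | 'Enrich with parsed side input' >> beam.ParDo(EnrichWithFieldNamesFn(), pvalue.AsSingleton(parsed_schema))
  | 'Log elements' >> beam.ParDo(LogElementsFn()))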


Using start bundle

In this case we don't pass any additional input to the ParDo:

(p
  | 'Create Events' >> beam.Create(data) \
  | 'Enrich with start bundle' >> beam.ParDo(EnrichElementsFn()) \
  | 'Log elements' >> beam.ParDo(LogElementsFn()))

And now we use the Python Client Library (we need to install google-cloud-storage) to read the schema each time that a worker initializes a bundle:

class EnrichElementsFn(beam.DoFn):
  """Zips data with schema stored in GCS"""
  def start_bundle(self):
    # Import here so the dependency is resolved on the workers
    from google.cloud import storage

    # Download the schema once per bundle and cache it on the DoFn instance
    client = storage.Client()
    blob = client.get_bucket(BUCKET).get_blob('schema.json')
    self.schema = blob.download_as_string()

  def process(self, element):
    field_names = [x['name'] for x in json.loads(self.schema)]
    yield zip(field_names, element)
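LogElementsFn is not defined above; an assumed implementation (the full code linked below may differ) simply logs each element and passes it through:

import logging

class LogElementsFn(beam.DoFn):
  """Logs each element it receives (assumed implementation)"""
  def process(self, element):
    logging.info(element)
    yield element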

The output is the same in both cases:

INFO:root:[(u'state', 'NC'), (u'gender', 'F'), (u'year', 2020), (u'name', 'Hello'), (u'number', 3200)]
INFO:root:[(u'state', 'NC'), (u'gender', 'F'), (u'year', 2020), (u'name', 'World'), (u'number', 3180)]

Tested with 2.16.0 SDK and the DirectRunner.

Full code for both examples here.

Guillem Xercavins
  • This looks great. But you use DirectRunner, which is completely different from DataflowRunner; I had tens of situations where things run locally but not remotely. This is just another example: sideloading works fine locally but not with DataflowRunner. The error is 'invalid table name', but Dataflow returns mostly random error messages, so I'm not sure where the problem is at this point; at least it's certain that the real problem is not the table name. – Thijs Jan 16 '20 at 11:13
  • I'll try to fix it, update my question and if I get sideloading to work I'll accept your answer. I take the schema from a BQ table, maybe that's causing some trouble. – Thijs Jan 16 '20 at 11:19
  • Does it work when you add `beam.Map(lambda x: x)` then? If so, even if it seems like it does nothing, it might correct the type needed as input for `pvalue.AsList()`. If the error persists, can you add a full example, including an example of the schema stored in BigQuery and the error stack trace? Otherwise, I feel like it will require a lot of guessing. – Guillem Xercavins Jan 18 '20 at 19:35
  • Yes, the lambda works, so it may be the data type causing the problem then. Is the BigQuery read not 'materialized' when run? – Thijs Jan 21 '20 at 10:32

I found a similar question here. As the comments on that post explain, if your schema file (in this case JSON) is in a known location in GCS, you can add a ParDo to your pipeline that directly reads it from GCS using a start_bundle() implementation.

You can use Beam's FileSystems abstraction if you need to abstract away the file system used to store the schema file (not just GCS).
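A minimal sketch of that approach; the bucket and file names below are placeholders:

import json

from apache_beam.io.filesystems import FileSystems

# FileSystems picks the right implementation from the path scheme
# (gs://, local paths, hdfs://, ...)
with FileSystems.open('gs://your-bucket/schema.json') as f:
  field_names = [field['name'] for field in json.loads(f.read())]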

Also, you can read/download files from storage using the Google Cloud Storage client library directly.

I also found here a blog post that talks about the different source-reading patterns when using Google Cloud Dataflow.

I hope this helps.