
I'm interested in interactive development of a preprocessing_fn for tft_beam.AnalyzeAndTransformDataset. By interactive development, I mean running a standalone Beam pipeline in a Jupyter notebook and later connecting to the resulting transformed data with a tf.data.Dataset so I can inspect the results.

In other words, during interactive development, I do not want to run a TFX pipeline with a Transform component. I want to gradually build up my preprocessing_fn and inspect the results in a notebook while I iterate.
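For context, the "inspect the results" step I have in mind looks roughly like this. It's only a sketch; the paths are placeholders from my own setup rather than anything from the tutorial:

    import tensorflow as tf
    import tensorflow_transform as tft

    # Placeholder paths from my own setup.
    TRANSFORMED_DATA_PATTERN = 'gs://my-bucket/transformed/train*'
    TRANSFORM_OUTPUT_DIR = 'gs://my-bucket/transform_output'

    # The transform output written by the pipeline knows the transformed feature spec.
    tft_output = tft.TFTransformOutput(TRANSFORM_OUTPUT_DIR)
    feature_spec = tft_output.transformed_feature_spec()

    # Read the transformed TFRecords back and parse a small batch to inspect.
    dataset = tf.data.TFRecordDataset(tf.io.gfile.glob(TRANSFORMED_DATA_PATTERN))
    dataset = dataset.batch(8).map(lambda x: tf.io.parse_example(x, feature_spec))
    for batch in dataset.take(1):
        print(batch)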

To achieve this, I'm modifying the Beam pipeline from the "Preprocessing data with TensorFlow Transform" advanced tutorial.

That tutorial uses a CSV file as its input data. I'm trying to refactor it to read from BigQuery instead, but I'm stuck.

In the tutorial, the transform_data function starts processing the input data by instantiating a tfxio.BeamRecordCsvTFXIO:

  csv_tfxio = tfxio.BeamRecordCsvTFXIO(
      physical_format='text',
      column_names=ORDERED_CSV_COLUMNS,
      schema=SCHEMA)

The pipeline then creates a raw_data PCollection with two standard Beam PTransforms that read and clean the text data:

  raw_data = (
      pipeline
      | 'ReadTrainData' >> beam.io.ReadFromText(
          train_data_file, coder=beam.coders.BytesCoder())
      | 'FixCommasTrainData' >> beam.Map(
          lambda line: line.replace(b', ', b','))

Then csv_tfxio.BeamSource() is used to decode the training data:

      | 'DecodeTrainData' >> csv_tfxio.BeamSource())

and a raw_dataset tuple is created by combining the raw_data PCollection with a TensorAdapterConfig created from the csv_tfxio BeamRecordCsvTFXIO:

  raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())

Finally, the raw_dataset tuple and the preprocessing_fn are passed to tft_beam.AnalyzeAndTransformDataset like so:

  transformed_dataset, transform_fn = (
      raw_dataset | tft_beam.AnalyzeAndTransformDataset(
          preprocessing_fn, output_record_batches=True))

I'd like to read from BigQuery without first exporting the data to CSV, replacing the ReadFromText step shown above with this:

    raw_data = (
        pipeline
        | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
            query=MY_QUERY,
            use_standard_sql=True)
    )

...but I can't figure out how to create the TensorAdapterConfig that AnalyzeAndTransformDataset requires, especially since ReadFromBigQuery yields a PCollection of Python dicts rather than the serialized text lines that the CSV TFXIO decodes.

Barely any documentation exists for these classes outside of comments in the TFX source code. In particular, I didn't see any BigQuery-specific TFXIO subclasses in tfx_bsl, so I tried to hack something together using the existing CSV-processing functions, but it didn't work.

Specifically, I first created a SCHEMA from a raw feature spec in a similar fashion to the tutorial, using tft.tf_metadata.schema_utils.schema_from_feature_spec.
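Concretely, that step looks like this; RAW_DATA_FEATURE_SPEC here is a shortened stand-in for my real feature spec:

    import tensorflow as tf
    from tensorflow_transform.tf_metadata import schema_utils

    # Shortened stand-in for my real raw feature spec.
    RAW_DATA_FEATURE_SPEC = {
        'age': tf.io.FixedLenFeature([], tf.int64),
        'education': tf.io.FixedLenFeature([], tf.string),
    }
    SCHEMA = schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)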

Because tensor representations are part of the TensorAdapterConfig, I then tried:

    from tfx_bsl.tfxio.tensor_representation_util import GetTensorRepresentationsFromSchema

    tensor_representations = GetTensorRepresentationsFromSchema(SCHEMA)  # returns None

But GetTensorRepresentationsFromSchema returns None, presumably because it is expecting a Schema format different from that provided by schema_from_feature_spec.

TensorAdapterConfig also needs a pyarrow Schema rather than a schema_pb2.Schema, so I tried to create one using the GetArrowSchema function in tfx_bsl.coders.csv_decoder, but I don't know whether it would have worked because I couldn't create the tensor representations.
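For reference, this is roughly what I was trying to assemble. The pa.schema below is a hand-written guess at the Arrow schema (I never got GetArrowSchema wired up), and the whole thing falls over because the tensor representations come back as None:

    import pyarrow as pa
    from tfx_bsl.tfxio import tensor_adapter
    from tfx_bsl.tfxio.tensor_representation_util import GetTensorRepresentationsFromSchema

    # Hand-written guess at the Arrow schema the decoded data would have.
    arrow_schema = pa.schema([
        pa.field('age', pa.large_list(pa.int64())),
        pa.field('education', pa.large_list(pa.large_binary())),
    ])

    # This is the part that fails: it returns None for my SCHEMA.
    tensor_representations = GetTensorRepresentationsFromSchema(SCHEMA)

    tensor_adapter_config = tensor_adapter.TensorAdapterConfig(
        arrow_schema=arrow_schema,
        tensor_representations=tensor_representations)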

Another tutorial I found successfully used BigQuery in a standalone TFT pipeline, but it relies on older methods that I suspect predate the TFXIO RFC.

What's the best way to use BigQuery in a standalone Beam pipeline using up-to-date versions of TFX and TFT?

jb_ml_eng
  • I'm having a similar issue. Since upgrading to any Apache Beam version higher than 2.20, I haven't been able to finish any job on GCP because it says the schema is missing, also on AnalyzeAndTransform. Were you able to figure out how to create a TFT pipeline with BigQuery? – mcastilloy2k Mar 15 '22 at 14:20
