I'm interested in interactive development of a preprocessing_fn for tft_beam.AnalyzeAndTransformDataset. By interactive development, I mean running a standalone Beam pipeline in a Jupyter notebook and later connecting to the resulting transformed data with a tf.data.Dataset so I can inspect the results.
In other words, during interactive development, I do not want to run a TFX pipeline with a Transform component. I want to gradually build up my preprocessing_fn and inspect the results in a notebook while I iterate.
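For concreteness, this is roughly how I imagine inspecting the output once a pipeline run finishes (a sketch only; the paths are hypothetical placeholders, and it assumes I write the transformed examples out as TFRecord files the way the tutorial does):
import tensorflow as tf
import tensorflow_transform as tft

# Hypothetical output locations from the standalone pipeline run.
TRANSFORM_OUTPUT_DIR = './transform_output'        # written by tft_beam.WriteTransformFn
TRANSFORMED_DATA_PATTERN = './transformed_data*'   # TFRecords of transformed examples

tft_output = tft.TFTransformOutput(TRANSFORM_OUTPUT_DIR)
feature_spec = tft_output.transformed_feature_spec()

dataset = tf.data.TFRecordDataset(tf.io.gfile.glob(TRANSFORMED_DATA_PATTERN))
dataset = dataset.map(
    lambda serialized: tf.io.parse_single_example(serialized, feature_spec))

for example in dataset.take(3):
    print(example)  # eyeball the transformed features in the notebook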
To achieve this, I'm modifying the Beam pipeline in the "Preprocessing data with TensorFlow Transform" advanced tutorial.
That tutorial uses a CSV for input data. I'm trying to refactor it to use BigQuery, but I'm stuck.
In the tutorial, the transform_data function starts processing the input data by instantiating a tfxio.BeamRecordCsvTFXIO:
csv_tfxio = tfxio.BeamRecordCsvTFXIO(
    physical_format='text',
    column_names=ORDERED_CSV_COLUMNS,
    schema=SCHEMA)
The pipeline then starts to create a raw_data PCollection with two standard Beam PTransforms that read and clean the text data:
raw_data = (
    pipeline
    | 'ReadTrainData' >> beam.io.ReadFromText(
        train_data_file, coder=beam.coders.BytesCoder())
    | 'FixCommasTrainData' >> beam.Map(
        lambda line: line.replace(b', ', b','))
Then csv_tfxio.BeamSource is used to decode the train data:
    | 'DecodeTrainData' >> csv_tfxio.BeamSource())
and a raw_dataset tuple is created by pairing the raw_data PCollection with the TensorAdapterConfig obtained from the csv_tfxio BeamRecordCsvTFXIO:
raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())
Finally, the raw_dataset tuple and the preprocessing_fn are passed to tft_beam.AnalyzeAndTransformDataset like so:
transformed_dataset, transform_fn = (
    raw_dataset | tft_beam.AnalyzeAndTransformDataset(
        preprocessing_fn, output_record_batches=True))
I'd like to read from BigQuery without first exporting the data to CSV, substituting this step for the ReadFromText step shown above:
raw_data = (
    pipeline
    | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
        query=MY_QUERY,
        use_standard_sql=True)
)
...but I can't figure out how to create the TensorAdapterConfig that AnalyzeAndTransformDataset requires (and, as far as I can tell, ReadFromBigQuery emits one Python dict per row, whereas csv_tfxio.BeamSource emits pyarrow RecordBatches, so I presumably also need to convert the rows into whatever the TensorAdapterConfig describes).
Barely any documentation exists for these classes outside of comments in the TFX source code. In particular, I didn't see any BigQuery-specific TFXIO subclasses in tfx_bsl, so I tried to hack something together using the existing functions for CSV processing, but it didn't work.
Specifically, I first created a SCHEMA from a raw feature spec in a similar fashion to the tutorial, using tft.tf_metadata.schema_utils.schema_from_feature_spec.
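Roughly like this, where the feature names are hypothetical stand-ins for the columns returned by MY_QUERY:
import tensorflow as tf
from tensorflow_transform.tf_metadata import schema_utils

# Hypothetical raw feature spec matching the columns returned by MY_QUERY.
RAW_DATA_FEATURE_SPEC = {
    'age': tf.io.FixedLenFeature([], tf.int64),
    'education': tf.io.FixedLenFeature([], tf.string),
}
SCHEMA = schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)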
Because tensor representations are part of the TensorAdapterConfig, I then tried:
from tfx_bsl.tfxio.tensor_representation_util import GetTensorRepresentationsFromSchema
GetTensorRepresentationsFromSchema(SCHEMA)
But GetTensorRepresentationsFromSchema returns None, presumably because it is expecting a Schema format different from that provided by schema_from_feature_spec.
TensorAdapterConfig also needs a pyarrow Schema rather than a schema_pb2.Schema, so I tried to create one using the GetArrowSchema function in tfx_bsl.coders.csv_decoder, but I don't know whether it would have worked because I couldn't create the tensor representations.
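For reference, here is the general shape of what I think I need to end up with: a hand-built sketch for a single hypothetical integer column 'age'. The Arrow types and TensorRepresentation settings below are my guesses, not something I have gotten working:
import pyarrow as pa
from tensorflow_metadata.proto.v0 import schema_pb2
from tfx_bsl.tfxio import tensor_adapter

# Guess at the Arrow schema that BeamRecordCsvTFXIO would have produced for 'age'.
arrow_schema = pa.schema([pa.field('age', pa.large_list(pa.int64()))])

# Guess at the matching tensor representation for 'age'.
age_representation = schema_pb2.TensorRepresentation()
age_representation.varlen_sparse_tensor.column_name = 'age'

tensor_adapter_config = tensor_adapter.TensorAdapterConfig(
    arrow_schema=arrow_schema,
    tensor_representations={'age': age_representation})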
This tutorial successfully used BigQuery in a standalone TFT pipeline, but it uses older methods that I suspect predate the TFXIO RFC.
What's the best way to use BigQuery in a standalone Beam pipeline using up-to-date versions of TFX and TFT?