Let's consider a very straightforward pipeline like this:
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)

statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('preprocessing_fn.py'),
    custom_config={'schema': schema_gen.outputs['schema']})

components = [
    example_gen,
    statistics_gen,
    schema_gen,
    transform,
]

_pipeline_data_folder = './simple_pipeline_data'
pipeline = tfx.dsl.Pipeline(
    pipeline_name='simple_pipeline',
    pipeline_root=_pipeline_data_folder,
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        f'{_pipeline_data_folder}/metadata.db'),
    components=components)

tfx.orchestration.LocalDagRunner().run(pipeline)
The only part of this snippet worth pointing out is the Transform definition. Namely, I want to pass the schema artifact through custom_config and use it in my transformation logic. I was hoping I could write the preprocessing_fn like this:
import os

from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2

def preprocessing_fn(inputs, custom_config):
    # Read the schema.pbtxt produced by SchemaGen and parse it
    # into a Schema proto.
    schema_path = os.path.join(custom_config['schema'].uri, 'schema.pbtxt')
    with open(schema_path) as f:
        schema = text_format.Parse(f.read(), schema_pb2.Schema())
    # do something to the inputs based on the parsed schema
But the problem is that custom_config['schema'] is not an Artifact object but a Channel. Worse, preprocessing_fn is called before the artifact even exists: I debugged and stepped through the code, and preprocessing_fn runs before the code that produces the artifact executes. So it's effectively impossible to use an artifact inside my preprocessing_fn function.
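To illustrate my understanding of the problem (this is a toy sketch, not actual TFX code, and FakeChannel is a made-up stand-in class): a Channel behaves like a placeholder that only gets a concrete artifact, and hence a concrete uri, at run time, so any code that dereferences it at pipeline-definition time has nothing to read.

```python
class FakeChannel:
    """Toy stand-in for a TFX Channel: a placeholder whose artifact
    (and hence uri) is resolved only when the orchestrator runs."""

    def __init__(self):
        self._artifact_uri = None  # nothing materialized yet

    @property
    def uri(self):
        if self._artifact_uri is None:
            raise RuntimeError("artifact not yet materialized")
        return self._artifact_uri

# At pipeline-definition time (when preprocessing_fn is traced),
# the channel has no artifact, so reading .uri fails:
channel = FakeChannel()
try:
    path = channel.uri
except RuntimeError as e:
    print(e)  # artifact not yet materialized
```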
My question is: is there a way around this? Can I use the contents of an artifact inside my preprocessing_fn function? How can I write a transformation whose logic depends on the outputs of previous components?