Let's consider a very straightforward pipeline like this:
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)

statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('preprocessing_fn.py'),
    custom_config={'schema': schema_gen.outputs['schema']})

components = [
    example_gen,
    statistics_gen,
    schema_gen,
    transform,
]

_pipeline_data_folder = './simple_pipeline_data'
pipeline = tfx.dsl.Pipeline(
    pipeline_name='simple_pipeline',
    pipeline_root=_pipeline_data_folder,
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        f'{_pipeline_data_folder}/metadata.db'),
    components=components)

tfx.orchestration.LocalDagRunner().run(pipeline)
The only part of this snippet worth pointing out is the Transform definition. Namely, I want to pass the schema artifact through custom_config and use it in my transformation logic. I was hoping I could write the preprocessing_fn like this:
import os

from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2

def preprocessing_fn(inputs, custom_config):
    # Read the schema.pbtxt produced by SchemaGen and parse it
    # into a Schema proto.
    schema_path = os.path.join(custom_config['schema'].uri, 'schema.pbtxt')
    with open(schema_path) as f:
        schema = text_format.Parse(f.read(), schema_pb2.Schema())
    # do something to the inputs based on the parsed schema
But the problem is that custom_config['schema'] is not an Artifact object but a Channel. Worse, preprocessing_fn is called before the artifact even exists: I debugged and stepped through the code, and preprocessing_fn runs before the code that produces the artifact executes. So it's effectively impossible to use an artifact inside my preprocessing_fn function.
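To illustrate my understanding of the problem (this is a toy sketch, not actual TFX code, and FakeChannel is a made-up stand-in class): a Channel behaves like a placeholder that only gets a concrete artifact, and hence a concrete uri, at run time, so any code that dereferences it at pipeline-definition time has nothing to read.

```python
class FakeChannel:
    """Toy stand-in for a TFX Channel: a placeholder whose artifact
    (and hence uri) is resolved only when the orchestrator runs."""

    def __init__(self):
        self._artifact_uri = None  # nothing materialized yet

    @property
    def uri(self):
        if self._artifact_uri is None:
            raise RuntimeError("artifact not yet materialized")
        return self._artifact_uri

# At pipeline-definition time (when preprocessing_fn is traced),
# the channel has no artifact, so reading .uri fails:
channel = FakeChannel()
try:
    path = channel.uri
except RuntimeError as e:
    print(e)  # artifact not yet materialized
```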
My question is: is there a way around this? Can I use the contents of an artifact inside my preprocessing_fn function? How can I write a transformation whose logic depends on the outputs of previous components?