
I'm running a simple penguin pipeline in interactive mode with a train/eval split. The Transform step runs, but I can't get per-split post_transform_stats artifacts.

Inside the dedicated artifact folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5, I have just one FeaturesStats.pb, rather than Split-train and Split-eval subfolders with a FeaturesStats.pb inside each.

However, the Split-train and Split-eval subfolders do exist in the artifact folder for the transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
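To make the comparison between the two artifact folders concrete, here is a minimal sketch that lists the `Split-*` subfolders directly under an artifact folder. The helper name `list_split_dirs` is hypothetical and not part of TFX; it only inspects the local filesystem and assumes the folder exists.

```python
from pathlib import Path


def list_split_dirs(artifact_uri):
    """Return the sorted names of the `Split-*` subfolders directly under a folder.

    Hypothetical helper for inspecting an artifact folder; not a TFX API.
    """
    root = Path(artifact_uri)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and p.name.startswith("Split-")
    )
```

Called on the transformed_examples folder above, it should list both splits; called on the post_transform_stats folder, it returns an empty list when only a single FeaturesStats.pb is present.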

Here is how I define the Transform component, explicitly providing the splits and setting disable_statistics=False:

from tfx import v1 as tfx
from tfx.proto import transform_pb2

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)

I checked the docstring and the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py) and don't see anything I have forgotten or gotten wrong. However, I was puzzled by the following comment, which describes a location for the stats that I cannot find:

      disable_statistics: If True, do not invoke TFDV to compute pre-transform
        and post-transform statistics. When statistics are computed, they will
        will be stored in the `pre_transform_feature_stats/` and
        `post_transform_feature_stats/` subfolders of the `transform_graph`
        export.

For now, my workaround is to explicitly disable statistics in the Transform component and define a dedicated StatisticsGen component next to it that works on the transformed example splits. It would have been great to get the per-split statistics from the Transform component directly.

Thanks for any help

  • I also tried to get per-split post_transform_stats by running the taxi notebook (https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/components_keras.ipynb), but without success. – Youcef Kacer Feb 07 '23 at 16:30

1 Answer

This is expected: the statistics computation inside Transform currently runs over the entire transformed dataset, regardless of split or span.

To generate separate statistics for each split, please use the StatisticsGen component.
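As a rough sketch of that approach, mirroring the workaround described in the question: disable statistics in Transform and feed its `transformed_examples` output to a separate StatisticsGen. The wrapper function `build_transform_with_split_stats` and its parameter names are hypothetical, introduced only to keep the example self-contained; `example_gen` and `schema_gen` are assumed to be the upstream components from the question's pipeline.

```python
def build_transform_with_split_stats(example_gen, schema_gen, module_file):
    """Return (transform, post_transform_stats) components.

    Hypothetical wrapper: the TFX imports are local so this sketch can be
    loaded even where TFX is not installed.
    """
    from tfx import v1 as tfx
    from tfx.proto import transform_pb2

    # Transform without its built-in (whole-dataset) statistics.
    transform = tfx.components.Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        disable_statistics=True,
        splits_config=transform_pb2.SplitsConfig(
            analyze=['train'], transform=['train', 'eval']),
        module_file=module_file)

    # StatisticsGen computes statistics separately for each split of its input.
    post_transform_stats = tfx.components.StatisticsGen(
        examples=transform.outputs['transformed_examples'])

    return transform, post_transform_stats
```

In interactive mode, both returned components would then be run with the interactive context in order, Transform first. The StatisticsGen output is expected to contain one subfolder per split, though that layout is not verified here.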

halfer