
I am facing a perplexing issue while attempting to convert a vanilla tensorflow/keras workflow into a tensorflow extended pipeline.

In short: the datasets generated by tfx’s ExampleGen component have a different structure from those created manually from the same data with tf.data.Dataset.from_tensor_slices(), and cannot be fed into a keras model.

Reproducible example

1. Data generation

Let’s assume we create a sample dataset using:

import pandas as pd
import random

df = pd.DataFrame({
    'a': [float(x) for x in range(100)],
    'b': [float(x + 1) for x in range(100)],
    'c': [float(x**2) for x in range(100)],
    'target': [random.randint(0, 2) for _ in range(100)],
})

df.to_parquet(my_path)  # my_path: wherever the .parquet file should go

2. Model generation

Let's use a dummy dense model for simplicity's sake.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

def build_model():

    model = Sequential()
    model.add(Dense(8, input_shape=(3,), activation='relu'))
    model.add(Dense(3, activation='softmax'))

    model.compile(
        optimizer=SGD(),
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )

    return model

3. What works: manual dataset creation

This parquet file can then be loaded back into a pandas df and converted into a tensorflow dataset using:

import tensorflow as tf

_BATCH_SIZE = 4

# Load the parquet file back and rebuild the (features, label) pair from it.
df = pd.read_parquet(my_path)

dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(df[['a', 'b', 'c']].values, tf.float32),
    tf.cast(df['target'].values, tf.int32),
)).batch(_BATCH_SIZE, drop_remainder=True)

This gives a dataset with cardinality() = <tf.Tensor: shape=(), dtype=int64, numpy=25>, which can be fed to the toy model above.
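
For completeness, a minimal sketch of the training call that works with this manually built dataset (the epoch count here is arbitrary):

model = build_model()

# Each element is a ((4, 3) float32, (4,) int32) pair, which fit() accepts directly.
model.fit(dataset, epochs=10)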

4. What doesn't work: making a tensorflow extended pipeline

I have tried to replicate those results by applying a slightly modified tfx starter pipeline:

import tensorflow as tf
from tensorflow_metadata.proto.v0 import schema_pb2  # needed for schema_pb2.Schema() below
from tfx_bsl.tfxio import dataset_options
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.components import Trainer
from tfx.dsl.components.base import executor_spec
from tfx.components.example_gen.component import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.components.trainer.executor import GenericExecutor
from tfx.orchestration import metadata
from tfx.orchestration import pipeline
from tfx.proto import trainer_pb2
from tfx.proto import example_gen_pb2
from tfx.utils.io_utils import parse_pbtxt_file


_BATCH_SIZE = 4
_LABEL_KEY = 'target'
_EPOCHS = 10


def _input_fn(file_pattern, data_accessor, schema) -> tf.data.Dataset:

    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=_BATCH_SIZE,
            label_key=_LABEL_KEY,
            num_epochs=_EPOCHS,
        ),
        schema,
    )
    
    return dataset


def build_model():
    """Same as above"""

    ...

    return model


def run_fn(fn_args):
    schema = parse_pbtxt_file(fn_args.schema_file, schema_pb2.Schema())
    
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        schema,
    )
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        schema,
    )

    model = build_model()
    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        epochs=_EPOCHS,
    )

    model.save(fn_args.serving_model_dir, save_format='tf')


def _create_pipeline(
    pipeline_name: str,
    pipeline_root: str,
    data_root: str,
    module_file: str,
    metadata_path: str,
    split: dict,
) -> pipeline.Pipeline:

    split_config = example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name=name, hash_buckets=buckets)
            for name, buckets in split.items()
        ]
    )

    example_gen = FileBasedExampleGen(
        input_base=data_root,
        custom_executor_spec=executor_spec.ExecutorClassSpec(parquet_executor.Executor),
        output_config=example_gen_pb2.Output(split_config=split_config),
    )

    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    infer_schema = SchemaGen(statistics=statistics_gen.outputs['statistics'])

    trainer = Trainer(
        module_file=module_file,
        custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
        examples=example_gen.outputs['examples'],
        schema=infer_schema.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(),
        eval_args=trainer_pb2.EvalArgs()
    )

    components = [example_gen, statistics_gen, infer_schema, trainer]
    metadata_config = metadata.sqlite_metadata_connection_config(metadata_path)

    _pipeline = pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        metadata_connection_config=metadata_config,
    )

    return _pipeline

However, the dataset generated by ExampleGen has cardinality tf.Tensor(-2, shape=(), dtype=int64), i.e. tf.data.experimental.UNKNOWN_CARDINALITY, and raises the following error when fed to the same model:

ValueError: Layer sequential expects 1 inputs, but it received 3 input tensors. Inputs received: [<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f40353373d0>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f4035337710>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f40352e3190>]
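
Comparing dataset.element_spec on both sides makes the mismatch visible. A sketch of the check; the tfx-side structure is inferred from the error above rather than pasted verbatim:

# Manual dataset: a plain (features, label) tuple of dense tensors.
print(dataset.element_spec)
# -> (TensorSpec(shape=(4, 3), dtype=tf.float32, name=None),
#     TensorSpec(shape=(4,), dtype=tf.int32, name=None))

# tfx dataset: a (features_dict, label) pair with one SparseTensor per feature,
# which is where the three SparseTensor inputs in the error come from.
print(train_dataset.element_spec)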

Importantly: the problem persists even when the data are stored as a csv file and read using CsvExampleGen, which makes it very unlikely that the issue arises from the data themselves.

Also, batching the tfx output dataset has no effect on the results.
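
That is, a re-batching attempt along these lines leaves the element structure, and hence the error, unchanged, presumably because the factory already batches internally:

# Elements are still (dict_of_SparseTensors, label) pairs afterwards.
rebatched = train_dataset.unbatch().batch(_BATCH_SIZE, drop_remainder=True)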

I’ve tried everything I could think of, to no avail. The relative obscurity of what's happening under tfx's hood doesn't help with the debugging, either. Does anyone have any idea what the problem is?

Edit 1

Two points have come to my attention since writing this question:

  • data_accessor.tf_dataset_factory() doesn't actually output a tensorflow.python.data.ops.dataset_ops.TensorSliceDataset, but a tensorflow.python.data.ops.dataset_ops.PrefetchDataset instead.

  • There are actually several as-yet-unanswered questions that look somewhat related to my problem, all discussing the pains of working with PrefetchDatasets:

TFDS Audio Preprocessing PrefetchDataset Problems

How to feed tf.prefetch dataset into LSTM?

Change PrefetchDataset shapes

Considering that none of those questions has found an answer, and that the crux of the problem seems to be the lack of documentation regarding PrefetchDatasets and how to use them, I'll open an issue on tfx's board if this doesn't get answered here within a few days.
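
For reference, this is the kind of adapter I suspect is needed to bridge the two formats. An unverified sketch, assuming the label also comes back as a SparseTensor and using the feature names from the toy data above:

def _squash(features, label):
    # Densify each per-feature SparseTensor (shape (batch, 1)) and stack the
    # three columns into the single (batch, 3) float32 tensor the model expects.
    dense_columns = [
        tf.reshape(tf.sparse.to_dense(features[key]), (-1,))
        for key in ('a', 'b', 'c')
    ]
    return tf.stack(dense_columns, axis=1), tf.sparse.to_dense(label)

train_dataset = train_dataset.map(_squash)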

Edit 2: version and environment details

As requested by TensorFlow Support, here are the details regarding the versions of all my TensorFlow-related installs:

  • Core components:

    • tensorflow==2.3.0
    • tfx==0.25.0
    • tfx-bsl==0.25.0
  • TensorFlow-related stuff:

    • tensorflow-cloud==0.1.7
    • tensorflow-data-validation==0.25.0
    • tensorflow-datasets==3.0.0
    • tensorflow-estimator==2.3.0
    • tensorflow-hub==0.9.0
    • tensorflow-io==0.15.0
    • tensorflow-metadata==0.25.0
    • tensorflow-model-analysis==0.25.0
    • tensorflow-probability==0.11.0
    • tensorflow-serving-api==2.3.0
    • tensorflow-transform==0.25.0
  • Environment and other miscellaneous details:

    • Python version: 3.7.9
    • OS: Debian GNU/Linux 10 (buster)
    • Running from an N1 GCP instance