TFX StatisticsGen for image data

Question

Hi, I've been trying to get a TFX pipeline going, really just as an exercise. I'm using ImportExampleGen to load TFRecords from disk. Each Example in the TFRecord contains a JPEG as a byte string, plus height, width, depth, and steering and throttle labels.

I'm trying to use StatisticsGen, but I receive this warning: WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string. and my Colab notebook crashes. As far as I can tell, none of the byte-string images in the TFRecord are corrupt.

I cannot find concrete examples of using StatisticsGen with image data. According to the docs, TensorFlow Data Validation can deal with image data:

In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.

But I'm not sure how this fits in with StatisticsGen.
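For reference, the call described in the quoted docs can be sketched as below. This is a minimal sketch, not a confirmed fix for the warning: `compute_image_aware_stats` is a hypothetical helper name, and the path passed in is whatever TFRecord file you are inspecting. The import sits inside the function so the snippet can be pasted before TFDV is installed. Newer TFX releases also accept a `stats_options` argument on StatisticsGen itself; check whether your installed version has it.

```python
def compute_image_aware_stats(tfrecord_path: str):
    """Run TFDV directly with semantic-domain (e.g. image) stats enabled.

    Hypothetical helper: the name and the single-path argument are
    illustrative, not part of TFDV or TFX.
    """
    # Imported lazily so that defining the helper does not require TFDV.
    import tensorflow_data_validation as tfdv

    options = tfdv.StatsOptions(enable_semantic_domain_stats=True)
    return tfdv.generate_statistics_from_tfrecord(
        data_location=tfrecord_path,
        stats_options=options,
    )
```

Calling `compute_image_aware_stats(tfrecords_filename)` returns the statistics proto, which can then be inspected with `tfdv.visualize_statistics`.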

Here is the code that instantiates an ImportExampleGen and then the StatisticsGen:

from tfx.components import StatisticsGen
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2
from tfx.utils.dsl_utils import tfrecord_input

examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data the 'output_config' param

# Input split is '_tf_record_dir/*'.
# Output 2 splits: train:eval=8:2.
train_ratio = 8
eval_ratio = 10 - train_ratio
output = example_gen_pb2.Output(
             split_config=example_gen_pb2.SplitConfig(splits=[
                 example_gen_pb2.SplitConfig.Split(name='train',
                                                   hash_buckets=train_ratio),
                 example_gen_pb2.SplitConfig.Split(name='eval',
                                                   hash_buckets=eval_ratio)
             ]))
example_gen = ImportExampleGen(input=examples,
                               output_config=output)
context.run(example_gen)

statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
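For readers unfamiliar with `hash_buckets`, the 8:2 split configured above can be pictured roughly as follows. This is an illustrative sketch only: it uses MD5 from the standard library as a stand-in hash, which is an assumption and not necessarily what TFX uses internally, and `assign_split` is a hypothetical name.

```python
import hashlib


def assign_split(record: bytes, splits):
    """Map a serialized record to a split name.

    `splits` is a list of (name, hash_buckets) pairs, mirroring the
    SplitConfig above. The hash function is a stand-in for illustration.
    """
    total_buckets = sum(buckets for _, buckets in splits)
    digest = hashlib.md5(record).digest()
    bucket = int.from_bytes(digest[:8], "big") % total_buckets
    upper = 0
    for name, buckets in splits:
        upper += buckets
        if bucket < upper:
            return name
    raise AssertionError("bucket index out of range")


splits = [("train", 8), ("eval", 2)]
counts = {"train": 0, "eval": 0}
for i in range(10_000):
    counts[assign_split(f"example-{i}".encode(), splits)] += 1
print(counts)  # roughly 80% train, 20% eval
```

Because the assignment depends only on the record's bytes, the same record always lands in the same split across runs.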

Thanks in advance.

Update: I have been doing some digging. The [TFX StatisticsGen Docs](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/StatisticsGen#class_statisticsgen) lean on [tfx.data_validation](https://www.tensorflow.org/tfx/guide/tfdv), which led me to try this: `stats = tfdv.generate_statistics_from_tfrecord(data_location=tfrecords_filename)`. It results in the same warnings and crashing of Colab. Getting closer to the root of the issue, I guess. – Joshua Patterson, Mar 08 '20 at 00:26
Hmmmm ok, so I found a CIFAR 10 [example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/cifar10) where a tfrecord has been created already. When I use it to create a StatisticsGen I get the same warning and my Google Colab crashed. Maybe it's just Colab getting overwhelmed with text output. Maybe I can change the log level. See if that helps. — Joshua Patterson, Mar 09 '20 at 00:56
Have you been able to find a solution to this issue? I have the same error message. — BioGeek, Mar 19 '20 at 12:05
Had a similar issue. The fix was to update to tfx 0.21.2, which got things working interactively in a notebook as above. Updating to 0.21.2 also got things working on Kubeflow (make sure to update your Dockerfile). I've adapted this example, similarly to the above, to use ImportExampleGen: https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_interactive.ipynb – Darren Brien, Mar 29 '20 at 16:59
So there seems to be an issue with TFX and Colab at the moment. I am unable to even run the standard tfx components.ipynb. I have raised an [issue](https://github.com/tensorflow/tfx/issues/1595). Will try again when the issue is resolved. — Joshua Patterson, Apr 08 '20 at 01:10
Sad times @DarrenBrien. Even with `tfx==0.21.2` I still get a bazillion `Warnings`, then my Colab tab becomes unresponsive. I'm going to try to figure out how to change the log level so the cell output doesn't spam me with warnings. – Joshua Patterson, Apr 10 '20 at 04:44

score 5 · Accepted Answer · answered Apr 22 '20 at 12:19

From the response on the GitHub issue (thanks, Evan Rosen):

Hi Folks,

The warnings you are seeing indicate that StatisticsGen is trying to treat your raw image features like a categorical string feature. The image bytes are being decoded just fine. The issue is that when the stats (including top K examples) are being written, the output proto is expecting a UTF-8 valid string, but instead gets the raw image bytes. Nothing is wrong with your setups from what I can tell, but this is just an unintended side-effect of a well-intentioned warning in the event that you have a categorical string feature which can't be serialized. We'll look into finding a better default that handles image data more elegantly.
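The UTF-8 point above is easy to reproduce without TFX. The sketch below uses a hand-written JPEG header rather than a real image, and `is_valid_utf8` is a hypothetical helper; it only shows why image bytes fail the check that a categorical string would pass.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 text."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A categorical string feature is ordinary UTF-8 text.
label_bytes = "left".encode("utf-8")

# JPEG files start with the bytes FF D8 FF; 0xFF never occurs in valid UTF-8.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b"JFIF\x00"

print(is_valid_utf8(label_bytes))  # True
print(is_valid_utf8(jpeg_header))  # False
```

So the warning says nothing about the images being corrupt; it only means the bytes are not text.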

In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of data) and then modify the inferred schema to annotate the image features. Here is a modified version of the colab from @tall-josh:

Open In Colab

The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:

import os

from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

# Load autogenerated schema (using stats from small batch)

schema = tfx.utils.io_utils.SchemaReader().read(
    tfx.utils.io_utils.get_only_uri_in_dir(
        tfx.types.artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))

# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
for feature in schema.feature:
  if feature.name == 'image/raw':
    feature.image_domain.SetInParent()

# Write modified schema to local file
user_schema_dir = '/tmp/user-schema/'
tfx.utils.io_utils.write_pbtxt_file(
    os.path.join(user_schema_dir, 'schema.pbtxt'), schema)

# Create ImporterNode to make modified schema available to other components
user_schema_importer = tfx.components.ImporterNode(
    instance_name='import_user_schema',
    source_uri=user_schema_dir,
    artifact_type=tfx.types.standard_artifacts.Schema)

# Run the user schema ImporterNode
context.run(user_schema_importer)

Hopefully you find this workaround useful. In the meantime, we'll take a look at a better default experience for image-valued features.
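The cell above stops once the edited schema has been imported. A minimal sketch of the remaining step follows, assuming the `schema` argument of StatisticsGen and the `result` output key of ImporterNode as they exist in the TFX 0.21.x line (check both against your installed version). `build_statistics_gen` is a hypothetical helper name, and the import is deferred so the snippet can be defined before TFX is loaded.

```python
def build_statistics_gen(example_gen, user_schema_importer):
    """Create a StatisticsGen that uses the curated schema.

    Hypothetical helper: `example_gen` and `user_schema_importer` are the
    component instances created in the cells above.
    """
    # Imported lazily so that defining the helper does not require TFX.
    from tfx.components import StatisticsGen

    return StatisticsGen(
        examples=example_gen.outputs['examples'],
        # Assumption: ImporterNode exposes the imported artifact as 'result'.
        schema=user_schema_importer.outputs['result'],
    )
```

In the notebook this would be used as `context.run(build_statistics_gen(example_gen, user_schema_importer))`.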

score 1 · Answer 2 · answered Apr 11 '20 at 15:54


Grokked this and found the solution to be dramatically simpler than I thought:

from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
import logging
...
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
...
context = InteractiveContext(pipeline_name='my_pipe')
...
c = StatisticsGen(...)
...
context.run(c)

answered Apr 11 '20 at 15:54 by Darren Brien

Thanks [@DarrenBrien](https://stackoverflow.com/users/5873699/darren-brien). That worked in the sense that I can continue running the notebook without overwhelming the cell output with warning messages. Ideally, we'll figure out how to properly encode/decode image data so the warning is never triggered. I have posted an [issue](https://github.com/tensorflow/tfx/issues/1622) surrounding how to correctly encode/decode image data so StatisticsGen actually works. – Joshua Patterson Apr 12 '20 at 05:17
For anyone interested: [Run in Colab](https://colab.research.google.com/github/tall-josh/tfx-dingocar/blob/git-issue/TFX_dingocar.ipynb). Run everything down to the **StatisticsGen** cells. – Joshua Patterson Apr 12 '20 at 05:40
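Setting the root logger to CRITICAL, as in the answer above, also hides every other warning and error. A narrower sketch using only the standard library is to filter out just the UTF-8 message. It assumes the warning is logged directly on the root logger, which the `WARNING:root:` prefix suggests; `DropUtf8DecodeWarning` and `ListHandler` are hypothetical names, and the handler exists only to demonstrate the effect.

```python
import logging


class DropUtf8DecodeWarning(logging.Filter):
    """Suppress only the 'cannot be decoded as a UTF-8 string' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "cannot be decoded as a UTF-8 string" not in record.getMessage()


class ListHandler(logging.Handler):
    """Collect emitted messages in a list (for demonstration only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


root_logger = logging.getLogger()
# A logger-level filter applies to records logged directly on this logger.
root_logger.addFilter(DropUtf8DecodeWarning())

capture = ListHandler()
root_logger.addHandler(capture)

logging.warning(
    'Feature "image_raw" has bytes value "None" which cannot be decoded '
    "as a UTF-8 string."
)
logging.warning("some other warning")
print(capture.messages)
```

Only the second message reaches the handler, so unrelated warnings stay visible in the notebook.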
