Hi I've trying to get a TFX Pipeline going just as an exercise really. I'm using ImportExampleGen
to load TFRecords
from disk. Each Example
in the TFRecord
contains a jpg in the form of a byte string, height, width, depth, steering and throttle labels.
I'm trying to use StatisticsGen
but I'm receiving this warning;
WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string.
and crashing my Colab Notebook. As far as I can tell all the byte-string images in the TFRecord are not corrupt.
I cannot find concrete examples on StatisticsGen
and handling image data. According to the docs Tensorflow Data Validation can deal with image data.
In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.
But I'm not sure how this fits in with StatisticsGen
.
Here is the code that instantiates an ImportExampleGen
then the StatisticsGen
from tfx.utils.dsl_utils import tfrecord_input
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2
examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data the 'output_config' param
# Input train split is _tf_record_dir/*'
# Output 2 splits: train:eval=8:2.
train_ratio = 8
eval_ratio = 10-train_ratio
output = example_gen_pb2.Output(
split_config=example_gen_pb2.SplitConfig(splits=[
example_gen_pb2.SplitConfig.Split(name='train',
hash_buckets=train_ratio),
example_gen_pb2.SplitConfig.Split(name='eval',
hash_buckets=eval_ratio)
]))
example_gen = ImportExampleGen(input=examples,
output_config=output)
context.run(example_gen)
statistics_gen = StatisticsGen(
examples=example_gen.outputs['examples'])
context.run(statistics_gen)
Thanks in advance.