We are investigating moving our ML pipelines from a set of manual steps to a TFX pipeline, and I have a few questions on which I would appreciate some insight.
We typically perform the following steps (for an image classification task):
1. Load image data and metadata
2. Filter out ‘bad’ data based on metadata
3. Determine image-based statistics (classic image processing in Python):
   - Image-level characteristics
   - Image-region characteristics (the region is determined by a fine-tuned EfficientDet model)
4. Filter out ‘bad’ data based on image statistics
5. Generate TFRecords from the image data and metadata
6. Oversample certain TFRecords for class balancing (using tf.data)
7. Train an image classifier
8. …
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
For feeding our data into ExampleGen, I see two options:
ExampleGen uses a CSV file containing pointers to the images and the metadata to be loaded (step 1 above). However:
- If this CSV file contains a path to an image file, can ExampleGen then load the image data and add it to its output?
- Is the output of ExampleGen a streaming output, or a dump of all example data?
ExampleGen takes TFRecords as input (the output of step 5 above).
-> This implies that we would still need to implement steps 1-5 outside of TFX, which would decrease the value of TFX for us.
Could you please advise on the best way forward?
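To make the first option concrete, here is a minimal plain-Python sketch of the behaviour we would want from the input component: read each CSV row, follow the image path, and attach the raw image bytes to the example. This is not TFX code. The function name `load_examples` and the column name `image_path` are hypothetical and only illustrate the idea.

```python
import csv
from pathlib import Path


def load_examples(csv_path):
    """Yield one dict per CSV row, with the raw image bytes attached.

    Assumes a hypothetical 'image_path' column; all other columns are
    passed through unchanged as metadata.
    """
    csv_path = Path(csv_path)
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            # Relative image paths are resolved against the CSV's directory;
            # absolute paths are used as they are.
            image_path = csv_path.parent / row["image_path"]
            example = dict(row)
            example["image_raw"] = image_path.read_bytes()
            # Yielding keeps this streaming: one example in memory at a time.
            yield example
```

Because this is a generator, it produces examples one at a time, which is also why the streaming-versus-dump question above matters for the later steps.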
Can StatisticsGen also generate statistics on a per-example basis (for example some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image-based characteristics with classic image processing is slow. When new data becomes available and triggers the TFX input component, statistics that were already calculated should ideally be loaded from the cache rather than recomputed.
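The caching behaviour we have in mind could be sketched as follows, in plain Python. This is an illustration of the desired per-example behaviour only, not of how TFX or the metadata store works; I do not know whether they offer an equivalent. The names `cached_image_stats`, `compute_fn` and `cache_dir` are hypothetical, and the sketch assumes the statistics are JSON-serializable.

```python
import hashlib
import json
from pathlib import Path


def cached_image_stats(image_bytes, compute_fn, cache_dir):
    """Return compute_fn(image_bytes), reusing a cached result when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key on the image content, so an image that is delivered again
    # under a different file name still hits the cache.
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    # Cache miss: run the slow classic image processing step once.
    stats = compute_fn(image_bytes)
    cache_file.write_text(json.dumps(stats))
    return stats
```

With something like this, a new data delivery would only pay the processing cost for images that have not been seen before.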
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup? Normally we do this by oversampling our TFRecords using tf.data. If this is done at the ExampleGen level, the ExampleValidator may still reject some examples, potentially unbalancing the data again. This may not seem like a big issue for large-data ML tasks, but it becomes crucial for small-data ML tasks (as is typically the case in a healthcare setting). So I would expect a TFX component for this before the Transform component, but that component would then need access to all data at once, not in a streaming way (see my earlier question on ExampleGen output)…
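As an illustration of why the balancing step needs the complete, already-filtered data set, here is a minimal sketch in plain Python (deliberately not tf.data or TFX). It oversamples by random duplication until every class matches the largest one. The function name `oversample_to_balance` and the `label` key are hypothetical.

```python
import random
from collections import defaultdict


def oversample_to_balance(examples, label_key="label", seed=0):
    """Randomly duplicate minority-class examples until all classes
    have as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    # A full pass is required first: the class counts are not known up front.
    for example in examples:
        by_class[example[label_key]].append(example)
    if not by_class:
        return []
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # Top up with random duplicates drawn from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced
```

The target count depends on what survives validation, so running this before examples are rejected would give the wrong duplication factors.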
Thank you for your insights.