I am trying to fit a TensorFlow model, and one of my features comes in as a comma-separated string of ints (possibly an empty string). The feature appears in the pre-transform schema as
feature {
  name: "csstring"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
and in the preprocessing_fn function it is processed via
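# Split each "1,2,3" string into a ragged row of tokens; drop the size-1 feature dim.
splitted = tf.squeeze(tf.strings.split(inputs["csstring"], sep=","), axis=1)
# Replace empty tokens with 'nan' so that they can be parsed as floats.
filled = tf.where(splitted == '', 'nan', splitted)
# Parse the tokens as float32 (the default out_type of tf.strings.to_number).
casted = tf.strings.to_number(filled)
# Average each row; an empty list is meant to yield nan.
meaned = tf.reduce_mean(casted, axis=1)
outputs["csstring"] = meaned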
I have managed to load the pre-transformed examples in a notebook and apply these transformation steps to get the processed feature as the average of each list (nan if the list is empty).
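To make the intended result concrete, here is a minimal plain-Python sketch of the per-row behaviour I expect. It is not the TensorFlow code from the pipeline, and `mean_of_csv` is a name made up purely for illustration:

```python
import math


def mean_of_csv(s: str) -> float:
    """Mean of a comma-separated string of ints; nan for an empty string."""
    # In Python, "".split(",") yields [""], so an empty input gives one empty token.
    tokens = s.split(",")
    # Map empty tokens to nan, parse everything else as a float.
    values = [math.nan if t == "" else float(t) for t in tokens]
    return sum(values) / len(values)


print(mean_of_csv("1,2,3"))
print(mean_of_csv(""))
```

So a row such as "1,2,3" should come out as its average, and an empty string should come out as nan.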
However, when I run the whole pipeline on Kubeflow, the Transform component fails with this error:
ValueError: An error occured while trying to apply the transformation: "StringToNumberOp could not correctly convert string:
[[node transform/transform/StringToNumber_1 (defined at venv/lib/python3.8/site-packages/tensorflow_transform/saved/saved_transform_io.py:262) ]]
I can't see any particular string instance that would be problematic to cast, and would appreciate any ideas as to why the pipeline doesn't work.
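For anyone who wants to look for problematic strings, a plain-Python check could look like the sketch below. It assumes valid tokens are optionally signed decimal integers. It is deliberately strict and is not meant to mirror the exact parsing rules of tf.strings.to_number. `find_suspect_tokens` and `INT_TOKEN` are hypothetical names, not part of the pipeline:

```python
import re

# Assumption: a valid token is an optionally signed decimal integer.
INT_TOKEN = re.compile(r"[+-]?\d+")


def find_suspect_tokens(rows):
    """Return (row_index, token) pairs for tokens that are neither empty nor plain ints."""
    suspects = []
    for i, row in enumerate(rows):
        for token in row.split(","):
            # Empty tokens are expected (they become 'nan'); anything else must be an int.
            if token != "" and INT_TOKEN.fullmatch(token) is None:
                suspects.append((i, token))
    return suspects


print(find_suspect_tokens(["1,2,3", "", "4, 5", "7,x"]))
```

Tokens with stray whitespace or non-digit characters are reported together with the index of the row they came from.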