What are the best practices for storing data in, and reading data from, TFRecord files to train a forecasting model? I want to build a model that can predict the health of individual machines (for example, an electric motor) based on historical health data from a fleet of machines (for example, each motor's speed, error rate, breakdowns, etc.).
I can do the entire preprocessing (normalize the data, impute missing values, engineer new features, split into train/validation/test sets, etc.) with Apache Beam/Dataflow. But I was thinking it might be better to store the raw data as .tfrecord files and use TFX to do the normalization, imputation, etc., to make experimentation easier. TFX's tensorflow_transform currently doesn't support tf.SequenceExample, so I was thinking of storing the raw data as tf.Example records in the following format:
example_proto = tf.train.Example(features=tf.train.Features(feature={
    'timestamp': tf.train.Feature(int64_list=tf.train.Int64List(
        value=[1601200000, 1601200060, 1601200120, ...])),
    'feature0': tf.train.Feature(float_list=tf.train.FloatList(
        value=[np.nan, 15523.0, np.nan, ...])),
    'feature1': tf.train.Feature(float_list=tf.train.FloatList(
        value=[1.0, -8.0, np.nan, ...])),
    ...
    'label': tf.train.Feature(float_list=tf.train.FloatList(
        value=[0.5, -10.3, 2.1, ...])),
}))
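To make the plan concrete, here is a minimal round-trip sketch of that layout: write one such tf.Example to a TFRecord file, then read it back with tf.data. The values, file path, and the choice of `tf.io.VarLenFeature` are my own illustrative assumptions (variable-length lists per machine seem likely here, and VarLenFeature handles that without a fixed shape):

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# One illustrative record (made-up values; NaN marks a missing sensor reading).
example = tf.train.Example(features=tf.train.Features(feature={
    'timestamp': tf.train.Feature(int64_list=tf.train.Int64List(
        value=[1601200000, 1601200060, 1601200120])),
    'feature0': tf.train.Feature(float_list=tf.train.FloatList(
        value=[np.nan, 15523.0, np.nan])),
    'label': tf.train.Feature(float_list=tf.train.FloatList(
        value=[0.5, -10.3, 2.1])),
}))

# Write the serialized proto to a TFRecord file (temp path for the example).
path = os.path.join(tempfile.mkdtemp(), 'data.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Series lengths may differ between machines, so parse with VarLenFeature
# (which yields SparseTensors) and densify afterwards.
feature_spec = {
    'timestamp': tf.io.VarLenFeature(tf.int64),
    'feature0': tf.io.VarLenFeature(tf.float32),
    'label': tf.io.VarLenFeature(tf.float32),
}

def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return {k: tf.sparse.to_dense(v) for k, v in parsed.items()}

dataset = tf.data.TFRecordDataset(path).map(parse)
record = next(iter(dataset))
```

Each parsed record then comes back as dense 1-D tensors (one time series per feature), which a later windowing step could slice into model inputs.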
What do you think? Any tips?