I am trying to implement a Feature Selection component with the following plan in mind:
The implementation
- The component takes an `InputArtifact[Example]` as input
- Since the data is stored as TFRecords at the URI of the input artifact, I convert it into compatible numpy dictionaries and use sklearn to come up with the list of selected features
- I delete the required features directly from the input examples and put the result in an `OutputArtifact[Example]` (which has the same structure but fewer columns)
I am done with the first and second points, but I cannot figure out how to delete the selected columns directly in the TFRecord dataset itself (which I get using `tf.data.TFRecordDataset(train_uri, compression_type='GZIP')`).
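To make the third point concrete, here is a minimal sketch of the operation in question. It assumes that each record is a serialized `tf.train.Example`, that TensorFlow 2.x eager execution is in use, and that the output should also be GZIP-compressed. The helper names `drop_features` and `rewrite_tfrecords`, and the parameters `columns_to_drop`, `input_path` and `output_path`, are hypothetical and not part of TFX or TensorFlow.

```python
import tensorflow as tf


def drop_features(serialized, columns_to_drop):
    """Parse one serialized tf.train.Example and remove the named features."""
    example = tf.train.Example()
    example.ParseFromString(serialized)
    for name in columns_to_drop:
        # The feature map is a protobuf map field, so `in` and `del` work on it.
        # Names that are not present in the record are skipped.
        if name in example.features.feature:
            del example.features.feature[name]
    return example.SerializeToString()


def rewrite_tfrecords(input_path, output_path, columns_to_drop):
    """Copy a GZIP TFRecord file, dropping the given columns from every record."""
    dataset = tf.data.TFRecordDataset(input_path, compression_type='GZIP')
    # Assumption: the output is written with the same compression as the input.
    options = tf.io.TFRecordOptions(compression_type='GZIP')
    with tf.io.TFRecordWriter(output_path, options=options) as writer:
        for record in dataset:
            # In eager mode, record.numpy() yields the raw serialized bytes.
            writer.write(drop_features(record.numpy(), columns_to_drop))
```

Under these assumptions, something like `rewrite_tfrecords(train_uri, output_path, columns_to_drop)` would be called once per input file. This sketch rewrites records one at a time in Python, which may be slow for large datasets.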