
TL;DR:

I'm facing an issue with the Evaluator component. All examples of using the Evaluator use the label from the original ExampleGen data as the source of labels, but I want to give it labels that I compute during the pipeline.

Is there a way that I can one-hot encode the labels on the fly before giving them to the Evaluator? The alternative would be to one-hot encode the data in the Transform component and then load it again with the ImportExampleGen component, but that is very expensive in both time and memory.


Long version:

I am running a language modeling pipeline: I have text as input and I want to train an LSTM-based LM. My steps so far are:

  • Ingest the text data using ImportExampleGen and tokenize it using a vocab file:
from tfx.components import ImportExampleGen
from tfx.proto import example_gen_pb2

# 45 + 5 hash buckets -> a 90/10 train/eval split
output = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name="train", hash_buckets=45),
            example_gen_pb2.SplitConfig.Split(name="eval", hash_buckets=5),
        ]
    )
)


# Load the data from our prepared TFDS folder
example_gen = ImportExampleGen(input_base=str(data_root), output_config=output)

# "context" is the InteractiveContext created earlier in the notebook
context.run(example_gen)
  • Transform the text data into two tensors of shape MAX_LEN (padded if needed): one for the model input and one for the output, shifted by one (a sketch of this follows the sample output below).

This is how it looks after transformation:

{'label_sentence': array([17843,  1863, 30003,    32,     4, 30003, 30003, 30003, 30003,
       30003, 12551, 30003, 22696, 30003, 30003, 30003, 30003, 30003,
       30003,   210, 29697, 30003,  3813,  2262, 30003,   313,   370,
         667, 27087,   186,   182, 30003,   370, 10500,   186,   182,
       30003,   370,  8366,   186,   182, 30003,  9949,  1789, 30003,
       30003,   158,  1863, 30003,     8,  5169,     3,    67,  4229,
           3,   239,  3843, 30003,     5,   682,  1887, 28241, 30003,
       16798, 30003,   116,     4,   207,  1320,  1529, 30003,     2,]),

 'training_sentence': array([    1, 17843,  1863, 30003,    32,     4, 30003, 30003, 30003,
       30003, 30003, 12551, 30003, 22696, 30003, 30003, 30003, 30003,
       30003, 30003,   210, 29697, 30003,  3813,  2262, 30003,   313,
         370,   667, 27087,   186,   182, 30003,   370, 10500,   186,
         182, 30003,   370,  8366,   186,   182, 30003,  9949,  1789,
       30003, 30003,   158,  1863, 30003,     8,  5169,     3,    67,
        4229,     3,   239,  3843, 30003,     5,   682,  1887, 28241,
       30003, 16798, 30003,   116,     4,   207,  1320,  1529, 30003])}
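
For completeness, here is roughly what the shift-and-pad logic in my preprocessing_fn looks like. This is a simplified sketch: the feature name "tokens", the start-token id 1, and MAX_LEN = 72 are illustrative assumptions (the sample above has 72 positions and training_sentence starts with a 1).

import tensorflow as tf

MAX_LEN = 72      # assumed from the sample above
START_TOKEN = 1   # assumed start-of-sequence id

def _pad_to_max_len(t):
    # Right-pad with zeros, then trim so every row has exactly MAX_LEN ids.
    t = tf.pad(t, [[0, 0], [0, MAX_LEN]])
    return t[:, :MAX_LEN]

def preprocessing_fn(inputs):
    # "tokens" is a hypothetical VarLen feature holding the token ids.
    tokens = tf.sparse.to_dense(inputs["tokens"])
    starts = tf.ones_like(tokens[:, :1]) * START_TOKEN
    return {
        # Model input: start token prepended.
        "training_sentence": _pad_to_max_len(tf.concat([starts, tokens], axis=1)),
        # Label: the same sequence shifted left by one (no start token).
        "label_sentence": _pad_to_max_len(tokens),
    }
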
  • During training I one-hot encode the labels on the fly (with a vocab size of 30K) just before the model ingests them. This saves space and time compared to doing the encoding in the Transform component (see the shape check after the code below).

Here's that part of the training code:

    # Expand the integer labels (batch, MAX_LEN) into one-hot vectors
    # (batch, MAX_LEN, NUM_CLASSES) on the fly, per batch.
    train_dataset = train_dataset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
    eval_dataset = eval_dataset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))

    # Build the model under MirroredStrategy so training can use all local GPUs.
    mirrored_strategy = tf.distribute.MirroredStrategy()
    with mirrored_strategy.scope():
        model = get_model()

    tensorboard_callback = keras.callbacks.TensorBoard(
        log_dir=fn_args.model_run_dir, update_freq="batch"
    )

    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        callbacks=[tensorboard_callback],
    )
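
To make the space argument concrete (my back-of-the-envelope numbers, assuming MAX_LEN = 72 and NUM_CLASSES = 30000):

import tensorflow as tf

labels = tf.zeros([8, 72], dtype=tf.int32)  # (batch, MAX_LEN) integer ids
one_hot = tf.one_hot(labels, depth=30000)   # (batch, MAX_LEN, NUM_CLASSES) float32
print(one_hot.shape)                        # (8, 72, 30000)
# 72 * 30000 * 4 bytes is roughly 8.6 MB per example as float32, versus a few
# hundred bytes for the integer ids -- hence one-hot encoding on the fly
# instead of materializing it in Transform.
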
  • Evaluation is where I'm facing an issue. All examples of using the Evaluator component use the label from the original ExampleGen data as the source of labels.

So: is there a way that I can one-hot encode the labels on the fly before they are given to the Evaluator? The alternative would be to one-hot encode the data in the Transform component and then load it again with the ImportExampleGen component, but that is very expensive in both time and memory.
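
For reference, this is roughly the standard Evaluator wiring I would otherwise use (sketch only: "trainer" is the Trainer component, not shown above, and the label the Evaluator sees here is whatever is in the examples, i.e. integer ids, not the one-hot labels the model was trained on):

import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label_sentence")],
    slicing_specs=[tfma.SlicingSpec()],
)

evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
)
context.run(evaluator)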
