Create input/output sequences from separate columns in csv for seq2seq decoder in tensorflow

Question

I am experimenting with tensorflow seq2seq and am having trouble finding a good way to add the "GO", "EOS" and "PAD" elements to a sequence of labels. I read the data from a .csv file using tf.TextLineReader. The .csv has a single column for Text, followed by 4 more columns, one for each of the sequential labels.

Here is the sample csv that I created (sample_input.csv):

"This is an example text message that we want to have",Label1,Label2,Label3,
"Here is another sentences that we want to load",Label10,,,

And here is my code that reads this csv in:

import tensorflow as tf
tf.reset_default_graph()

_BATCH_SIZE = 2

sess = tf.InteractiveSession()

filename_queue = tf.train.string_input_producer(['sample_input.csv'])
reader = tf.TextLineReader(skip_header_lines=1)

_, rows = reader.read_up_to(filename_queue, num_records=_BATCH_SIZE)

row_columns = tf.expand_dims(rows, -1)
Text, Label1, Label2, Label3, Label4 = tf.decode_csv(
    row_columns, record_defaults=[[""],[""],[""],[""],[""]])

start = tf.constant(["START"] * _BATCH_SIZE)
start = tf.expand_dims(start, -1)
input_seq = tf.string_join(
    inputs = [start, Label1, Label2, Label3],separator = ", ")

output_seq = tf.string_join(
    inputs = [Label1, Label2, Label3, Label4],separator = ", ")
features = tf.stack([Text, input_seq, output_seq])

sess.run(tf.global_variables_initializer())
tf.train.start_queue_runners()
features.eval()

The example above will print out the following upon features.eval():

array([[[b'This is an example text message that we want to have'],
    [b'Here is another sentences that we want to load']],

    [[b'START, Label1, Label2, Label3'],[b'START, Label10, , ']],

    [[b'Label1, Label2, Label3, '], [b'Label10, , , ']]], dtype=object)

I know that this is not the right place to create these sequences, but I am hoping for suggestions on how to create them properly. The label sequences vary in length: some rows have only 1 label, while others have all 4. Ideally, my inputs would end up as follows.

Single Label:

decoder_input = [GO, Label1, PAD, PAD]
decoder_output = [Label1, END, PAD, PAD]

Double Label:

decoder_input = [GO, Label1, Label2, PAD]
decoder_output = [Label1, Label2, END, PAD]

Three Labels:

decoder_input = [GO, Label1, Label2, Label3]
decoder_output = [Label1, Label2, Label3, END]

Four Labels:

decoder_input = [GO, Label1, Label2, Label3]
decoder_output = [Label1, Label2, Label3, Label4]

NOTE: there is no END token in this last case, since the output is already 4 elements long.
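To make the target layout concrete, the four cases above can be written as a small plain-Python function. This is only a sketch of the desired result, not a TensorFlow solution: the function name build_decoder_sequences and the max_len parameter are made up for illustration, and it assumes that missing labels arrive as empty strings (as they do with the record_defaults used above).

```python
GO, END, PAD = "GO", "END", "PAD"


def build_decoder_sequences(labels, max_len=4):
    """Return (decoder_input, decoder_output), each of length max_len.

    Hypothetical helper: `labels` is the list of label columns for one row,
    with empty strings standing in for missing labels.
    """
    # Drop the empty strings produced by missing csv columns.
    labels = [label for label in labels if label][:max_len]

    # Input: GO followed by the labels, truncated to max_len.
    decoder_input = ([GO] + labels)[:max_len]

    # Output: the labels followed by END, truncated to max_len.
    # When all max_len label slots are filled, END is cut off.
    decoder_output = (labels + [END])[:max_len]

    # Right-pad both sequences with PAD up to max_len.
    decoder_input += [PAD] * (max_len - len(decoder_input))
    decoder_output += [PAD] * (max_len - len(decoder_output))
    return decoder_input, decoder_output


if __name__ == "__main__":
    # The second row of the sample csv: one label, three empty columns.
    print(build_decoder_sequences(["Label10", "", "", ""]))
```

The open part of the question is how to get the same result inside the input pipeline, rather than in Python after the rows have been read.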

Can anybody propose a better approach to creating the decoder input/output from four separate columns in a csv?
