I am trying to build a tf.data.Dataset
pipeline that reads 16 tab-separated, gzip-compressed
files, where each line contains a sentence, a useless file indicator, and a label. I'd like to apply a tokenizer only to the first element (the sentence) of each record. Additionally, I'd love to drop the middle element.
Here is my code:
ds = tf.data.Dataset.list_files("references/reads/*.txt.gz")
ds = tf.data.TextLineDataset(filenames=ds, compression_type="GZIP", num_parallel_reads=tf.data.experimental.AUTOTUNE)
ds = ds.map(lambda x: tf.strings.split(x, "\t"), num_parallel_calls=tf.data.experimental.AUTOTUNE)
Here is the data:
>>> [print(a) for a in ds.take(2)]
tf.Tensor([b'Happy little sentence.' b'Useless Text' b'Label'], shape=(3,), dtype=string)
I'd like to apply my tokenizer only to the first element of the tensor ('Happy little sentence.').
Bonus points if I can also drop 'Useless Text'.
Here is my unsuccessful approach:
import tensorflow_text as text

with open('my_tokenizer.model', 'rb') as f_in:
    model = f_in.read()
s = text.SentencepieceTokenizer(model=model)
ds = ds.map(lambda x: s.tokenize(x), num_parallel_calls=tf.data.experimental.AUTOTUNE)
This tokenizes every element of each record, not just the sentence!
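For clarity, here is the kind of record-level transformation I'm after, sketched on an in-memory dataset with tf.strings.split standing in for the real SentencePiece tokenizer (since the model file isn't reproducible here):

```python
import tensorflow as tf

# Stand-in for s.tokenize: whitespace split. The goal is the same shape of
# transformation: tokenize x[0], keep x[2] as the label, drop x[1] entirely.
def select_and_tokenize(x):
    return tf.strings.split(x[0]), x[2]

# One fake record, matching the split lines shown above.
ds = tf.data.Dataset.from_tensor_slices(
    [["Happy little sentence.", "Useless Text", "Label"]])
ds = ds.map(select_and_tokenize)

for tokens, label in ds:
    print(tokens.numpy(), label.numpy())
```

The question is how to express this (tokenize only the first element, keep the third) with the real SentencepieceTokenizer inside the map.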