
I am trying to build a tf.data.Dataset pipeline that reads 16 tab-separated, gzip-compressed text files. Each line contains a sentence, a useless file indicator, and a label. I'd like to apply a tokenizer to the first field only and, ideally, drop the middle field. Here is my code:

import tensorflow as tf

ds = tf.data.Dataset.list_files("references/reads/*.txt.gz")
ds = tf.data.TextLineDataset(filenames=ds, compression_type="GZIP", num_parallel_reads=tf.data.experimental.AUTOTUNE)
ds = ds.map(lambda x: tf.strings.split(x, "\t"), num_parallel_calls=tf.data.experimental.AUTOTUNE)

Here is the data:

>>> [print(a) for a in ds.take(2)]
tf.Tensor([b'Happy little sentence.'  b'Useless Text'  b'Label'], shape=(3,), dtype=string)

I'd like to apply my tokenizer to only the first element of the tensor ('Happy little sentence.'). Bonus points if I can delete 'Useless Text'. Here is my unsuccessful attempt:

import tensorflow_text as text

# Load the serialized SentencePiece model as bytes.
with open('my_tokenizer.model', 'rb') as f_in:
    model = f_in.read()
s = text.SentencepieceTokenizer(model=model)
ds = ds.map(lambda x: s.tokenize(x), num_parallel_calls=tf.data.experimental.AUTOTUNE)

This tokenizes all three fields, not just the sentence!

AloneTogether
Oliver

1 Answer


Assuming each tensor always has exactly 3 elements (a sentence, a useless file indicator, and a label), you could try indexing the first and last elements inside the mapped function, tokenizing only the first:

import tensorflow as tf
import tensorflow_text as tf_text
import requests

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true"
model = requests.get(url).content

ds = tf.data.Dataset.from_tensor_slices(([['Happy little sentence.',  'Useless Text',  'Faust'],
                                        ['Happy little sentence1.',  'Useless Text1',  'Faust1'],
                                        ['Happy little sentence2.',  'Useless Text2',  'Faust2']]))

s = tf_text.SentencepieceTokenizer(model=model)

def transform_data(x):
  # x has shape (3,): tokenize the sentence (index 0), skip index 1, keep the label (index 2).
  return s.tokenize(x[0]), x[2]

ds = ds.map(transform_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for d in ds:
  print(d)

This prints:

(<tf.Tensor: shape=(11,), dtype=int32, numpy=array([  4, 165,  19,  29,  29,  34, 544, 331,  15, 256,   6], dtype=int32)>, <tf.Tensor: shape=(), dtype=string, numpy=b'Faust'>)
(<tf.Tensor: shape=(12,), dtype=int32, numpy=
array([  4, 165,  19,  29,  29,  34, 544, 331,  15, 256, 357,   6],
      dtype=int32)>, <tf.Tensor: shape=(), dtype=string, numpy=b'Faust1'>)
(<tf.Tensor: shape=(12,), dtype=int32, numpy=
array([  4, 165,  19,  29,  29,  34, 544, 331,  15, 256, 596,   6],
      dtype=int32)>, <tf.Tensor: shape=(), dtype=string, numpy=b'Faust2'>)
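
The same indexing should carry over to the gzip pipeline from the question. Here is a minimal sketch of that, under a few assumptions: the two temporary files stand in for the real `references/reads/*.txt.gz` files, the helper names `tokenize` and `split_and_select` are made up for illustration, and `tokenize` is a plain whitespace splitter standing in for `s.tokenize` so the snippet runs without a model file.

```python
import gzip
import os
import tempfile

import tensorflow as tf

# Write two tiny gzip-compressed, tab-separated files (placeholders for the real data).
tmp_dir = tempfile.mkdtemp()
rows = [
    "Happy little sentence.\tUseless Text\tLabel",
    "Another happy sentence.\tUseless Text\tLabel2",
]
for i, row in enumerate(rows):
    with gzip.open(os.path.join(tmp_dir, f"part{i}.txt.gz"), "wt") as f_out:
        f_out.write(row + "\n")


def tokenize(sentence):
    # Stand-in for s.tokenize(sentence): whitespace splitting only.
    return tf.strings.split(sentence)


def split_and_select(line):
    # Split one raw line into its three tab-separated fields.
    fields = tf.strings.split(line, "\t")
    # Tokenize field 0, skip field 1, pass field 2 through as the label.
    return tokenize(fields[0]), fields[2]


files = tf.data.Dataset.list_files(os.path.join(tmp_dir, "*.txt.gz"), shuffle=False)
ds = tf.data.TextLineDataset(files, compression_type="GZIP")
ds = ds.map(split_and_select, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Collect (tokens, label) pairs as plain Python values for inspection.
results = [(tokens.numpy().tolist(), label.numpy()) for tokens, label in ds]
```

Doing the split and the selection in one mapped function means the useless field never leaves the map step. With the real tokenizer, replacing the body of `tokenize` with `s.tokenize(sentence)` should be the only change needed.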
AloneTogether
  • Thank you so much for your help! However, that returns two tensor objects. Is there a way to return the same information but in one tensor? – Oliver Jan 31 '22 at 16:09
  • What do you mean exactly? They have two different data types, and I thought the first tensor was your data and the last tensor was your labels. – AloneTogether Jan 31 '22 at 16:12
  • you're right, bad idea :) – Oliver Jan 31 '22 at 16:18
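
To illustrate the data-type point from the comments: a single tensor holds one dtype, so packing the int32 token ids and the string label together would mean converting the ids to strings first. A rough sketch follows, with made-up token ids and variable names:

```python
import tensorflow as tf

tokens = tf.constant([4, 165, 19], dtype=tf.int32)  # illustrative token ids
label = tf.constant("Faust")                        # scalar string label

# Convert the ids to strings so both parts share the tf.string dtype,
# then append the label as the last element.
combined = tf.concat([tf.strings.as_string(tokens), tf.expand_dims(label, 0)], axis=0)

# Getting usable ids back requires parsing the strings again.
restored = tf.strings.to_number(combined[:-1], out_type=tf.int32)
```

The round trip through strings is extra work on every element, which is one reason returning a `(tokens, label)` tuple is usually the more practical layout.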