I am working with the cnn_dailymail dataset, which is part of TensorFlow Datasets. My goal is to tokenize the dataset after applying some text preprocessing steps to it.
I access and preprocess the dataset as follows:
!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
import tensorflow_datasets as tfds
data, info = tfds.load('cnn_dailymail', with_info=True)
train_data, test_data = data['train'], data['test']
def map_fn(x, start=tf.constant('<start>'), end=tf.constant('<end>')):
    strings = [start, x['highlights'], end]
    x['highlights'] = tf.strings.join(strings, separator=' ')
    return x
train_data_preproc = train_data.map(map_fn)
elem, = train_data_preproc.take(1)
elem['highlights'].numpy()
# b'<start> mother announced as imedeen ambassador . ...
To tokenize the dataset, I came across the tfds.features.text.Tokenizer class. However, it does not behave the way I want it to:
tokenizer = tfds.features.text.Tokenizer(alphanum_only=False, reserved_tokens=['<start>', '<end>'])
tokenizer.tokenize(elem['highlights'].numpy())
# ['<start>', ' ', 'mother', ' ', 'announced', ' ', 'as', ' ', 'imedeen', ' ', 'ambassador', ' . ',...]
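For reference, filtering the whitespace-only tokens out of that list after the fact does give roughly the split I am after, but it feels like a workaround rather than the intended use (plain Python on the returned list):
tokens = tokenizer.tokenize(elem['highlights'].numpy())
# drop pure-whitespace tokens and trim surrounding spaces from the rest
tokens = [t.strip() for t in tokens if t.strip()]
# ['<start>', 'mother', 'announced', 'as', 'imedeen', 'ambassador', '.', ...]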
I would like the tokenizer to simply split on whitespace rather than treat whitespace as separate tokens. Is there a way to achieve this? Would it be best if I created my own tokenizer function and then applied it using dataset.map() (roughly as sketched below)? Thanks!
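To make that second option concrete, this is what I have in mind. It is only a sketch: the tokenize_fn name and the highlight_tokens key are placeholders of my own, and it assumes tf.strings.split splits a scalar string on whitespace into a 1-D string tensor, which I have only checked against later TF 2.x documentation, not the 2.0.0-alpha0 build used above.
def tokenize_fn(x):
    # split the preprocessed highlights string on runs of whitespace
    x['highlight_tokens'] = tf.strings.split(x['highlights'])
    return x

train_data_tokens = train_data_preproc.map(tokenize_fn)
elem, = train_data_tokens.take(1)
elem['highlight_tokens'].numpy()
# hoped-for result: array([b'<start>', b'mother', b'announced', ...], dtype=object)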