I am working with the cnn_dailymail dataset, which is part of TensorFlow Datasets. My goal is to tokenize the dataset after applying some text preprocessing steps to it.
I access and preprocess the dataset as follows:
!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
import tensorflow_datasets as tfds
data, info = tfds.load('cnn_dailymail', with_info=True)
train_data, test_data = data['train'], data['test']
def map_fn(x, start=tf.constant('<start>'), end=tf.constant('<end>')):
    strings = [start, x['highlights'], end]
    x['highlights'] = tf.strings.join(strings, separator=' ')
    return x
train_data_preproc = train_data.map(map_fn)
elem, = train_data_preproc.take(1)
elem['highlights'].numpy()
# b'<start> mother announced as imedeen ambassador . ...
To tokenize the dataset, I came across the tfds.features.text.Tokenizer class. However, it does not behave the way I want it to:
tokenizer = tfds.features.text.Tokenizer(alphanum_only=False, reserved_tokens=['<start>', '<end>'])
tokenizer.tokenize(elem['highlights'].numpy())
# ['<start>', ' ', 'mother', ' ', 'announced', ' ', 'as', ' ', 'imedeen', ' ', 'ambassador', ' . ',...]
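For reference, filtering the whitespace-only tokens out of that list after the fact does give roughly the split I am after, but it feels like a workaround rather than the intended use (plain Python on the returned list):
tokens = tokenizer.tokenize(elem['highlights'].numpy())
# drop pure-whitespace tokens and trim surrounding spaces from the rest
tokens = [t.strip() for t in tokens if t.strip()]
# ['<start>', 'mother', 'announced', 'as', 'imedeen', 'ambassador', '.', ...]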
I would like the tokenizer to simply split on whitespace rather than treat whitespace as separate tokens. Is there a way to achieve this? Would it be best if I created my own tokenizer function and then applied it using dataset.map() (roughly as sketched below)? Thanks!
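To make that second option concrete, this is what I have in mind. It is only a sketch: the tokenize_fn name and the highlight_tokens key are placeholders of my own, and it assumes tf.strings.split splits a scalar string on whitespace into a 1-D string tensor, which I have only checked against later TF 2.x documentation, not the 2.0.0-alpha0 build used above.
def tokenize_fn(x):
    # split the preprocessed highlights string on runs of whitespace
    x['highlight_tokens'] = tf.strings.split(x['highlights'])
    return x

train_data_tokens = train_data_preproc.map(tokenize_fn)
elem, = train_data_tokens.take(1)
elem['highlight_tokens'].numpy()
# hoped-for result: array([b'<start>', b'mother', b'announced', ...], dtype=object)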