I'm trying to augment the IMDB movie reviews dataset by randomly swapping some of the words in each review. Unlike with image data, I don't think TensorFlow provides this natively. With images, for example, you could do something like:
def transform(image, label):
    image = tf.image.flip_left_right(image)
    return image, label
This uses TensorFlow's native functions for flipping images. For text, though, I don't see anything in tf.strings that can do this, so I am using the Easy Data Augmentation (EDA) implementation from textaugment: https://github.com/dsfsi/textaugment
For example:
try:
    import textaugment
except ModuleNotFoundError:
    !pip install textaugment
    import textaugment
from textaugment import EDA
import nltk
nltk.download('stopwords')
t = EDA()
t.random_swap("John is going to town")
This returns, for example, "John going to town is" (the swap is random, so the output varies between calls).
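To make the operation concrete, here is a minimal sketch of a random swap on a plain string. It is not textaugment's code: `random_swap_words` and its `n` and `seed` parameters are hypothetical names chosen for illustration, and the real `EDA.random_swap` may behave differently in its details.

```python
import random


def random_swap_words(sentence, n=1, seed=None):
    """Swap two randomly chosen word positions, n times (illustrative helper)."""
    rng = random.Random(seed)  # local RNG so a seed makes the result repeatable
    words = sentence.split()
    if len(words) < 2:
        # Nothing to swap in an empty or one-word sentence.
        return sentence
    for _ in range(n):
        # Pick two distinct positions and exchange the words at them.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap_words("John is going to town", seed=0))
```

The result always contains the same words as the input, only in a different order, which is the property the augmentation relies on.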
But when I try to use random_swap to augment the entire IMDB reviews dataset, it raises an error, because the reviews are loaded as integer-encoded arrays rather than strings.
Example:
try:
    import textaugment
except ModuleNotFoundError:
    !pip install textaugment
    import textaugment
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.datasets import imdb
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 1
runs = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
from textaugment import EDA
import nltk
nltk.download('stopwords')
t = EDA()
for text in x_train:
    text = t.random_swap(text)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-7fc9edb2f37b> in <module>()
      1 for text in x_train:
----> 2     text = t.random_swap(text)

/usr/local/lib/python3.7/dist-packages/textaugment/eda.py in validate(**kwargs)
     72             raise TypeError("p must be a fraction between 0 and 1")
     73         if 'sentence' in kwargs:
---> 74             if not isinstance(kwargs['sentence'].strip(), str) or len(kwargs['sentence'].strip()) == 0:
     75                 raise TypeError("sentence must be a valid sentence")
     76         if 'n' in kwargs:

AttributeError: 'numpy.ndarray' object has no attribute 'strip'
So how do you augment data in TensorFlow when no native operation exists and you need to write a custom function to do the augmentation?
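One direction that avoids the string requirement is to swap the integer token ids directly, since each padded review is just a sequence of word indices. The sketch below is an assumption-laden illustration, not textaugment's method: `random_swap_ids` and `PAD_ID` are hypothetical names, and it assumes the padding value is 0 (the default for `pad_sequences`), so padding positions are left untouched.

```python
import random

PAD_ID = 0  # assumed padding value; pad_sequences pads with 0 by default


def random_swap_ids(token_ids, n=1, seed=None):
    """Swap n random pairs of non-padding token ids (illustrative helper)."""
    rng = random.Random(seed)
    ids = list(token_ids)  # copy, so the input sequence is not modified
    # Only real tokens are eligible; padding stays where it is.
    positions = [k for k, tok in enumerate(ids) if tok != PAD_ID]
    if len(positions) < 2:
        return ids
    for _ in range(n):
        i, j = rng.sample(positions, 2)
        ids[i], ids[j] = ids[j], ids[i]
    return ids


# A made-up, pre-padded review of five tokens.
padded = [0, 0, 0, 1, 14, 22, 16, 43]
augmented = random_swap_ids(padded, n=2, seed=0)
print(augmented)
```

This works on NumPy rows as well as lists, because the function copies its input into a list first. Note that it may also move the start-of-sequence token. Inside a `tf.data` pipeline, an arbitrary Python function like this is usually wrapped with `tf.py_function` before being passed to `Dataset.map`; that wiring is not shown here.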