11

I want to write a function in TensorFlow 2.0 that shuffles data and their target labels before each training iteration.

Let's say I have two numpy datasets, X and y, representing data and labels for classification. How can I shuffle them at the same time?

Using sklearn it's pretty easy:

from sklearn.utils import shuffle
X, y = shuffle(X, y)

How can I do the same in TensorFlow 2.0? The only tool I found in the documentation is tf.random.shuffle, but it takes only one object at a time, and I need to shuffle two together.

Leevo

3 Answers

6

Instead of shuffling x and y directly, it's much easier to shuffle their indices. First, generate a tensor of indices:

indices = tf.range(start=0, limit=tf.shape(x_data)[0], dtype=tf.int32)

then shuffle these indices

idx = tf.random.shuffle(indices)

and use these indices to shuffle the data

x_data = tf.gather(x_data, idx)
y_data = tf.gather(y_data, idx)

and you'll have shuffled data, with each row of x_data still aligned with its label in y_data.
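The steps above can be wrapped in a single helper. This is only a minimal sketch: the function name `shuffle_in_unison`, its optional `seed` argument, and the toy arrays are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import tensorflow as tf


def shuffle_in_unison(x, y, seed=None):
    """Shuffle x and y along axis 0 using one shared permutation (hypothetical helper)."""
    x = tf.convert_to_tensor(x)
    y = tf.convert_to_tensor(y)
    # One permutation of the row indices, applied to both tensors
    idx = tf.random.shuffle(tf.range(tf.shape(x)[0]), seed=seed)
    return tf.gather(x, idx), tf.gather(y, idx)


# Toy data: row i of X is [i, i] and its label is i, so pairs are easy to check
X = np.stack([np.arange(10), np.arange(10)], axis=1).astype(np.float32)
y = np.arange(10, dtype=np.int32)

X_shuf, y_shuf = shuffle_in_unison(X, y)
```

Because both tensors are gathered with the same `idx`, every row of `X_shuf` should still sit next to its original label in `y_shuf`.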

Imtinan Azhar
3

First, convert them to tf.data.Dataset objects.

x_train = tf.data.Dataset.from_tensor_slices(x)
y_train = tf.data.Dataset.from_tensor_slices(y)

Once that's done, you can simply shuffle them:

x_train, y_train = x_train.shuffle(buffer_size=2, seed=2), y_train.shuffle(buffer_size=2, seed=2)
# Combine the two datasets with zip; from_tensor_slices expects tensors, not Dataset objects
dataset = tf.data.Dataset.zip((x_train, y_train))

Use the same seed and buffer size for both datasets so that they are shuffled in the same order and the feature-target pairing is preserved. You can even create a function to do the shuffling:

BF = 2
SEED = 2
def shuffling(dataset, bf, seed_number):
   return dataset.shuffle(buffer_size=bf, seed=seed_number)

x_train, y_train = shuffling(x_train, BF, SEED), shuffling(y_train, BF, SEED)
# Combine the two datasets with zip; from_tensor_slices expects tensors, not Dataset objects
dataset = tf.data.Dataset.zip((x_train, y_train))
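An alternative that does not depend on two shuffles staying in sync is to pair the elements first and then shuffle the pairs as single units. The following is a rough sketch under that assumption; the toy arrays and the names `x_ds`, `y_ds`, and `paired` are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Toy data: row i of x is [2i, 2i + 1] and label i belongs to row i
x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.int32)

x_ds = tf.data.Dataset.from_tensor_slices(x)
y_ds = tf.data.Dataset.from_tensor_slices(y)

# Pair the elements first, then shuffle the (features, label) pairs together
paired = tf.data.Dataset.zip((x_ds, y_ds)).shuffle(buffer_size=len(x))

# Materialize the shuffled pairs to inspect them
pairs = [(features.numpy(), int(label.numpy())) for features, label in paired]
```

Here `buffer_size=len(x)` makes the buffer as large as the dataset, which gives a full shuffle; a smaller buffer only mixes elements locally.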
Omar
  • I get the following error when I try to combine two datasets `dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))`: `ValueError: Slicing dataset elements is not supported for rank 0`. – Mykola Zotko Jun 30 '22 at 17:18
2

If you just want to shuffle two arrays in the same way, you can do:

import tensorflow as tf

# Assuming X and y are initially NumPy arrays
X = tf.convert_to_tensor(X)
y = tf.convert_to_tensor(y)
# Make random permutation
perm = tf.random.shuffle(tf.range(tf.shape(X)[0]))
# Reorder according to permutation
X = tf.gather(X, perm, axis=0)
y = tf.gather(y, perm, axis=0)

However, you may consider using a tf.data.Dataset, which already provides a shuffle method.

import tensorflow as tf

# You may use a placeholder if in graph mode
# (see https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays)
ds = tf.data.Dataset.from_tensor_slices((X, y))
# Shuffle with some buffer size (len(X) will use a buffer as big as X)
ds = ds.shuffle(buffer_size=len(X))
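Since the question asks for a reshuffle before each training iteration, here is a hedged sketch of how the dataset approach might look over several epochs. `BATCH_SIZE`, `NUM_EPOCHS`, and the toy arrays are assumptions for illustration, and the training step itself is left as a comment.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 2   # hypothetical batch size
NUM_EPOCHS = 3   # hypothetical number of epochs

# Toy data: row i of X is [2i, 2i + 1] and its label is i
X = np.arange(12, dtype=np.float32).reshape(6, 2)
y = np.arange(6, dtype=np.int32)

ds = (
    tf.data.Dataset.from_tensor_slices((X, y))
    # Request a fresh shuffle order each time the dataset is iterated
    .shuffle(buffer_size=len(X), reshuffle_each_iteration=True)
    .batch(BATCH_SIZE)
)

epoch_labels = []
for epoch in range(NUM_EPOCHS):
    labels = []
    for x_batch, y_batch in ds:
        # A training step on (x_batch, y_batch) would go here
        labels.extend(y_batch.numpy().tolist())
    epoch_labels.append(labels)
```

Each pass over `ds` should visit every sample once, in an order drawn anew for that epoch, with features and labels kept together because they are sliced as a tuple.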
jdehesa