14

Now I use the following function for shuffling:

from tensorflow.contrib import data
def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)
    dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
    dataset = dataset.batch(batch_size)

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

But this only shuffles the data within a window of buffer_size elements, and the buffer is filled in file order.

My data is enormous, so I cannot set buffer_size large enough. Is there any other solution that lets me shuffle the whole dataset?

danche
  • Maybe in later parts of your code you will transform the data to a `Tensor`? If so, you can use `tf.random_shuffle`. – garciparedes Jun 28 '17 at 06:33
  • The part transferred to a `Tensor` is just the `batch_part` rather than all the data... – danche Jun 28 '17 at 07:07
  • Will creating a shuffled filename queue with `tf.train.string_input_producer` before the data queue address your problem? – Vijay Mariappan Jun 28 '17 at 07:20
  • Thanks, but this will cause other problems; see https://stackoverflow.com/questions/44549245/how-to-use-tensorflow-tf-train-string-input-producer-to-produce-several-epochs-d – danche Jun 28 '17 at 07:39
  • Do you mind the shuffling being a preprocessing step before the model is trained? If not, look into the `shuf` unix command. – Nathan May 10 '18 at 19:58

2 Answers

8

Currently there is no support in the Dataset API for shuffling a whole Dataset that is larger than the shuffle buffer (e.g. more than 10k examples). According to this thread, the common approach is as follows (a code sketch is given after the list):

  1. Randomly shuffle the entire dataset once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
  2. In each epoch:

    a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).

    b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.

    c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
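A minimal sketch of steps (a)-(c) using the `tf.data` API might look like the following. The file pattern, `cycle_length`, and buffer size below are illustrative assumptions and should be tuned to the size of your shards:

    import tensorflow as tf

    def sharded_input_pipeline(file_pattern, batch_size, num_shards,
                               cycle_length=8, shuffle_buffer=10000):
        # (a) Shuffle the list of shard filenames at the start of each epoch.
        files = tf.data.Dataset.list_files(file_pattern).shuffle(num_shards)

        # (b) Interleave records read from `cycle_length` shards at a time.
        dataset = files.interleave(
            lambda filename: tf.data.TextLineDataset(filename),
            cycle_length=cycle_length)

        # (c) Shuffle with a buffer larger than a single shard.
        dataset = dataset.shuffle(shuffle_buffer)

        return dataset.batch(batch_size)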

zohar.kom
0

If you want to shuffle the whole dataset at once, you can use `dataset.shuffle(dataset.cardinality())`:

Note: `shuffle(dataset.cardinality())` loads the full dataset into memory so that it can be shuffled. This will cause an out-of-memory (OOM) error if the dataset is too large, so a full shuffle should only be used for datasets that are known to fit in memory, such as datasets of filenames or other small datasets.

As you can see, this will cause an out-of-memory (OOM) error if you don't have enough memory.
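For a dataset that does fit in memory, this full-buffer shuffle might look like the following (the toy dataset is purely illustrative):

    import tensorflow as tf

    dataset = tf.data.Dataset.range(10)  # small, fits in memory
    # Buffer size equals the number of elements, so every element is shuffled.
    shuffled = dataset.shuffle(dataset.cardinality())
    print(list(shuffled.as_numpy_iterator()))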

So I wrote the following method to keep memory usage under control.

Please Note:

  • I use it for displaying or exploring my test dataset.
  • I do not recommend using it for training and validation; please use TensorFlow's native methods in that case.
    import tensorflow as tf
    from tqdm import tqdm

    def tf_shuffle_dataset(dataset, batch_size, seed=None):
        """
        Shuffles a TensorFlow dataset in a memory-friendly way by shuffling within batches and then shuffling the order of the batches themselves.

        Args:
        - dataset (tf.data.Dataset): The input dataset to shuffle.
        - batch_size (int): Size of each batch.
        - seed (int, optional): Seed for shuffle reproducibility.

        Returns:
        - tf.data.Dataset: Shuffled dataset.

        Example:
        --------
        Let's consider a dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and batch_size = 2.

        1. The dataset is divided into the following batches:
           [1, 2], [3, 4], [5, 6], [7, 8], [9, 10]

        2. Each batch is shuffled. Let's assume the shuffled batches are:
           [2, 1], [4, 3], [6, 5], [8, 7], [10, 9] (Note: The actual shuffle might differ)

        3. The order of these shuffled batches is then shuffled. Let's assume the shuffled order is:
           [4, 3], [2, 1], [8, 7], [10, 9], [6, 5] (Note: The actual shuffle might differ)

        4. These batches are concatenated together to give the final shuffled dataset:
           [4, 3, 2, 1, 8, 7, 10, 9, 6, 5]
        """
        if not isinstance(dataset, tf.data.Dataset):
            raise ValueError("The provided dataset is not an instance of tf.data.Dataset.")

        # Split the dataset into batches (elements beyond the last full batch are dropped)
        num_elements = sum(1 for _ in dataset)  # count elements by iterating once (eager mode)
        num_batches = num_elements // batch_size

        batches = [dataset.skip(i * batch_size).take(batch_size) for i in range(num_batches)]

        # Shuffle each batch individually
        shuffled_batches = [batch.shuffle(buffer_size=batch_size, seed=seed) for batch in batches]

        # Shuffle the order of the batches themselves
        batch_order = tf.random.shuffle(tf.range(num_batches), seed=seed)

        # Concatenate the batches in the shuffled order to create the final dataset
        shuffled_dataset = shuffled_batches[int(batch_order[0])]
        for i in tqdm(batch_order[1:], desc="Shuffling dataset", unit="batch"):
            shuffled_dataset = shuffled_dataset.concatenate(shuffled_batches[int(i)])

        return shuffled_dataset
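A quick usage example on a toy dataset (the values are illustrative):

    ds = tf.data.Dataset.range(10)
    shuffled = tf_shuffle_dataset(ds, batch_size=2, seed=42)
    print(list(shuffled.as_numpy_iterator()))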
YanSte