Is there a method to yield numpy arrays from image_dataset_from_directory()?

Question

I'm adapting the Keras semantic image clustering example (https://keras.io/examples/vision/semantic_image_clustering/#selfsupervised-representation-learning) to a custom dataset. The example uses NumPy arrays for the training and test data.

So far, I've been able to adapt the example to datasets created with image_dataset_from_directory(). However, the last few steps appear to need the labels and the corresponding learned nearest neighbours as inputs. Since I can't index a tf.data.Dataset, I seem to be stuck. The final model also requires multiple inputs, and from past experience a generator isn't great for that.

My major problem is I cannot convert the datasets to numpy arrays. They are too large to fit in memory, hence the generator.

Is there a method to create a custom batch generator from a directory? My directory is organised so that each class has its own folder. Really, I just want the generator to yield NumPy arrays that I can split into x_train, y_train, etc. I have made a custom generator from dataframes with multiple inputs before, but my implementation seemed slow and I haven't quite figured out how to get the generated nearest neighbours to work with it yet. I'd appreciate any brighter ideas or tips!
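One possible approach is to index the file paths rather than the pixels: even with several hundred thousand images, an array of paths and integer labels is small enough to keep in memory and to index freely. The sketch below is only an illustration and is not part of the Keras example. `index_directory`, `batch_generator` and the `load_fn` parameter are hypothetical names; `load_fn` stands in for whatever decodes and resizes one image file into an array. It assumes one sub-folder per class and assigns integer labels by sorted folder name.

```python
from pathlib import Path

import numpy as np


def index_directory(root):
    """Return (paths, labels, class_names) for a root with one sub-folder per class."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    paths, labels = [], []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                paths.append(str(path))
                labels.append(label)
    return np.array(paths), np.array(labels, dtype=np.int32), class_names


def batch_generator(paths, labels, batch_size, load_fn):
    """Yield (x, y) NumPy batches, loading only one batch of images at a time.

    load_fn is a hypothetical callable: file path -> array of a fixed shape.
    """
    for start in range(0, len(paths), batch_size):
        stop = start + batch_size
        x = np.stack([load_fn(p) for p in paths[start:stop]])
        y = labels[start:stop]
        yield x, y
```

Because `paths` and `labels` are plain NumPy arrays, they can be indexed or shuffled together, which a tf.data.Dataset does not allow. If staying with the existing dataset is preferable, `train_generator.as_numpy_iterator()` also yields `(images, labels)` batches as NumPy arrays, one batch at a time.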

The original method to generate the data in the example is:

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

The equivalent call doesn't work on a batched dataset (example below); it raises a TypeError because a BatchDataset is not callable:

train_generator = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,
    image_size=(img_width, img_height),
    batch_size=batch_size,
    labels='inferred',
    seed=42)

(x_train, y_train) = train_generator()

I have also tried with ImageDataGenerator but this doesn't seem to yield the correct tuple for me to split into images and labels.

I've also tried unbatching the dataset and iterating over it (the attempts are shown below), but the dataset is too large to fit in memory, which is why I'm using the generator in the first place. I know there are many ways to convert a dataset to arrays, but I have about 400,000 images and 4,000 classes just in this test example, which will increase to around 900,000 when I try the whole dataset. My preferred solution would be something similar to the last snippet below, for ease of working with the Keras tutorial.

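# Attempt 1: unbatch the dataset and split it into images and labels.
train_ds = train_generator.unbatch()
images = train_ds.map(lambda x, y: x)
labels = train_ds.map(lambda x, y: y)

# Attempt 2: concatenate every batch into one array (this is what runs out of memory).
# Note: .next() is not a tf.data.Dataset method, so this assumes an ImageDataGenerator-style iterator.
x_train = np.concatenate([train_generator.next()[0] for _ in range(len(train_generator))])
y_train = np.concatenate([train_generator.next()[1] for _ in range(len(train_generator))])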
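Concatenating every batch builds the full array in RAM, which is what fails here. At 220 x 220 x 3 (the shape in the error further down), one float32 image takes 580,800 bytes, so 400,000 images come to roughly 232 GB, or about 58 GB as uint8. If that much disk space is available, one option is to stream the batches into a disk-backed `.npy` file and open it memory-mapped, which gives an indexable `x_train` without holding it in memory. This is a minimal sketch, assuming the batches come from any iterable of NumPy arrays; `write_batches_to_memmap` and its parameters are hypothetical names.

```python
import numpy as np


def write_batches_to_memmap(batches, num_samples, sample_shape, path, dtype=np.uint8):
    """Stream image batches into a disk-backed .npy file, one batch at a time.

    Assumption: num_samples is known up front (e.g. from counting files).
    Float batches are cast to `dtype` on assignment; casting to uint8
    drops the fractional part, so pass dtype=np.float32 if that matters.
    """
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=dtype, shape=(num_samples, *sample_shape)
    )
    offset = 0
    for x in batches:
        out[offset:offset + len(x)] = x
        offset += len(x)
    out.flush()  # make sure everything is written to disk
    return offset  # number of samples actually written
```

The file can then be reopened with `np.load(path, mmap_mode="r")`, and slices such as `x_train[idx]` read only the requested rows from disk.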

For reference, this is the final step in the modelling, where the inputs are the anchor (image) and the nearest neighbours from a previous step. I knew train_generator wouldn't work here, but I threw it in anyway out of curiosity to see what the error was:

def create_clustering_learner(clustering_model):
    anchor = keras.Input(shape=input_shape, name="anchors")
    neighbours = keras.Input(
        shape=tuple([k_neighbours]) + input_shape, name="neighbours"
    )

    # Changes neighbours shape to [batch_size * k_neighbours, width, height, channels]
    neighbours_reshaped = tf.reshape(neighbours, shape=tuple([-1]) + input_shape)

    # anchor_clustering shape: [batch_size, num_clusters]
    anchor_clustering = clustering_model(anchor)

    # neighbours_clustering shape: [batch_size * k_neighbours, num_clusters]
    neighbours_clustering = clustering_model(neighbours_reshaped)

    # Convert neighbours_clustering shape to [batch_size, k_neighbours, num_clusters]
    neighbours_clustering = tf.reshape(
        neighbours_clustering,
        shape=(-1, k_neighbours, tf.shape(neighbours_clustering)[-1]),
    )

    # similarity shape: [batch_size, 1, k_neighbours]
    similarity = tf.linalg.einsum(
        "bij,bkj->bik", tf.expand_dims(anchor_clustering, axis=1), neighbours_clustering
    )

    # similarity shape:  [batch_size, k_neighbours]
    similarity = layers.Lambda(lambda x: tf.squeeze(x, axis=1), name="similarity")(
        similarity
    )

    # Create the model.
    model = keras.Model(
        inputs=[anchor, neighbours],
        outputs=[similarity, anchor_clustering],
        name="clustering_learner",
    )
    return model

# If tune_encoder_during_clustering is set to False,
# then freeze the encoder weights.
for layer in encoder.layers:
    layer.trainable = tune_encoder_during_clustering
    
# Create the clustering model and learner.
clustering_model = create_clustering_model(encoder, num_clusters, name="clustering")
clustering_learner = create_clustering_learner(clustering_model)

# Instantiate the model losses.
losses = [ClustersConsistencyLoss(), ClustersEntropyLoss(entropy_loss_weight=5)]

# Create the model inputs and labels.
inputs = {"anchors": train_generator, "neighbours": neighbours}
labels = tf.ones(shape=(train_generator))

# Compile the model.
clustering_learner.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=0.0005, weight_decay=0.0001),
    loss=losses,
)

# Begin training the model.
clustering_learner.fit(train_generator, batch_size=512, epochs=50)

ValueError: Attempt to convert a value (<BatchDataset element_spec=(TensorSpec(shape=(None, 220, 220, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>) to a Tensor.

Or, if I comment out the labels (since the generator is supposed to yield them anyway):

ValueError: Layer "clustering_learner" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, 220, 220, 3) dtype=float32>]

And there seems to be no way to pass anything else, since the y argument is not allowed when the input is a dataset.
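Keras models with named inputs accept batches of the form `(inputs_dict, targets)`, so one way around the single-input error may be a generator that yields both inputs together. The sketch below rests on several assumptions: `clustering_batches` and `load_image` are hypothetical names, `load_image` maps a sample index to one image array, and `neighbour_idx` is an integer array of shape `(num_samples, k)` holding the neighbour indices from the earlier step. The targets are all ones, mirroring the `tf.ones` labels above.

```python
import numpy as np


def clustering_batches(load_image, neighbour_idx, batch_size, rng=None):
    """Yield ({"anchors": ..., "neighbours": ...}, labels) one batch at a time.

    load_image: hypothetical callable, sample index -> (H, W, C) array.
    neighbour_idx: int array (num_samples, k) of nearest-neighbour indices.
    rng: optional np.random.Generator used to shuffle the sample order.
    """
    num_samples = neighbour_idx.shape[0]
    order = np.arange(num_samples)
    if rng is not None:
        rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        batch = order[start:start + batch_size]
        # anchors shape: [batch, H, W, C]
        anchors = np.stack([load_image(i) for i in batch])
        # neighbours shape: [batch, k, H, W, C]
        neighbours = np.stack(
            [np.stack([load_image(j) for j in neighbour_idx[i]]) for i in batch]
        )
        labels = np.ones(len(batch), dtype=np.float32)
        yield {"anchors": anchors, "neighbours": neighbours}, labels
```

The dictionary keys match the `name` arguments of the two `keras.Input` layers. A plain generator is exhausted after one pass, so for several epochs it would need to loop indefinitely with `steps_per_epoch` set, or be wrapped with `tf.data.Dataset.from_generator` and an `output_signature`, which calls the generator function again on each iteration. Each batch loads `batch_size * (k + 1)` images, so loading speed matters.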

Is there a method to yield numpy arrays from image_dataset_from_directory()?
