How to split class from Python generator

Question

I am working on the PlantVillage dataset for a plant disease classification task. The output is multiclass, and a typical class name is 'tomato___healthy'. I want to split each class name so that every sample has two labels (in the example above, tomato as one label and healthy as the other) for multi-task learning. Below is what I have tried.

First, I defined the batch size and used `image_dataset_from_directory` to load the data and label it automatically.

    BATCH_SIZE = 32
    IMG_SIZE = (255, 255)
    
    data_dir = "/content/plantvillage dataset/color"
    train_dataset = image_dataset_from_directory(data_dir,
                                                 shuffle=True,
                                                 label_mode = 'categorical',
                                                 validation_split = 0.2,
                                                 batch_size=BATCH_SIZE,
                                                 seed = 42,
                                                 subset = "training",
                                                 image_size=IMG_SIZE)
    
    validation_dataset = image_dataset_from_directory(data_dir,
                                                 shuffle=True,
                                                 label_mode = 'categorical',
                                                 validation_split = 0.2,
                                                 batch_size=BATCH_SIZE,
                                                 seed = 42,
                                                 subset = "validation",
                                                 image_size=IMG_SIZE)

Second, I tried to use the code below to fetch the data and also the labels:

    y = np.concatenate([y for x, y in validation_dataset], axis=0)
    x = np.concatenate([x for x, y in validation_dataset], axis=0)

Finally, I want to use this generator to produce two labels for each sample:

    def generate_data(x, y, batch_size=32):
      num_examples = len(y)

      while True:
        x_batch = np.zeros((batch_size, 255, 255, 3))
        y_batch = np.zeros((batch_size,))
        c_batch = np.zeros((batch_size,))

        for i in range(0, batch_size):
          index = np.random.randint(0, num_examples)
          image, specie, disease = x[index], y[index].split('___')[0], y[index].split('___')[1]
          x_batch[i] = image
          y_batch[i] = specie
          c_batch[i] = disease

        yield x_batch, [y_batch, c_batch]

My file structure is as follows:

    color/
      -Tomato___healthy/
         - iweoqwd.jpg
         - weqwjeh.jpg
      -Tomato___Tomato_Yellow_Leaf_Curl_Virus/
         - iweoqwd.jpg
         - weqwjeh.jpg

I am stuck on the second step because it crashes when it runs out of memory. Concatenating the whole dataset into NumPy arrays does not seem to fit in RAM. How can I overcome this, and is there an easier way to split each class into two labels for every sample?
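One possible way around the memory problem is to skip `np.concatenate` entirely and convert the labels one batch at a time. The sketch below is only an illustration under assumptions: `build_lookups` and `two_label_batches` are hypothetical helper names, the class list is a tiny stand-in for the real folder names, and the fake batch replaces the real images. It relies on the labels being one-hot (as with `label_mode='categorical'`), so the class index is recovered with `argmax` and then looked up in two small tables.

```python
import numpy as np

def build_lookups(class_names):
    """Map each combined class index to a species index and a disease index."""
    species = sorted({name.split("___")[0] for name in class_names})
    diseases = sorted({name.split("___")[1] for name in class_names})
    species_of = np.array([species.index(n.split("___")[0]) for n in class_names])
    disease_of = np.array([diseases.index(n.split("___")[1]) for n in class_names])
    return species, diseases, species_of, disease_of

def two_label_batches(batches, species_of, disease_of):
    """Lazily yield (images, (species_idx, disease_idx)), one batch at a time."""
    for images, one_hot in batches:
        # One-hot row -> index of the combined class (e.g. 'Tomato___healthy').
        class_idx = np.argmax(np.asarray(one_hot), axis=-1)
        yield np.asarray(images), (species_of[class_idx], disease_of[class_idx])

# Stand-in data (assumption): three classes and one small batch of blank images.
class_names = [
    "Apple___healthy",
    "Tomato___Tomato_Yellow_Leaf_Curl_Virus",
    "Tomato___healthy",
]
fake_batches = [(np.zeros((3, 8, 8, 3)), np.eye(3))]

species, diseases, species_of, disease_of = build_lookups(class_names)
images, (species_idx, disease_idx) = next(
    two_label_batches(fake_batches, species_of, disease_of)
)
```

With the real data, `class_names` would presumably come from `validation_dataset.class_names`, and `batches` could be `validation_dataset.as_numpy_iterator()`, so only one batch is held in memory at a time.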

To support customization, I would suggest using the `tf.data.Dataset` API. With the `map` function you will be able to tweak the labels as you want. https://stackoverflow.com/a/63459031/10319735 might help you with the first steps. Here is how you use map: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map - Aritra Roy Gosthipaty, Nov 25 '21 at 06:28
@AritraRoyGosthipaty Thank you for your suggestion, I will have a look. Thanks again - Hanyi Koh, Nov 25 '21 at 06:54
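To make the `map` suggestion concrete, here is a minimal sketch of relabeling inside the `tf.data` pipeline, so nothing has to be pulled into NumPy arrays. It is an illustration under assumptions, not a tested solution for the full dataset: `split_labels`, `species_of_class` and `disease_of_class` are hypothetical names, and the class list and the blank images are stand-ins for the real `train_dataset`.

```python
import tensorflow as tf

# Stand-in for train_dataset.class_names (assumption: folder names sorted
# the same way image_dataset_from_directory orders them).
class_names = [
    "Apple___healthy",
    "Tomato___Tomato_Yellow_Leaf_Curl_Virus",
    "Tomato___healthy",
]

species = sorted({n.split("___")[0] for n in class_names})
diseases = sorted({n.split("___")[1] for n in class_names})

# For every combined class index, the matching species and disease index.
species_of_class = tf.constant(
    [species.index(n.split("___")[0]) for n in class_names], dtype=tf.int64
)
disease_of_class = tf.constant(
    [diseases.index(n.split("___")[1]) for n in class_names], dtype=tf.int64
)

def split_labels(images, one_hot):
    """Turn one-hot combined labels into (species, disease) integer labels."""
    class_idx = tf.argmax(one_hot, axis=-1)
    return images, (
        tf.gather(species_of_class, class_idx),
        tf.gather(disease_of_class, class_idx),
    )

# Stand-in dataset: three blank images, one per class, in a single batch.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([3, 8, 8, 3]), tf.one_hot([0, 1, 2], depth=3))
).batch(3)

multi_task_dataset = dataset.map(split_labels)
images, (species_idx, disease_idx) = next(iter(multi_task_dataset))
```

On the real data the same call would be `train_dataset.map(split_labels)`. Because the new labels are integers rather than one-hot vectors, a loss such as sparse categorical cross-entropy on each of the two model outputs would be the natural match.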
