
I am attempting to take folders of images and convert them into a single hdf5 file for use in a classification learning model. Each image should be paired with the name of its folder as its classification label. I have 10 folders of images, and they all need to end up in one large hdf5 file.

I have searched online for answers to this question and really can't find anything that helps. The only thing I have seen is how to create one from a single folder, but nothing about multiple folders or how to append to an existing hdf5 file. It doesn't matter what language this is programmed in, as I am only planning on running the script once to create the hdf5 file.

For clarity, the end result should be an hdf5 file with two groups (one for the labels and the other for the data). Each group should have an individual dataset for each image/label, and matching datasets should share the same name/number.

Update: I have found and modified code, found below, that iterates through a folder and creates the dataset for each image. However, I don't know how to add datasets to a specific group within the hdf5 file.

import sys
import glob
import h5py
import cv2

IMG_WIDTH = 30
IMG_HEIGHT = 30

h5file = 'test.h5'

nfiles = len(glob.glob('./*.jpeg'))
print(f'count of image files nfiles={nfiles}')
with h5py.File(h5file, 'w') as h5f:
    for x in range(nfiles):
        convert_num = str(x)
        # one empty (IMG_WIDTH, IMG_HEIGHT, 3) dataset per image
        img_ds = h5f.create_dataset(convert_num, shape=(IMG_WIDTH, IMG_HEIGHT, 3), dtype=int)
    
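As far as I can tell, adding a dataset to a group only requires calling create_dataset on the group object instead of the file handle. A minimal sketch of what I think that looks like (a dummy numpy array stands in for a real image, and the group/dataset names are just placeholders):

```python
import h5py
import numpy as np

# dummy 30x30 RGB array standing in for a decoded image
img = np.zeros((30, 30, 3), dtype=np.uint8)

with h5py.File('grouped.h5', 'w') as h5f:
    data_grp = h5f.create_group('data')    # group names assumed from my description above
    lbl_grp = h5f.create_group('label')
    # create_dataset called on a group places the dataset inside that group
    data_grp.create_dataset('0', data=img)
    lbl_grp.create_dataset('0', data='folder_name')  # h5py stores a Python str as a scalar string dataset

with h5py.File('grouped.h5', 'r') as h5f:
    print(h5f['data']['0'].shape)   # (30, 30, 3)
```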
  • Do the images all have the same size, as shown in your sample code? Are your labels simple integers? – Homer512 May 02 '23 at 07:10

1 Answer


Your request is similar to this question and my answer: Convert a folder comprising jpeg images to hdf5. It shows 2 ways to load the image data:

  • The 1st method loads all of the images to 1 dataset. (creates file 1ds_)
  • The 2nd method loads each image to a different dataset. (creates file nds_)

It does not create groups to hold the image and label datasets, and does not create label datasets. Both are easy to add.

Since you are loading each image (and label) to a different dataset, you don't have to create empty datasets in advance; you simply loop over the images. I created a labels group, but don't load label data because you didn't say what it should be. If you just want the filename, you could add it as an attribute to the image dataset (which is what I did).

  • Create 'images' and 'labels' groups before looping on images
  • Use glob.iglob() to loop over images. In the loop:
    • Read with cv2.imread()
    • Resize with cv2.resize() (optional, assumes you want all images to be the same size)
    • Copy to the dataset ['images'][image_#]

Modified code from my previous answer:

import glob
import cv2
import h5py

IMG_WIDTH = 30
IMG_HEIGHT = 30
h5file = 'test.h5'
    
with h5py.File('nds_'+h5file,'w') as h5f:
    img_grp = h5f.create_group('images')
    lbl_grp = h5f.create_group('labels')
    for cnt, ifile in enumerate(glob.iglob('./*.jpeg')):
        img = cv2.imread(ifile, cv2.IMREAD_COLOR)
        # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
        # load each image into a separate dataset (image NOT resized)
        img_ds = img_grp.create_dataset('image_'+f'{cnt+1:03}', data=img)
        # OR resize image, then load to dataset 
        # img_resize = cv2.resize( img, (IMG_WIDTH, IMG_HEIGHT) )
        # img_ds = img_grp.create_dataset('image_'+f'{cnt+1:03}', data=img_resize)
        # add image name to the image dataset as an attribute 
        img_ds.attrs['name'] = ifile
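
Since your labels come from the folder names, the loop above can be extended with an outer loop over the class folders. This is only a sketch under assumed names: a throwaway folder tree and dummy numpy arrays stand in for your real folders and for cv2.imread(), and the label is stored as a string dataset with the same number as its image dataset:

```python
import glob
import os
import tempfile

import h5py
import numpy as np

# throwaway folder tree standing in for your 10 class folders
root = tempfile.mkdtemp()
for cls in ('cats', 'dogs'):
    os.makedirs(os.path.join(root, cls))
    for i in range(2):
        open(os.path.join(root, cls, f'img_{i}.jpeg'), 'wb').close()

with h5py.File('labeled.h5', 'w') as h5f:
    img_grp = h5f.create_group('images')
    lbl_grp = h5f.create_group('labels')
    cnt = 0
    for folder in sorted(glob.glob(os.path.join(root, '*'))):
        label = os.path.basename(folder)          # folder name is the class label
        for ifile in sorted(glob.glob(os.path.join(folder, '*.jpeg'))):
            # stand-in for cv2.imread(ifile); real code would load the pixels here
            img = np.zeros((IMG_HEIGHT := 30, 30, 3), dtype=np.uint8)
            name = f'{cnt:03}'
            img_grp.create_dataset(name, data=img)
            lbl_grp.create_dataset(name, data=label)  # matching name in the labels group
            cnt += 1

with h5py.File('labeled.h5', 'r') as h5f:
    print(sorted(h5f['images']))   # ['000', '001', '002', '003']
```

Matching datasets share the same number, so for training you can iterate the 'images' group and look up each label by name.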
kcw78
  • 7,131
  • 3
  • 12
  • 44