Is there a way to convert a folder comprising .jpeg images to hdf5 in Python? I am trying to build a neural network model for classification of images. Thanks!
Short answer: yes. You have to convert the images to NumPy array data (with opencv or another tool). Do you know how to do that? – kcw78 Mar 15 '21 at 03:00
It would be nice if you could suggest pointers for doing that. Essentially, if I can convert each image in my folder to a numpy array, that will be great! – Nanda Mar 15 '21 at 03:06
2 Answers
There are a lot of ways to process and save image data. Here are 2 variations of a method that reads all of the image files in 1 folder and loads them into an HDF5 file. Outline of this process:
- Count the number of images (used to size the dataset).
- Create the HDF5 file (prefixed: 1ds_).
- Create an empty dataset with the appropriate shape and type (integers).
- Use glob.iglob() to loop over the images. For each one: read with cv2.imread(), resize with cv2.resize(), then copy into the dataset with img_ds[cnt:cnt+1:,:,:].

This is ONE way to do it. Additional things to consider:
- I loaded all images into 1 dataset. If you have different size images, you must resize them. If you don't want to resize, save each image in a different dataset (same process, but create a new dataset inside the loop). See the second with/as: block and loop that saves the data to the 2nd HDF5 file (prefixed: nds_).
- I didn't try to capture image names. You could do that with attributes on 1 dataset, or as the dataset names with multiple datasets.
- My images are .ppm files, so you need to modify the glob functions to use *.jpg.
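On the image-name point: a minimal sketch of the attribute approach, storing the source file names on the single dataset. The file name, dataset name, and image names here are made-up placeholders, and tiny zero arrays stand in for real images:

```python
import h5py
import numpy as np

# hypothetical names for illustration only
fnames = ['img_001.jpg', 'img_002.jpg']
arr = np.zeros((2, 4, 4, 3), dtype='uint8')  # 2 tiny placeholder "images"

with h5py.File('names_demo.h5', 'w') as h5f:
    ds = h5f.create_dataset('images', data=arr)
    # h5py stores a list of Python strings as a variable-length string attribute
    ds.attrs['filenames'] = fnames

with h5py.File('names_demo.h5', 'r') as h5f:
    # older h5py versions return bytes for string attributes, so decode defensively
    names = [n.decode() if isinstance(n, bytes) else n
             for n in h5f['images'].attrs['filenames']]
print(names)
```

With this, names[i] tells you which file row i of the 'images' dataset came from.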
Simpler Version Below (added Mar 16 2021):
Assumes all files are in the current folder, AND loads all resized images into one dataset (named 'images'). See the previous code for the second method that loads each image into a separate dataset without resizing.
import sys
import glob
import h5py
import cv2
IMG_WIDTH = 30
IMG_HEIGHT = 30
h5file = 'import_images.h5'
nfiles = len(glob.glob('./*.ppm'))
print(f'count of image files nfiles={nfiles}')
# resize all images and load into a single dataset
with h5py.File(h5file, 'w') as h5f:
    img_ds = h5f.create_dataset('images', shape=(nfiles, IMG_WIDTH, IMG_HEIGHT, 3), dtype=int)
    for cnt, ifile in enumerate(glob.iglob('./*.ppm')):
        img = cv2.imread(ifile, cv2.IMREAD_COLOR)
        # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
        img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
        img_ds[cnt:cnt+1:,:,:] = img_resize
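Once the file is written, the whole dataset reads back as one NumPy array ready to feed a model. A self-contained round-trip sketch of that read, using random arrays in place of real images (the file name 'roundtrip_demo.h5' is arbitrary):

```python
import h5py
import numpy as np

IMG_WIDTH = 30
IMG_HEIGHT = 30
nfiles = 5

# write a small synthetic 'images' dataset, mirroring the layout above
fake = np.random.randint(0, 256, size=(nfiles, IMG_WIDTH, IMG_HEIGHT, 3), dtype='uint8')
with h5py.File('roundtrip_demo.h5', 'w') as h5f:
    h5f.create_dataset('images', data=fake)

# read the whole dataset back into one NumPy array, e.g. as model input
with h5py.File('roundtrip_demo.h5', 'r') as h5f:
    x_train = h5f['images'][:]

print(x_train.shape)  # (5, 30, 30, 3)
```

Slicing with [:] pulls everything into memory at once; for very large files you can read batches instead, e.g. h5f['images'][i:i+batch].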
Previous Code Below (from Mar 15 2021):
import sys
import glob
import h5py
import cv2
IMG_WIDTH = 30
IMG_HEIGHT = 30
# Check command-line arguments
if len(sys.argv) != 3:
    sys.exit("Usage: python load_images_to_hdf5.py data_directory model.h5")
print('data_dir =', sys.argv[1])
data_dir = sys.argv[1]
print('Save images to:', sys.argv[2])
h5file = sys.argv[2]
nfiles = len(glob.glob(data_dir + '/*.ppm'))
print(f'Reading dir: {data_dir}; nfiles={nfiles}')
# resize all images and load into a single dataset
with h5py.File('1ds_'+h5file, 'w') as h5f:
    img_ds = h5f.create_dataset('images', shape=(nfiles, IMG_WIDTH, IMG_HEIGHT, 3), dtype=int)
    for cnt, ifile in enumerate(glob.iglob(data_dir + '/*.ppm')):
        img = cv2.imread(ifile, cv2.IMREAD_COLOR)
        # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
        img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
        img_ds[cnt:cnt+1:,:,:] = img_resize
# load each image into a separate dataset (image NOT resized)
with h5py.File('nds_'+h5file, 'w') as h5f:
    for cnt, ifile in enumerate(glob.iglob(data_dir + '/*.ppm')):
        img = cv2.imread(ifile, cv2.IMREAD_COLOR)
        # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
        img_ds = h5f.create_dataset('images_' + f'{cnt+1:03}', data=img)
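For the second (nds_) layout, each image keeps its own shape and you iterate the dataset names to read them back. A small self-contained sketch of that read, using placeholder zero arrays of different sizes and an arbitrary file name:

```python
import h5py
import numpy as np

# build a small file with one dataset per "image", mirroring the nds_ layout above
sizes = [(20, 30, 3), (40, 10, 3), (15, 15, 3)]
with h5py.File('nds_demo.h5', 'w') as h5f:
    for cnt, shape in enumerate(sizes):
        h5f.create_dataset('images_' + f'{cnt+1:03}', data=np.zeros(shape, dtype='uint8'))

# iterate the datasets back in name order; each keeps its original shape
with h5py.File('nds_demo.h5', 'r') as h5f:
    shapes = [h5f[name].shape for name in sorted(h5f.keys())]
print(shapes)
```

The zero-padded names ('images_001', 'images_002', ...) make sorted() return the datasets in insertion order.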

Thank you very much! I want to resize all images and load them into a single dataset. I have a couple of questions, though. Where/how can I include the path where my folder is present? I am not Python-proficient (yet), so sorry if my question is silly. – Nanda Mar 16 '21 at 01:58
I reused some code that uses command line arguments. The first argument is the folder with the files, and the second is HDF5 filename. This way you can read images from any directory and assign the HDF5 file name and location. I modified my post to simplify the code and ONLY load resized images. Eventually you will want to make it more general purpose. – kcw78 Mar 16 '21 at 16:38
You can solve your issue by doing the following using HDFql in Python (HDFql also supports C, C++, Java, C#, R and Fortran):
import HDFql
cursor = HDFql.Cursor()
folder = "/home/dummy/images/"
HDFql.execute("create and use file images.h5")
HDFql.execute("show file \"%s\"" % folder)
while HDFql.cursor_next() == HDFql.SUCCESS:
    file = HDFql.cursor_get_char()
    print("File found: \"%s\"" % file)
    HDFql.cursor_use(cursor)
    HDFql.execute("show file size \"%s%s\"" % (folder, file))
    HDFql.cursor_next()
    size = HDFql.cursor_get_bigint()
    HDFql.cursor_use_default()
    HDFql.execute("create dataset \"%s\" as opaque(%d) values from binary file \"%s%s\"" % (file, size, folder, file))
HDFql.execute("close file")
For additional info, check the HDFql reference manual and the examples that illustrate its functionality.
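Note that this approach stores each file's raw bytes as HDF5 opaque data, so a consumer has to decode them (e.g. with cv2.imdecode) before training. A sketch of that opaque round-trip using h5py, which maps NumPy void scalars to opaque data; the file name, dataset name, and byte string below are made-up placeholders:

```python
import h5py
import numpy as np

raw = b'\xff\xd8\xff\xe0fake-jpeg-bytes'  # stand-in for a real .jpeg file's contents

# np.void(bytes) is written by h5py as an HDF5 opaque dataset
with h5py.File('opaque_demo.h5', 'w') as h5f:
    h5f.create_dataset('photo.jpg', data=np.void(raw))

# reading yields a NumPy void scalar; .tobytes() recovers the original bytes
with h5py.File('opaque_demo.h5', 'r') as h5f:
    restored = h5f['photo.jpg'][()].tobytes()

print(restored == raw)
# for a real image you would then decode, e.g.:
# img = cv2.imdecode(np.frombuffer(restored, np.uint8), cv2.IMREAD_COLOR)
```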
