
I have a dataset structured like this:

dataset = [{"sample": <numpy array, shape (2048, 3)>, "category": "Cat"}, ...]

Each element of the list is a dictionary with a "sample" key whose value is a numpy array of shape (2048, 3), and a "category" key giving the class of that sample. The dataset length is 8000.

I tried to save it as JSON, but the encoder failed because it can't serialize numpy arrays.

What's the best way to save this list? I can't use np.save("file", dataset) because the elements are dictionaries, and I can't use JSON because of the numpy arrays. Should I use HDF5? What format should I use if I have to use the dataset for machine learning? Thanks!
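
For reference, here is a minimal sketch that reproduces the error (3 entries instead of 8000, random values in place of my real samples):

import json
import numpy as np

# Small stand-in for the real dataset: each entry holds a (2048, 3) array
dataset = [{"sample": np.random.rand(2048, 3), "category": "Cat"}
           for _ in range(3)]

# json has no encoder for ndarrays, so this fails
try:
    json.dumps(dataset)
except TypeError as e:
    print(e)  # Object of type ndarray is not JSON serializable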

  • How about turning the `np.arrays` into lists with `.tolist()`? Should be able to save it as JSON afterwards. – ouroboros1 Aug 26 '22 at 10:34
  • @ouroboros1 No, because later I have to use it as a numpy array. – Klinda Aug 26 '22 at 10:41
  • Is it expensive to convert a (2048,3) list back to numpy? Maybe .tolist() is the only solution. – Klinda Aug 26 '22 at 13:49
  • What about the to_json method from Pandas? How can I use it? – Klinda Aug 26 '22 at 14:10
  • `pickle` and `np.savez` can be used to save multiple arrays. – hpaulj Aug 26 '22 at 14:17
  • In the end to_json calls tolist() on the numpy array, so probably the best option is to convert it to a list and then convert it back to numpy afterwards (see the sketch after these comments). – Klinda Aug 26 '22 at 14:27
  • There's no best way. There are just ways that satisfy your requirements and ways that don't. Without knowing what your requirements are beyond the trivial, or what you've already investigated, it's next to impossible to help you make an informed choice. – Mad Physicist Aug 27 '22 at 16:45
  • For example, you could add keys for the shape and dtype and dump the data to a bytes object. Then JSON might be more accepting, although you might have to apply some additional encoding to get around UTF-8 restrictions – Mad Physicist Aug 27 '22 at 16:47
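
Pulling the comment suggestions together, a minimal sketch of both routes (the .tolist() JSON round trip and np.savez); the file names and the one-element dataset are placeholders:

import json
import numpy as np

dataset = [{"sample": np.random.rand(2048, 3), "category": "Cat"}]

# JSON route: tolist() on the way out, np.array() on the way back in
serializable = [{"sample": d["sample"].tolist(), "category": d["category"]}
                for d in dataset]
with open("dataset.json", "w") as f:
    json.dump(serializable, f)

with open("dataset.json") as f:
    restored = [{"sample": np.array(d["sample"]), "category": d["category"]}
                for d in json.load(f)]

# np.savez route: stack the samples into one (N, 2048, 3) array and
# keep the categories as a parallel array of strings
np.savez("dataset.npz",
         samples=np.stack([d["sample"] for d in dataset]),
         categories=np.array([d["category"] for d in dataset]))
with np.load("dataset.npz") as npz:
    samples, categories = npz["samples"], npz["categories"]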

1 Answer


Creating an example specific to your data requires more details about the dictionaries in the list. I created an example that assumes every dictionary has:

  • A unique value for the "category" key (this value is used as the dataset name).
  • A "sample" key whose value is the array you want to save.

The code below creates some data, loads it into an HDF5 file with the h5py package, then reads the data back into a new list of dictionaries. It is a good starting point for your problem.

import numpy as np
import h5py

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"}, 
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Fish"},
           ]

# Create the HDF5 file with "category" as dataset name and "sample" as the data
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for ds_dict in dataset:
        h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])

# Retrieve the HDF5 data with "category" as dataset name and "sample" as the data
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints name and dataset attributes
        print(h5f[ds_name][()]) # prints the dataset values (as an array) 
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":ds_name})

Here is a second method for when the category values aren't unique.

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)
arr4 = np.arange(3*a0*a1,4*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"}, 
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Cat"},
           {"sample":arr4, "category":"Dog"}
           ]

# Create the HDF5 file with a counter-based dataset name and "sample" as the data
# "category" is saved as a dataset attribute
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for i, ds_dict in enumerate(dataset):
        ds = h5f.create_dataset(f'ds_{i:04}', data=ds_dict["sample"])
        ds.attrs["category"] = ds_dict["category"]

# Retrieve the HDF5 data with  "sample" as the data and "category" from the attribute
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints name and dataset attributes
        print(h5f[ds_name].attrs["category"]) # prints the category attribute
        print(h5f[ds_name][()]) # prints the dataset values (as an array) 
        
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":h5f[ds_name].attrs["category"]})
  • `h5f.create_dataset(ds_dict["category"], data=ds_dict["pointcloud"]) ValueError: Unable to create dataset (name already exists)` – the category is not unique; there can be more samples with the same category. – Klinda Aug 26 '22 at 22:52
  • Yes, as I noted, my answer assumed unique category values (because they are used as the dataset names). I added a second method to my answer that creates unique dataset names and saves the category as a dataset attribute. – kcw78 Aug 27 '22 at 16:40
  • Thanks, now it works. Where can I find some resources about HDF5? It seems there aren't many code examples. The JSON file for that dataset is 1 GB; with HDF5 it is 370 MB, and it stores the numpy arrays directly. Thank you very much. – Klinda Aug 28 '22 at 16:03
  • The JSON vs HDF5 size advantage is text vs binary data. (HDF5 is binary.) HDF5 also has options for compression; I didn't use it here (see the sketch at the end of this thread). It will reduce the size, but may increase I/O time. No need if your file is only 370 MB. To learn about HDF5, first read [The HDF Group "Introduction to HDF5"](https://portal.hdfgroup.org/display/HDF5/Learning+HDF5) to learn the basics of the data schema. Then, read the [h5py Quick Start Guide](https://docs.h5py.org/en/stable/quick.html) to work with HDF5 using Python. After you get started, StackOverflow has a lot of answers you will find useful. – kcw78 Aug 28 '22 at 16:54
  • Thanks, is it possible to read HDF5 with pyspark? I saw they did some implementation, but you need to accept a license and they provide you the link. Or do I have to change to another format, for example Parquet, to make it easier to read? – Klinda Aug 30 '22 at 19:19
  • I have no idea. I have never used pyspark. If you are familiar with numpy, h5py has the most intuitive implementation to access HDF5 data (IMHO). – kcw78 Aug 31 '22 at 01:14
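
As a footnote to the compression comment above, a minimal sketch of the answer's second method with gzip enabled; the file name, dataset name, and level-4 setting are illustrative (gzip is one of h5py's built-in compression filters):

import numpy as np
import h5py

arr = np.random.rand(2048, 3)

# compression_opts is the gzip level (0-9): higher means a smaller file
# but slower reads and writes
with h5py.File("compressed.h5", "w") as h5f:
    ds = h5f.create_dataset("ds_0000", data=arr,
                            compression="gzip", compression_opts=4)
    ds.attrs["category"] = "Cat"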