1

I have 1,000+ very long MATLAB vectors (varying lengths, ~10^8 samples) representing data from different patients and sources. I want to organize them compactly in one file for convenient access later in Python. I want each sample to carry additional information (patient ID, sampling frequency, etc.).

Order should be:

Hospital 1:
   Pat. 1:
      vector:sample 1
      vector:sample 2

   Pat. 2:
      vector:sample 1
      vector:sample 2


Hospital 2:
   Pat. 1:
      vector:sample 1
      vector:sample 2
    .
    .
    .

I thought about converting the samples to HDF5 files, adding metadata, and then merging the HDF5 files into a single file, but I'm facing difficulties.

already tried:

Open for suggestions!

Shlomi Shmuel
  • So you have thousands of files, each hundreds of megabytes in size, and you thought it would be more convenient to combine them all into a single file of hundreds of gigabytes? What are you going to do with a file that size? – Cris Luengo Jan 01 '20 at 23:14
  • Well, yes! It's going to be used as a training set for ML, and the most convenient way to transfer the data would be a single HDF5 file, which will be read in chunks. – Shlomi Shmuel Jan 02 '20 at 17:35

2 Answers

1

Given the layout above, you may want to store the vectors in a matrix. For a patient sample with hospital: 2, pat_ID: 3455679, age: 34, high_blood_pressure: NO (binary 0), you could store the metadata fields "Hospital number", "patient ID", "age", "high_blood_pressure", ... as 2, 3455679, 34, 0, ...

a = [1:10]' %vector 1
b = [1:10]' %vector 2
c = [a,b]   %matrix holding vectors 1 and 2
Σ baryon
  • Could you please elaborate the proper way to insert strings, doubles and vectors into a single table/cell/dataframe? I'd like this file to be opened later in python, and taking into account that a single patient might have 3 vector samples. – Shlomi Shmuel Jan 01 '20 at 22:03
  • With more vectors, c = [a,b,c,d,e,f,g], with every letter being a column vector of the same length, will give you a matrix whose columns correspond to patient samples. To add anything else to the matrix, store it in another vector, say z, and do c = [c,z] to append it as the final column; you can also be more specific and add it to any row or column of the matrix you wish. – Σ baryon Jan 01 '20 at 22:10
0

I see at least 2 approaches with HDF5. You can copy all of your data into a single file. Gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you could save Patient data in separate files, and use External Links to point to the data from a central HDF5 file. After you create the links, you can access the data "as-if" it's in that file. Both methods are shown below with small, simple "samples" created using NumPy random. Each sample is a single dataset, and includes attributes with the Hospital, Patient and Sample ID.

Method 1: All data in a single file

import h5py
import numpy as np

num_h = 3
num_p = 5
num_s = 2

with h5py.File('SO_59556149.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            for s_cnt in range(num_s):
                ds_name = 'H_' + str(h_cnt) + \
                          '_P_' + str(p_cnt) + \
                          '_S_' + str(s_cnt)
                # Create sample vector data and add to a dataset
                vec_arr = np.random.rand(1000,1)
                dset = h5f.create_dataset(ds_name, data=vec_arr )
                # add attributes of Hospital, Patient and Sample ID
                dset.attrs['Hospital ID']=h_cnt
                dset.attrs['Patient ID']=p_cnt
                dset.attrs['Sample ID']=s_cnt
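
The flat dataset names above could also be arranged as nested groups, which is closer to the Hospital > Patient > sample order in the question. The following is only a minimal sketch under assumptions: the group names (`Hospital_0`, `Pat_0`, `sample_0`), the `Sampling freq` attribute with its placeholder value, the chunk size and the temporary file name are all made up for illustration and are not part of the original answer.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical output file in a temporary directory (illustration only)
fname = os.path.join(tempfile.mkdtemp(), 'grouped_sketch.h5')

with h5py.File(fname, 'w') as h5f:
    for h_cnt in range(2):
        for p_cnt in range(3):
            # A slash-separated path creates the hospital group as needed
            grp = h5f.create_group(f'Hospital_{h_cnt}/Pat_{p_cnt}')
            grp.attrs['Patient ID'] = p_cnt
            for s_cnt in range(2):
                vec_arr = np.random.rand(1000)
                # chunks=(100,) is an arbitrary choice for this small example
                dset = grp.create_dataset(f'sample_{s_cnt}',
                                          data=vec_arr, chunks=(100,))
                # Placeholder value; real data would carry its own rate
                dset.attrs['Sampling freq'] = 256.0

with h5py.File(fname, 'r') as h5f:
    dset = h5f['Hospital_1/Pat_2/sample_0']
    # Slicing a dataset reads only the requested part from disk
    first_block = dset[:100]
    # Number of patient groups stored under one hospital
    n_patients = len(h5f['Hospital_0'])
```

With this layout, metadata shared by all samples of a patient can sit on the patient group, while per-sample metadata stays on each dataset.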

Method 2: External links to Patient data in separate files

import h5py
import numpy as np

num_h = 3
num_p = 5
num_s = 2

with h5py.File('SO_59556149_link.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            fname = 'SO_59556149_' + 'H_' + str(h_cnt) + '_P_' + str(p_cnt) + '.h5'
            # One file per patient; the with-block closes it after the sample loop
            with h5py.File(fname, 'w') as h5f2:
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr )
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
                    # Link from the central file to the dataset in the patient file
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
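
For reading the data back in Python, one possible pattern is to select datasets by their attributes and then slice them block by block, so that a full vector never has to be loaded at once. The sketch below is an illustration under assumptions, not part of the original answer: it first builds a small stand-in file with the same naming scheme and attributes, and the file name, the block size `block_len` and the filter on `Hospital ID == 1` are hypothetical choices.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file in a temporary directory (illustration only)
fname = os.path.join(tempfile.mkdtemp(), 'read_sketch.h5')

# Build a small file using the same dataset names and attributes as above
with h5py.File(fname, 'w') as h5f:
    for h_cnt in range(2):
        for p_cnt in range(2):
            ds_name = f'H_{h_cnt}_P_{p_cnt}_S_0'
            dset = h5f.create_dataset(ds_name, data=np.random.rand(1000, 1))
            dset.attrs['Hospital ID'] = h_cnt
            dset.attrs['Patient ID'] = p_cnt
            dset.attrs['Sample ID'] = 0

block_len = 256   # assumed block size; tune to the available memory
selected = []     # names of the datasets that pass the filter
n_rows_read = 0   # total number of rows read across all blocks

with h5py.File(fname, 'r') as h5f:
    for name in h5f:
        dset = h5f[name]
        # Keep only the samples from one hospital
        if dset.attrs['Hospital ID'] != 1:
            continue
        selected.append(name)
        # Each slice reads just that block from disk
        for start in range(0, dset.shape[0], block_len):
            block = dset[start:start + block_len]
            n_rows_read += block.shape[0]
```

The same loop should apply to the linked file from Method 2, since external links are accessed "as-if" the data were in the central file, provided the per-patient files remain reachable at the stored paths.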
kcw78