
I have two huge hdf5 files, each with an index of ids, and each containing different information about each of those ids.

I have read one into a small masked dataset (data), using only a select few ids. I now want to add to the dataset, using information about those select ids from one column ('a') of the second hdf5 file (s_data).

Currently I am having to read through the entire second HDF5 file and select the ids that match, as per:

for i in range(len(data['ids'])):
    print(i)
    data['a'][i] = s_data['a'][s_data['ids'] == data['ids'][i]]

Now, for 190 million ids, this takes an uncomfortably long time. Is there a simpler way to match them? I'm thinking of a pandas-style join; however, I can't find a way to make this work with h5py datasets.
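For reference, this is a rough sketch of the kind of join I have in mind, done in pandas. It assumes the id and 'a' columns can be pulled into memory as NumPy arrays, which may well not be realistic at 190 million rows:

import pandas as pd
# Load only the columns needed for the join into DataFrames.
# Assumes both id columns fit in memory -- likely not true at this scale.
df  = pd.DataFrame({'ids': data['ids'][:]})
sdf = pd.DataFrame({'ids': s_data['ids'][:], 'a': s_data['a'][:]})
# Left join: keeps every id in data, pulling in 'a' where it matches.
merged = df.merge(sdf, on='ids', how='left')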

Many thanks in advance!


1 Answer


Have you considered PyTables? It is another Python package for reading HDF5 files, and it has fast search algorithms based on OPSI (Optimized Partially Sorted Indexes). Using the .read_where() method with a search condition will simplify the search process and should be faster than the h5py loop.

Your question is similar to another one I answered last week about finding duplicates. You can read my answer here:
Pytables duplicates 2.5 giga rows

Before searching, I would get an array of the unique values in the 'ids' field of 'data' to use in the .read_where() condition when searching 'sdata'. If I understand your process and data correctly, the code would look like this:

import numpy as np
import tables as tb

# Open both HDF5 files:
h5f1 = tb.open_file('yourfile1.h5', 'r')
h5f2 = tb.open_file('yourfile2.h5', 'r')
# Reference the data and sdata tables:
data  = h5f1.root.data
sdata = h5f2.root.sdata

# Step 1: Get a NumPy array of the 'ids' field/column from the data table:
ids_arr = data.read(field='ids')
# Step 2: Get a new array with the unique values only:
uids_arr = np.unique(ids_arr)

# Or, combine Steps 1 and 2 into one line:
uids_arr = np.unique(data.read(field='ids'))

# Step 3a: Loop over the unique id values
for id_test in uids_arr:
    # Step 3b: Get an array with all rows that match this id value;
    #          only returns values from field 'a'
    match_row_arr = sdata.read_where('ids == id_test', field='a')
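If the goal is to carry the matched 'a' values back into data, one way is to collect the matches inside the same loop, keyed by id. This is just a sketch; it assumes each id occurs at most once in sdata:

# Collect the matched 'a' values keyed by id
# (assumes at most one match per id in sdata):
a_by_id = {}
for id_test in uids_arr:
    match_row_arr = sdata.read_where('ids == id_test', field='a')
    if match_row_arr.size > 0:
        a_by_id[id_test.item()] = match_row_arr[0]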
Thank you for your detailed response. That seems to be close to exactly the sort of thing I was looking for. Many thanks for your help! – tom davison Apr 01 '20 at 15:08