I have two huge HDF5 files, each with an index of ids and each containing different information about those ids.
I have read one into a small masked dataset (data), using only a select few ids. I now want to add to the dataset, using information about those select ids from one column ('a') of the second HDF5 file (s_data).
Currently I am having to read through the entire second HDF5 file and select the ids that match, as per:
for i in range(len(data['ids'])):
    print(i)  # progress indicator
    # find the row in s_data whose id matches, and copy its 'a' value across
    data['a'][i] = s_data['a'][s_data['ids'] == data['ids'][i]]
With 190 million ids, this takes an uncomfortably long time. Is there a simpler way to match them? I'm thinking of a pandas-style join, but I can't find a way to make that work with h5py datasets.
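For example, is something along these lines workable? (Untested sketch; it assumes the full 'ids' and 'a' columns of the second file fit in memory once read, that the ids in s_data are unique, and that data will accept a new 'a' column like a dict.)

import pandas as pd

# Read the two columns of the second file into memory once
# (one big sequential read each) instead of scanning per id.
right = pd.DataFrame({'ids': s_data['ids'][:], 'a': s_data['a'][:]})
left = pd.DataFrame({'ids': data['ids'][:]})

# Left-join so every id in `data` picks up its matching 'a' value.
merged = left.merge(right, on='ids', how='left')
data['a'] = merged['a'].to_numpy()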
Many thanks in advance!