
My question is about creating an object type or document for .hdf5 files. Each object will have three attributes: an id, a user_id, and a boolean array of size 64. I have to create about 10,000,000 (ten million) of them.

Think of MongoDB documents; I have to use these objects in a similar way. I need to run queries for objects with a particular user_id as well as over all of them.

Any suggestions and help are appreciated.

Martin Evans
A Ef

1 Answer


I would go ahead and use a dictionary-style layout for this case; dictionaries scale up well. Since the query would be on user_id, I would make it the key.

The structure would look like this:

{
    'user_id-xyz': {
        'id': 'id-1212',
        'boolarray': [True, False, ...],
    },
    'user_id-abc': {
        ...
    }
}

In order to achieve this in HDF5, I would go for a custom NumPy compound datatype.

import numpy as np
import h5py

# compound type: 15-byte string id + 64 one-byte booleans = 79 bytes per record
element = np.dtype([('id', 'S15'), ('boolarray', '?', (64,))])

f = h5py.File('foo.hdf5', 'w')
# one flat dataset holding every record (ten million rows)
dset = f.create_dataset("blocky", (10000000,), dtype=element)

# one group per user_id, holding a sub-dataset of records
grp = f.create_group("user_id-xyz")
subdataset = grp.create_dataset('ele', (1,), dtype=element)

# test of membership
'user_id-xyz' in f
# retrieval
f.get('user_id-xyz')
# all keys
f.keys()
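
To connect this back to the query patterns in the question, here is a minimal sketch of how records could be written and read back with a layout like the one above. The extra user_id field in the dtype (so the flat dataset can be filtered without walking groups) and the variable names are my additions for illustration, not something prescribed by h5py.

import numpy as np
import h5py

# assumed layout: 15-byte id, 15-byte user_id, 64 booleans per record
record_dtype = np.dtype([('id', 'S15'),
                         ('user_id', 'S15'),
                         ('boolarray', '?', (64,))])

with h5py.File('foo.hdf5', 'w') as f:
    flat = f.create_dataset('blocky', (10000000,), dtype=record_dtype,
                            chunks=True)

    # one record, stored both in a per-user group and in the flat dataset
    rec = np.zeros((1,), dtype=record_dtype)
    rec['id'] = b'id-1212'
    rec['user_id'] = b'user_id-xyz'
    rec['boolarray'][0, :2] = [True, False]

    grp = f.require_group('user_id-xyz')
    grp.create_dataset('ele', data=rec)
    flat[0:1] = rec

with h5py.File('foo.hdf5', 'r') as f:
    # query a particular user_id directly via its group
    print(f['user_id-xyz/ele'][0])

    # or scan the flat dataset in slices and filter on user_id
    block = f['blocky'][:1000]  # read a slice, never all ten million rows at once
    hits = block[block['user_id'] == b'user_id-xyz']
    print(len(hits))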

Overall, I hope this helps you.

Vasif
  • Ok, but how do I store these objects? As far as I understand, HDF5 uses NumPy arrays for storage. – A Ef Aug 24 '16 at 11:13
  • Okay, I didn't really notice the h5py tag. But then, looking at the docs, I would use the user_id as a group, and within that I would have a bool array, with the first n digits representing the id. What is the size of the id? – Vasif Aug 24 '16 at 11:30
  • 2^10 or 2^15 most likely. – A Ef Aug 24 '16 at 11:33
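
Following up on the comment thread above, here is one way the "user_id as a group, first n digits representing the id" idea could look, reading it as packing the id into the leading bits of the bool array: each user_id becomes a group holding a growable boolean dataset whose first bits encode the id and whose remaining 64 bits are the payload. The 15-bit id width and the helper pack_id_bits are assumptions based on the 2^15 figure in the comments, not part of the answer itself.

import numpy as np
import h5py

ID_BITS = 15  # assumed: ids stay below 2**15

def pack_id_bits(record_id):
    # encode an integer id as a 15-element boolean vector, most significant bit first
    return np.array([(record_id >> shift) & 1
                     for shift in range(ID_BITS - 1, -1, -1)], dtype=bool)

with h5py.File('foo.hdf5', 'a') as f:
    grp = f.require_group('user_id-xyz')
    # one growable (N, 79) boolean dataset per user: 15 id bits + 64 payload bits
    dset = grp.require_dataset('bits', shape=(1, ID_BITS + 64),
                               maxshape=(None, ID_BITS + 64), dtype=bool)
    dset[0] = np.concatenate([pack_id_bits(1212), np.zeros(64, dtype=bool)])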