
The more I work with Python, the more I want to do things the Pythonic way, i.e. avoid isinstance checks and the like. I am developing a framework for scientific parameter exploration for numeric simulations.

I work with two hard constraints: First, I need to be able to repeat a specific simulation I stored to disk exactly the way it was run the first time. Second, the data should be nice and readable in HDF5/PyTables format.

These two requirements are somewhat at odds. As soon as I store data via PyTables, some information is lost. For instance, after reloading, a Python int becomes a numpy.int64. This can be problematic when rerunning numerical simulations, because there are cases where multiplying a variable by an int works fine but breaks with a numpy.int64. This happens, for example, with the spiking neural network simulator BRIAN.
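
To make the problem concrete, here is a minimal sketch of the round trip (assuming the PyTables >= 3.0 API and a 64-bit platform; the file name is just for illustration):

    import tables

    value = 5  # a plain Python int
    with tables.open_file('demo.h5', 'w') as f:
        f.create_array('/', 'value', [value])   # stored with an int64 dtype

    with tables.open_file('demo.h5', 'r') as f:
        loaded = f.root.value.read()[0]

    print(type(value))    # int
    print(type(loaded))   # numpy.int64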

So a straightforward answer would be to simply pickle everything and get it back as it was before. Yet this doesn't work well with the idea of readable HDF5 array or table data.

So what I have in mind is to store the original data type as an HDF5 attribute of the corresponding table or array, and to use this information to reconstruct the original data. First, is that a good idea in general, and second, what is the most Pythonic way to do that?

For instance, what about this:

    original_class_name = data.__class__.__name__

    # do the storage of data and the original_class_name
    {...}

    # load the hdf5data and reconstruct it
    reconstructed_data = eval(original_class_name + '(hdf5data)')

To me this seems too hacky.
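
A slightly less hacky variant I have been considering is to restrict reconstruction to an explicit whitelist of accepted types, so it becomes a plain dictionary lookup instead of eval (just a sketch; the names are placeholders):

    # Sketch: map stored type names to constructors that are explicitly allowed.
    ACCEPTED_TYPES = {
        'int': int,
        'float': float,
        'bool': bool,
        'str': str,
        'list': list,
        'tuple': tuple,
    }

    def reconstruct(original_class_name, hdf5data):
        try:
            constructor = ACCEPTED_TYPES[original_class_name]
        except KeyError:
            raise TypeError('Cannot restore data of type %s' % original_class_name)
        return constructor(hdf5data)

But maybe that is just a more verbose kind of type checking, which is what I wanted to avoid in the first place.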

SmCaterpillar
  • Why not serialize the objects with `pickle`, and have an extra file in hdf5 format for those who need the human readable version? – Michael Foukarakis Sep 09 '13 at 09:43
  • This would be too much overhead. I am talking here about data in the gigabyte to terabyte range. This makes me wonder anyway, whether the whole issue of reconstruction and type-checking is useful when it comes to very large data. I handle everything right now by being extremely restrictive, i.e. by type checking of data and refusing it directly in the beginning if it does not belong to a set of very native data (numpy and python native stuff). – SmCaterpillar Sep 09 '13 at 10:00
  • At any rate, proper serialization should take precedence, I think. If you want human-readable output, you can easily produce it from a serialized set of objects (just read them in and pretty-print them). – Michael Foukarakis Sep 09 '13 at 10:16
  • I agree, pickle is the wrong way to do this. However, your problem seems to be of disparate data types. Python ints are 32 bit yet you are storing and reloading them as 64 bit ints. Does the code work if you store and reload them as numpy.int32 instead? I bet it would. – Anthony Scopatz Sep 12 '13 at 00:28

0 Answers