Back in awkward v0 it was possible to do this:
import awkward
dog = awkward.fromiter([[1., 2.], [5.]])
cat = awkward.fromiter([[4], [3]])
dict_of_arrays = {'dog': dog, 'cat': cat}
awkward.save("pets.awkd", dict_of_arrays)
Then we could lazy-load the array:
reloaded_data = awkward.load("pets.awkd")
# no data in ram
double_dog = reloaded_data["dog"]*2
# dog is in ram but not cat
In short, we have a dataset consisting of 'dog' and 'cat' parts. The whole dataset saves to one file on disk. Even without any documentation, it would be obvious which data is dog and which is cat. Dog and cat load as awkward arrays. I can load the data and work with just one part without the other part ending up in RAM.
I'm looking for the best way to do this in awkward v1. The requirements I would like to meet are:
- The data consists of multiple named parts, with irregular shapes.
- All items in one named part have the same data type, different parts may have different data types.
- Some sort of lazy loading needs to be possible, so I can work on parts of the data as awkward1 arrays without loading the whole thing.
- Ideally, the names of the parts are unambiguously associated with the data for each part. Dict structure is good for this, but other things could work.
- Ideally, the whole dataset saves and loads from one file without speed penalty.
- Ideally, when the array is loaded it has the right type, so in the example dog is a float array and cat is an int array.
I had a look at awkward1.to_parquet, and while it looks good, it seems to be just for saving one array. This doesn't fit well with the need to hold multiple data types, and I'm not sure how I'd record the column names.
I suppose I could convert back to awkward v0 and save that way, but I'm not sure how that would play with lazy loading. It might be that I need to write a wrapper to do these things, which would be totally fine, but I wanted to check first whether there is something built in that I should know about.
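To make the requirements concrete, here is a rough sketch of the kind of wrapper I had in mind. It uses only the standard library and stores nested lists as JSON rather than real awkward arrays, so it is an illustration of "one file, named parts, load on demand" and not something built into awkward. The names LazyParts and save_parts are made up for this example.

```python
import json
import os
import tempfile
import zipfile


class LazyParts:
    """Hypothetical read-only view of a zip file with one JSON member per named part."""

    def __init__(self, path):
        self._path = path
        self._cache = {}
        with zipfile.ZipFile(path) as archive:
            # Only the member names are read here, not the data itself.
            self._names = [n[:-len(".json")] for n in archive.namelist()]

    def keys(self):
        return list(self._names)

    def loaded(self):
        """Names of the parts that have been read into memory so far."""
        return sorted(self._cache)

    def __getitem__(self, name):
        # Read and decode a part the first time it is requested, then cache it.
        if name not in self._cache:
            with zipfile.ZipFile(self._path) as archive:
                self._cache[name] = json.loads(archive.read(name + ".json"))
        return self._cache[name]


def save_parts(path, parts):
    """Write each named part as its own member of a single zip file."""
    with zipfile.ZipFile(path, "w") as archive:
        for name, data in parts.items():
            archive.writestr(name + ".json", json.dumps(data))


path = os.path.join(tempfile.mkdtemp(), "pets.zip")
save_parts(path, {"dog": [[1.0, 2.0], [5.0]], "cat": [[4], [3]]})

pets = LazyParts(path)
double_dog = [[2 * x for x in row] for row in pets["dog"]]
print(pets.keys())    # part names are known without loading any data
print(pets.loaded())  # only the parts that were actually accessed
```

Because each part is a separate member of the archive, asking for dog never touches the bytes of cat, and JSON keeps the float/int distinction for this small example. A real solution would need a format that stores awkward's own types.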
Edit: the answer given works great. For completeness, I wanted to leave an example of using it:
In [1]: import awkward1 as ak
In [2]: dog = ak.from_iter([[1., 2.], [5.]])
   ...: cat = ak.from_iter([[4], [3]])
In [3]: ak.zip?
In [4]: pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)
In [5]: pets.dog
Out[5]: <Array [[1, 2], [5]] type='2 * var * float64'>
In [6]: pets.cat
Out[6]: <Array [[4], [3]] type='2 * var * int64'>
In [7]: ak.to_parquet(pets, "pets.parquet")