I have data in a large HDF5 file, and the class I would like to use (`igraph.Graph`) seems to insist on a list of tuples in its init function. I have tried for loops, `list(dataset)`, reading with `read_direct` into an array and calling `.tolist()` on it, and `[mylist.append(tuple(x)) for x in dataset]`. All of them have been too slow to be useful. So far the work has mostly been CPU-bound, although there is some waiting for I/O, and the 40G RAM + 40G swap I am working with can be limiting. It seems strange to me if there is no fast way to do this, but maybe it is a sign that it is time to move to C/C++.

(I know that questions about going from numpy arrays to lists have been asked. My problem is at a large enough scale that those solutions seem to be too slow.)

– Zach Boyd
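For concreteness, a minimal sketch of the attempts described in the question. The file name `edges.h5` and dataset name `edges` are hypothetical stand-ins; the real dataset is an (N, 2) uint32 array of vertex-id pairs.

```python
import h5py
import numpy as np
import igraph

with h5py.File("edges.h5", "r") as f:   # hypothetical file name
    dataset = f["edges"]                # hypothetical dataset name

    # Attempt 1: iterate the dataset directly -- yields a list of
    # per-row numpy arrays, and is slow at this scale.
    edges = list(dataset)

    # Attempt 2: read everything into a preallocated array, then convert.
    arr = np.empty(dataset.shape, dtype=dataset.dtype)
    dataset.read_direct(arr)
    edges = arr.tolist()                # list of lists, not tuples

    # Attempt 3: build tuples row by row (CPU-bound at this scale).
    edges = [tuple(x) for x in arr]

g = igraph.Graph(edges=edges)
```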
  • I don't know about h5py, but starting from a numpy array `A`, how does `A.view(','.join(2*(A.dtype.str,))).ravel().tolist()` compare? – Paul Panzer Dec 01 '18 at 21:07
  • Let me check. Converting to a numpy array takes about 1-2 minutes. – Zach Boyd Dec 01 '18 at 21:08
  • `h5py` loads datasets as `numpy` arrays. It will be hard to get anything faster than `arr = dataset[:]`. `alist = arr.tolist()` is also relatively fast. `list(arr)` is slower because it is making a list of arrays, rather than a list of lists. `alist = [tuple(x) for x in arr.tolist()]` is, in my experience, the fastest way to get a list of tuples. – hpaulj Dec 01 '18 at 21:09
  • @hpaulj I am afraid that is the conclusion I will eventually reach as well. It looks like it will take a minimum of several hours this way. – Zach Boyd Dec 01 '18 at 21:11
  • What's the slow step(s)? Loading the array, or conversion to lists? – hpaulj Dec 01 '18 at 21:12
  • Loading the array to memory with read_direct is only 1-2 minutes. All the other approaches seem to be CPU-bound, so presumably translating all the numpy data to python data, creating pointers, etc. is the bottleneck. – Zach Boyd Dec 01 '18 at 21:13
  • @PaulPanzer so far the approach you gave is memory-bound. I'm not sure if there is a way to tell if it is faster than the other approaches except waiting longer. Does it convert everything to strings and then to lists? – Zach Boyd Dec 01 '18 at 21:18
  • It creates a structured array with dtype `(D, D)` where D is the original dtype. The elements of such an array are records, in this case each consisting of two of the original elements. The potentially neat thing here is that `tolist` converts array elements back to the closest pure Python type available, which in the case of a record is a tuple (sketched after these comments). Correction: record is not the right word, but the principle applies. – Paul Panzer Dec 01 '18 at 21:24
  • That makes sense. In any case, the OS eventually killed the process. I'm not sure why--maybe it asks for large, contiguous memory allocations? – Zach Boyd Dec 01 '18 at 21:35
  • 1
    I'm afraid that is beyond me. Does your machine actually have enough RAM? Because Python tuples are 48 bytes each, not counting their content. And floats are 24 bytes each, so that would add up to 96GB. – Paul Panzer Dec 01 '18 at 21:43
  • The contents are all uint32 in the numpy array, so depending on what Python's int type looks like, this may be slightly better. I didn't know there was so much overhead for each tuple. The array is actually 1,323,236,806 rows long, though... so even if the CPU overhead can be worked out, there must be no in-memory solution here. I may just need to use a different class or move out of Python. The RAM+swap is about 90G. – Zach Boyd Dec 01 '18 at 21:48
  • 1
    Both `int` and `uint32` are 28 bytes on my machine/OS (you can check with sys.getsizeof), Unless you expect a significant amount of duplicate pairs, yeah, I'm afraid there doesn't seem to be an obvious way out in Python. – Paul Panzer Dec 01 '18 at 22:01

0 Answers