I'm trying to build a PyTorch project on an IterableDataset with zarr as the storage backend.
from itertools import islice

import zarr
from torch.utils.data import IterableDataset


class Data(IterableDataset):
    def __init__(self, path, start=None, end=None):
        super(Data, self).__init__()
        # Open the zarr array read-only from a directory store on disk.
        store = zarr.DirectoryStore(path)
        self.array = zarr.open(store, mode='r')
        if start is None:
            start = 0
        if end is None:
            end = self.array.shape[0]
        assert end > start
        self.start = start
        self.end = end

    def __iter__(self):
        # Iterate over the rows of the zarr array, restricted to [start, end).
        return islice(self.array, self.start, self.end)
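For completeness, this is roughly how the dataset is consumed; the path, batch size, and worker settings below are placeholders, not my actual configuration.

# Minimal usage sketch (placeholder path and parameters).
from torch.utils.data import DataLoader

dataset = Data('/path/to/data.zarr')          # hypothetical path
loader = DataLoader(dataset, batch_size=256)  # single worker for simplicity

for batch in loader:
    pass  # training step goes here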
This works nicely with small test datasets, but once I move to my actual dataset (480 000 000 x 290) I run into a memory leak. I've tried logging the Python heap periodically as everything slows to a crawl, but I couldn't see anything growing abnormally in size, so the library I used (pympler) didn't actually catch the memory leak.
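The heap logging looked roughly like the sketch below; where and how often it is called is illustrative, not my exact setup.

# Rough sketch of the periodic heap logging with pympler.
from pympler import muppy, summary

def log_heap():
    all_objects = muppy.get_objects()                 # snapshot of all tracked Python objects
    summary.print_(summary.summarize(all_objects))    # print per-type counts and sizes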
I'm at my wits' end, so if anybody has any idea how to debug this further, it would be greatly appreciated.
Cross-posted on PyTorch Forums.