I'm trying to pass an iterator over a (non-standard) file-like object to a `dask.delayed` function. When I try to `compute()`, I get the following message from dask, and the traceback below:
```
distributed.protocol.pickle - INFO - Failed to serialize
([<items>, ... ], OrderedDict(..)).
Exception: self.ptr cannot be converted to a Python object for pickling

Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 38, in dumps
    result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
  File "stringsource", line 2, in pysam.libcbcf.VariantRecord.__reduce_cython__
TypeError: self.ptr cannot be converted to a Python object for pickling
```
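The traceback points at `pysam.libcbcf.VariantRecord.__reduce_cython__`, which suggests that the individual records, not just the iterator object, refuse to pickle. A minimal stand-in sketch using only the standard library (the `Record` class here is a hypothetical substitute for pysam's Cython-backed record type, not pysam's actual code):

```python
import pickle

class Record:
    """Hypothetical stand-in for a Cython-backed record holding a raw pointer."""
    def __init__(self):
        self.ptr = object()  # pretend this is a raw C pointer

    def __reduce__(self):
        # Mimic the error Cython raises for unpicklable extension types.
        raise TypeError("self.ptr cannot be converted to a Python object "
                        "for pickling")

# Even a fully materialized list of such records fails to serialize:
try:
    pickle.dumps([Record(), Record()])
except TypeError as e:
    print(e)
```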
The corresponding part of the source looks like this:

```python
delayed(to_arrow)(vf.fetch(..), ordered_dict)
```
`vf` is the file-like object, and `vf.fetch(..)` returns the iterator over the records present in the file (this is a VCF file, and I'm using the `pysam` library to read it). I hope this provides sufficient context.
The message from dask shows that the iteration happens during the function call instead of inside the function, which led me to believe that maybe passing iterators is not okay. So I did a quick check with `sum(range(..))`, which seems to work. Now I'm stumped: what am I missing?
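For what it's worth, the `sum(range(..))` check may pass simply because `range` objects are picklable, while many iterators (generators, C-backed cursors) are not. A standard-library-only sketch of that difference (not using dask itself):

```python
import pickle

# A range object pickles fine, so delayed(sum)(range(..)) serializes cleanly.
assert isinstance(pickle.dumps(range(10)), bytes)

# A generator, like many C-backed iterators, cannot be pickled at all.
def records():
    yield 1
    yield 2

try:
    pickle.dumps(records())
except TypeError as e:
    print(e)  # e.g. "cannot pickle 'generator' object"
```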
Providing a minimal working example for this is a bit difficult, but maybe the following helps.

- Download a VCF file (and its index) from here: say, `ALL.chrY*vcf.gz{,.tbi}`
- Install pysam: `pip3 install --user pysam`
- Open the file: `vf = VariantFile('/path/to/file.vcf.gz', mode='r')`
- Use something like this as the iterator: `vf.fetch("Y", 2_600_000, 2_700_000)`
- For the delayed function, you could have an empty loop over the records.