0

I'm trying to pass an iterator over a (non-standard) file-like object to a dask.delayed function. When I try to compute(), I get the following message from dask, and the traceback below.

distributed.protocol.pickle - INFO - Failed to serialize 
  ([<items>, ... ], OrderedDict(..)).
Exception: self.ptr cannot be converted to a Python object for pickling

Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 38, in dumps
    result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
  File "stringsource", line 2, in pysam.libcbcf.VariantRecord.__reduce_cython__
TypeError: self.ptr cannot be converted to a Python object for pickling

The corresponding part of the source looks like this:

delayed(to_arrow)(vf.fetch(..), ordered_dict)

vf is the file-like object, and vf.fetch(..) returns the iterator over the records present in the file (this is a VCF file, and I'm using the pysam library to read it). I hope this provides sufficient context.

The message from dask shows the iteration happens during the function call instead of inside the function, which led me to believe maybe passing iterators are not okay. So I did a quick check with sum(range(..)), which seems to work. Now I'm stumped, what am I missing?

Providing a minimal working example for this is a bit difficult. But maybe the following helps.

  1. Download a VCF file (and it's index) from here: say, ALL.chrY*vcf.gz{,.tbi}
  2. pip3 install --user pysam
  3. Open the file: vf = VariantFile('/path/to/file.vcf.gz', mode='r')
  4. Something like this as the iterator: vf.fetch("Y", 2_600_000, 2_700_000)
  5. For the delayed function, you could have an empty loop.
suvayu
  • 4,271
  • 2
  • 29
  • 35

1 Answers1

1

The short answer is: restructure your delayed function such that the file opening stage happens inside the function, and you instead pass the arguments (e.g., path) required to point to that particular file.

If you are interested, you can look into how Dask does this internally, the class dask.bytes.core.OpenFile, which is a serializable thing that defers opening until it is used in a with block. That's one convenient way to do it, but you can probably do something simpler.

mdurant
  • 27,272
  • 5
  • 45
  • 74
  • I did restructure my code to open the file inside the function as a workaround. But that has a cost, it's a rather large gziped file, which now has to be decompressed multiple times. Seeking to the right record is probably okay, since there's an external index. I'll have a look at `OpenFile`, thanks for the pointer. – suvayu Nov 18 '18 at 19:05
  • Yes, gzipped file will not allow parallel/random access, and you might want to try other formats or chunking into several files. – mdurant Nov 18 '18 at 22:28