
This issue is tricky and hard to explain clearly, because it is not quite deterministic and I can't post the code in full. There are a couple of variations; some work and others don't, and I can't see which difference in the code has any bearing on the issue.

I have 1000 Parquet files in Google Cloud Storage, each about 17 MB in size. I loop through the blobs, instantiate a ParquetFile object on each, and print out a little info. After a few hundred files, it fails with this:

  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 319, in __init__
    source = filesystem.open_input_file(source)
  File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetObjectMetadata: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)

My code creating each ParquetFile is roughly this:

from pyarrow.fs import GcsFileSystem
from pyarrow.parquet import ParquetFile
path = "gs://..."
fs = GcsFileSystem(access_token='...', credential_token_expiration=...)
file = ParquetFile(path, filesystem=fs)

Any pointers are welcome! Thanks!!

zpz
  • Do you get a different result if, instead of creating ParquetFile instances for each path, you use pyarrow.parquet.read_table("path/to/data.parquet", filesystem=fs) to open each file? – amoeba Apr 14 '23 at 16:30
  • I did not try that because my design is to not load the whole file, but rather only get the meta info via ParquetFile first. I can try `read_table` a little later. – zpz Apr 14 '23 at 16:40
  • @amoeba I tried `read_table` after `ParquetFile` failed and got the same error. Apparently `read_table` eventually reaches the same point in the code. – zpz Apr 15 '23 at 21:09
  • Either some of your files are corrupted or there's an issue with GCS. You may want to copy the files locally and try to open them to isolate where the issue comes from. – 0x26res Apr 18 '23 at 07:55
  • The data files are not corrupted. If I read the blobs into memory as bytes, then construct ParquetFile from the bytes, it works. My guess now is that the issue is in GcsFileSystem – zpz Apr 25 '23 at 03:35
