
I was running a parallel groupby computation with dask, using pyarrow to load parquet files from S3. The same piece of code randomly either succeeds or fails, and the failures come with different error messages. The same issue occurs when using fastparquet. With pyarrow, one failure looks like this:

File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2309). Detail: Python exception: ssl.SSLError

or it fails with a different error:

File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2309). Detail: Python exception: ssl.SSLError

I was using the dask processes scheduler. With the threads scheduler the same code runs reliably, but it is extremely slow. Is this behavior expected from dask?

zhh210
  • Sounds like a bug to be reported to arrow. If you post the exception for fastparquet and more detail of your s3 setup and how you are calling dask, I may be able to help. – mdurant Feb 27 '20 at 21:04

0 Answers