modin pandas read_parquet() failed on ETag KeyError trying to read a partitioned parquet from s3

Question

I created a DataFrame with pandas and wrote it directly to S3 with to_parquet(...), using these arguments:

df.to_parquet('s3://bucket/fn.parquet', compression='gzip', engine='fastparquet', partition_cols=['col1'])

When I read it back with pandas.read_parquet(url), the DataFrame loads fine.

But when I use modin.pandas.read_parquet(url), I get the following error:

 File "/home/mguo/anaconda3/envs/testenv/lib/python3.7/site-packages/s3fs/core.py", line 1779, in __init__
    self.req_kw["IfMatch"] = self.details["ETag"]
KeyError: 'ETag'

My versions are:

python==3.7.3
pandas==1.2.4
modin==0.10.0
s3fs==2021.6.0

Hi! Did you ever solve this issue? I am having the same problem when reading a parquet folder from S3, though I am using pyarrow instead of fastparquet. I see that the issue you referenced has been closed, so I do not know whether you ever found a solution. https://github.com/modin-project/modin/issues/3185 – jarias, Nov 09 '21 at 15:10
Hey, unfortunately I moved on to a different solution (i.e. not using modin). I can go back and check whether it is working now. – michaelgbj, Nov 09 '21 at 19:16

Mahesh Vashishtha · Answer 1 · 2021-12-09T19:41:45.143

1

An issue on the Modin GitHub tracked support for reading partitioned files with read_parquet, which is what you are trying to do here. A pull request on the Modin GitHub added that feature and resolved the issue. You should be able to read partitioned parquet files without the ETag KeyError if you upgrade to Modin 0.12.0, the latest version at the time of writing.
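
As a rough sketch of how you might guard against running into this on an older install: the snippet below checks the installed Modin version before calling read_parquet. The helper names (parse_version, has_partitioned_parquet_support, read_partitioned_parquet) are made up for illustration, and the 0.12.0 threshold is only an assumption taken from the version mentioned here; the fix may have shipped in an earlier release.

```python
import re

# Assumption: 0.12.0 is simply the version named in this answer; the fix
# may have landed in an earlier release.
MIN_MODIN_VERSION = (0, 12, 0)


def parse_version(text):
    """Turn a version string such as '0.10.0' into a 3-tuple of ints."""
    public = text.split("+")[0]  # drop a local suffix such as '+5.gabc'
    numbers = [int(n) for n in re.findall(r"\d+", public)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad '0.12' to (0, 12, 0)
    return tuple(numbers)


def has_partitioned_parquet_support(version_text):
    """True if the version is at least the assumed minimum."""
    return parse_version(version_text) >= MIN_MODIN_VERSION


def read_partitioned_parquet(url):
    """Read a partitioned parquet dataset with Modin, failing early on old versions."""
    # Imported lazily so the helpers above work without Modin installed.
    import modin
    import modin.pandas as mpd

    if not has_partitioned_parquet_support(modin.__version__):
        raise RuntimeError(
            f"Modin {modin.__version__} is older than the assumed minimum; "
            "try upgrading, e.g. with: pip install -U modin"
        )
    return mpd.read_parquet(url)
```

With the versions from the question, has_partitioned_parquet_support("0.10.0") is False, so read_partitioned_parquet('s3://bucket/fn.parquet') would raise a clear error instead of the ETag KeyError.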

edited Dec 09 '21 at 19:41

answered Dec 08 '21 at 18:34

Mahesh Vashishtha


I won't vote to delete, but this answer is all links. If you want the answer to stand, flesh it out a bit, giving at least a one-liner summary of what each link says, so the answer can stand and be useful even if/when those links die. – joanis Dec 09 '21 at 14:18
I added more description of each link, @joanis. Is that enough? – Mahesh Vashishtha Dec 09 '21 at 19:41
