I created a DataFrame with pandas and wrote it directly to S3 using to_parquet(...).

The call was:

df.to_parquet('s3://bucket/fn.parquet', compression='gzip', engine='fastparquet', partition_cols=['col1'])

When I read it back with pandas.read_parquet(url), the DataFrame loads fine.

But when I use modin.pandas.read_parquet(url), I get the following error:

 File "/home/mguo/anaconda3/envs/testenv/lib/python3.7/site-packages/s3fs/core.py", line 1779, in __init__
    self.req_kw["IfMatch"] = self.details["ETag"]
KeyError: 'ETag'

My versions are:

python==3.7.3
pandas==1.2.4
modin==0.10.0
s3fs==2021.6.0
michaelgbj
  • Hi! Did you ever solve this issue? I am having the same problem when reading a parquet folder from S3, though I am using pyarrow instead of fastparquet. I see that the issue you referenced has been closed, so I do not know whether you ever found a solution: https://github.com/modin-project/modin/issues/3185 – jarias Nov 09 '21 at 15:10
  • Hey, unfortunately I moved on to a different solution (i.e. not using modin). I can go back and check whether it is working now. – michaelgbj Nov 09 '21 at 19:16

1 Answer

An issue on the Modin GitHub tracked support for reading partitioned Parquet files with read_parquet, which is what you are trying to do here. A pull request on the Modin GitHub added that feature and resolved the issue. If you upgrade to Modin 0.12.0 (the latest version at the time of writing), you should be able to read partitioned Parquet files without the ETag KeyError.
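As a rough sketch of that advice, you could check the installed Modin version before reading and fall back to plain pandas, which the question reports as working. This assumes Python 3.8+ (for importlib.metadata) and treats 0.12.0 as the minimum version purely because it is the release named above. The helper names (version_tuple, modin_supports_partitioned_parquet, read_partitioned_parquet) are made up for illustration and are not part of Modin or pandas.

```python
from importlib import metadata

# Assumption: 0.12.0 is the first Modin release with the fix, as stated in this answer.
MIN_MODIN = (0, 12, 0)


def version_tuple(version):
    """Turn a string like '0.10.0' into (0, 10, 0).

    Only the leading digits of each dot-separated piece are used, so a
    pre-release such as '0.12.0rc1' is treated like '0.12.0'.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions ('0.12') so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def modin_supports_partitioned_parquet(installed):
    """Return True if the given Modin version string is at least MIN_MODIN."""
    return version_tuple(installed) >= MIN_MODIN


def read_partitioned_parquet(url):
    """Read with Modin when it is new enough, otherwise with pandas."""
    try:
        installed = metadata.version("modin")
    except metadata.PackageNotFoundError:
        installed = None

    if installed is not None and modin_supports_partitioned_parquet(installed):
        import modin.pandas as mpd
        return mpd.read_parquet(url)

    # Fallback: plain pandas, which loads the dataset fine in the question.
    import pandas as pd
    return pd.read_parquet(url)
```

With the versions listed in the question (modin==0.10.0), read_partitioned_parquet('s3://bucket/fn.parquet') would take the pandas branch. After upgrading Modin it would use modin.pandas.read_parquet instead.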

  • I won't vote to delete, but this answer is all links. If you want the answer to stand, flesh it out a bit, giving at least a one-liner summary of what each link says, so the answer can stand and be useful even if/when those links die. – joanis Dec 09 '21 at 14:18
  • I added more description of each link, @joanis. Is that enough? – Mahesh Vashishtha Dec 09 '21 at 19:41