I'm working on an app that writes Parquet files. For testing purposes, I'm trying to read a generated file with pd.read_parquet. I get a really strange error that seems to ask for a schema:

self = <[AttributeError("'ParquetFile' object has no attribute '_schema'") raised in repr()] ParquetFile object at 0x7fae6e06b250>

This happens on the following line:

data = pd.read_parquet(file)

where file is the path to the file from the content root. First, I shouldn't have to provide a schema, since we're talking about Parquet here, and I'm not sure what could cause the issue. Maybe a read-permission problem?

The generated file looks good when I open it in my Parquet plugin for PyCharm:

{"Id": 12345, "Limit": 200, "Product": 818}
{"Id": 67890, "Limit": 3000, "Product": 819}

So it shouldn't be an issue with the input data.

NB: I tried the same with fastparquet directly and got the same error (which makes sense, as pd.read_parquet can use it as its engine).

Alex
  • This sounds most likely like an environment issue. How are you installing pandas? Did you try with pyarrow as the read parquet engine? What version of libraries are you using? – Micah Kornfield Aug 25 '21 at 03:45
  • By listing it in the requirements section, using pandas==1.1.5. I tried with pyarrow and I think I got an error about reading a 0-byte file or something similar; however, the file isn't empty... I'm also using fastparquet==0.7.1 – Alex Aug 27 '21 at 18:06
  • Have you been able to read the file with any library (e.g. the java one) or via command line tools? It sounds like maybe the file is malformed? – Micah Kornfield Aug 27 '21 at 20:55
  • No, I haven't yet. I agree. I thought about a permission issue, but it doesn't look like one, because it must already be able to open the file if it gets as far as looking for the schema – Alex Aug 31 '21 at 12:16
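
One quick way to test the "malformed file" idea from the comments, without any Parquet library, is to check the magic bytes. This is a minimal sketch, assuming an unencrypted Parquet file, which starts and ends with the 4 bytes PAR1. The helper name looks_like_parquet is made up for illustration; it is not part of pandas, pyarrow, or fastparquet.

```python
from pathlib import Path

# Unencrypted Parquet files begin and end with these 4 bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper: it only checks the outer framing, not the footer
    metadata, so a True result does not prove the file is fully valid.
    """
    data_path = Path(path)
    size = data_path.stat().st_size
    # Header magic (4) + footer length (4) + footer magic (4) is the bare minimum.
    if size < 12:
        return False
    with data_path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 means "relative to the end of the file"
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If this returns False for the generated file, the writer probably did not finish or close the file properly, which would also fit the "0-byte file" message seen with pyarrow.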

1 Answer

The same thing happened to me when I wrote the file with a compression setting of

df.to_parquet("sample.parquet",compression="uncompressed")

I changed it to none, and then it started working.

df.to_parquet("sample.parquet",compression="none")

Maybe in your case the environment is not set up correctly. Try installing another engine, such as fastparquet or pyarrow.
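
To see what the environment actually contains, something like the following may help. It is only a sketch: the function name installed_versions is hypothetical, and the package list is an assumption based on the libraries mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "pyarrow", "fastparquet")):
    """Map each package name to its installed version, or None if it is missing.

    Hypothetical helper; the default package list is an assumption taken
    from the libraries discussed in the question.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Once you know which engines are present, you can pick one explicitly, for example pd.read_parquet(file, engine="pyarrow"), to rule out a problem that is specific to one engine.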

tblaze