I tried to read parquet from s3 like this:
import dask.dataframe as dd
s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
s3_path,
storage_options={
"client_kwargs": {
"endpoint_url": bucket_endpoint_url,
},
"profile_name": bucket_profile,
},
engine='pyarrow',
)
It takes a very long time just to create a dask dataframe. No computation is performed on this data frame yet. I trace code and it looks like, it is spending the time in pyarrow.parquet.validate_schema()
My parquet tables has lots of files in it (~2000 files). And it is taking 543 sec on my laptop just to create the data frame. And it is trying to check schema of each parquet file. Is there a way to disable schema validation?
Thanks,