
I tried to read a parquet table from S3 like this:

import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine="pyarrow",
)

It takes a very long time just to create the Dask dataframe; no computation has been performed on it yet. I traced the code and it looks like the time is being spent in pyarrow.parquet.validate_schema().

My parquet table has lots of files in it (~2000), and it takes 543 seconds on my laptop just to create the dataframe. It appears to be checking the schema of every parquet file. Is there a way to disable schema validation?

Thanks,

Sean Nguyen

1 Answer


Currently, if there is no _metadata file and you're using the PyArrow backend, then Dask is probably sending a request to read metadata from each of the individual partitions on S3. This is quite slow.

Dask's dataframe parquet reader is being rewritten now to help address this. Until then, you might consider using fastparquet together with the ignore_divisions keyword (or something like that), or checking back in a month or two.
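As a rough sketch of that workaround, the call from the question could be switched to the fastparquet engine as shown below. It reuses the bucket path, endpoint, and profile placeholders from the question and only swaps the engine; which keyword (if any) is needed to skip division calculation depends on your Dask version.

import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {"endpoint_url": bucket_endpoint_url},
        "profile_name": bucket_profile,
    },
    # fastparquet is suggested above as a workaround while the
    # pyarrow-based reader is being rewritten
    engine="fastparquet",
)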

MRocklin
  • What's the status of this today? It still seems very slow – bschreck Sep 24 '19 at 16:41
  • Things have improved and people report significantly faster reading in these cases. Additionally, Arrow has improved and now supports writing/reading metadata files. If you're running into performance issues then I recommend producing an MCVE and a profile, and raising an issue on GitHub. – MRocklin Sep 25 '19 at 17:07
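For readers landing here later, the metadata-file approach mentioned in the comment above can be sketched roughly as follows with pyarrow; the table contents and output path are hypothetical, and details may vary between pyarrow versions.

import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical example table and dataset location
table = pa.table({"x": [1, 2, 3]})
root_path = "my_table"

# collect the footer metadata of each file as it is written
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# write a consolidated _metadata file so readers can skip opening every part
pq.write_metadata(
    table.schema,
    root_path + "/_metadata",
    metadata_collector=metadata_collector,
)

With a _metadata file in place, a reader can get the schema and row-group information from a single object instead of fetching the footer of each of the ~2000 files.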