
I have used Spark in the past and, honestly, coming from a predominantly Python background it was quite a big leap. When I saw dask, I thought it would be a much better solution for distributed computing; however, I seem to be running into an issue with simply reading in parquet files.

I have an s3 bucket, say s3://some-bucket/. Within this I have my parquet files saved with Hive partitioning, as shown below.

├── year=2019
│   ├── month=01
│   ├── month=02
│   └── ...
└── year=2020
    └── month=01
        ├── day=01
        │   ├── datasource=browser
        │   │   └── ...
        │   └── datasource=iphone
        │       ├── part1.parquet
        │       └── part2.parquet
        └── ...

There is no _metadata or _common_metadata file.
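
A quick sketch of how I confirmed this, using s3fs to list everything under one month prefix (the bucket and prefix are the placeholder names from above):

import s3fs

fs = s3fs.S3FileSystem()

# recursively list every object under the month prefix; there is no
# _metadata or _common_metadata anywhere, only the part files themselves
paths = fs.find("s3://some-bucket/year=2020/month=04/")
print([p for p in paths if p.endswith("metadata")])  # -> []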

To demonstrate how much slower reading these files is in dask compared to Spark, here is an example. In Spark, the following takes ~90 seconds:

# spark is the pre-existing SparkSession
df = spark.read.parquet("s3://some-bucket/year=2020/month=04/")

df.createOrReplaceTempView("df")

res = spark.sql("SELECT column1, SUM(column2) FROM df GROUP BY column1")

res = res.toPandas()  # convert to pandas to be fair to dask, but we know this isn't fast in spark

The exact same thing in dask takes ~800 seconds:

from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.dataframe as dd

cluster = YarnCluster()

# Connect to the cluster
client = Client(cluster)

cluster.scale(10)

df = dd.read_parquet("s3://some-bucket/year=2020/month=04/*/*/*.parquet",
                     columns=["column1", "column2"],
                     engine="pyarrow",
                     gather_statistics=False,
)

res = df.groupby('column1').column2.sum().compute()

Using "s3://some-bucket/year=2020/month=04/" rather than "s3://some-bucket/year=2020/month=04/*/*/*.parquet", takes ~2100 seconds.

One thing I have already tried is reading the metadata of just one small parquet part, extracting the pyarrow schema, and passing this through as a kwarg along with validate_schema=False, like so:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# we pick a small part of the dataset
ds = pq.ParquetDataset(
    "s3://some-bucket/year=2020/month=04/day=01/datasource=iphone/part1.parquet",
    filesystem=s3,
)

# read in the table and write it out to a temp parquet
tbl = ds.read()
pq.write_table(tbl, "tmp.parquet")

# read in the metadata using pyarrow and extract the pyarrow._parquet.Schema
pa_metadata = pq.read_metadata("tmp.parquet")
pa_schema = pa_metadata.schema

df_dask = dd.read_parquet("s3://some-bucket/year=2020/month=04/day=01/*/*.parquet",
                          columns=["column1", "column2"],
                          engine="pyarrow",
                          gather_statistics=False,
                          dataset=dict(validate_schema=False, schema=pa_schema),
)

With this method, looking at only one day's partition, I see a significant speed-up (~4x). As soon as I look at a month's worth of data, my workers get killed off (I assume because dask is trying to read too much data into one specific node?).
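
For clarity, the month-scale read is the identical call with the glob widened back out; this is the version that kills the workers:

# same kwargs as the day-level read above, but across the whole month
df_dask = dd.read_parquet("s3://some-bucket/year=2020/month=04/*/*/*.parquet",
                          columns=["column1", "column2"],
                          engine="pyarrow",
                          gather_statistics=False,
                          dataset=dict(validate_schema=False, schema=pa_schema),
)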

Unfortunately, I am not able to change the structure of my data. Having read this, I realise that if I had a _metadata or _common_metadata file I would see a significant speed-up. However, this is not possible for me.
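
For anyone reading this who is able to write back into the bucket: my understanding is that a _metadata file can be stitched together from the individual part footers roughly as below. This is only a sketch that I have not been able to use myself, and it assumes the set_file_path / append_row_groups / write_metadata_file methods on pyarrow's FileMetaData (a newer pyarrow than 0.17 may be needed):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
root = "some-bucket/year=2020/month=04"

# collect the footer of every part file, recording each file's path
# relative to the dataset root, then concatenate the row-group metadata
metadata = None
for path in fs.find(root):
    if not path.endswith(".parquet"):
        continue
    with fs.open(path, "rb") as f:
        md = pq.read_metadata(f)
    md.set_file_path(path[len(root) + 1:])  # e.g. "day=01/datasource=iphone/part1.parquet"
    if metadata is None:
        metadata = md
    else:
        metadata.append_row_groups(md)

# write the combined footer locally, then upload it next to the data
metadata.write_metadata_file("_metadata")
fs.put("_metadata", root + "/_metadata")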

So, why is dask so much slower than Spark in this particular use case? And more specifically, is there anything I can do to speed up the reading of the parquet files in dask?

Extra details

* There are a total of over 1000 columns.
* Data is saved out using Java (we can't change this).
* Package versions: dask==2.15.0, dask-yarn==0.8.1, distributed==2.15.2, pyarrow==0.17.0.
* Each day=* is typically about 6.4 GB on disk, with the largest datasource=* partition being about 1.5 GB. Individual parquet files are between 13 MB and 150 MB.
* We tried fastparquet as an alternative to the pyarrow engine, but it was slower than pyarrow (the call is spelled out after this list).
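
The fastparquet attempt was along these lines (a sketch showing only the engine swap; otherwise the same kwargs as the pyarrow read above):

# identical read to the pyarrow version, using the fastparquet engine
df = dd.read_parquet("s3://some-bucket/year=2020/month=04/*/*/*.parquet",
                     columns=["column1", "column2"],
                     engine="fastparquet",
                     gather_statistics=False,
)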

  • Can you please specify your versions of dask, distributed and arrow? Also, what is the typical size (bytes and rows) of each part and the data types of the columns? – mdurant May 06 '20 at 14:11
  • (also, out of curiosity, how well does fastparquet do?) – mdurant May 06 '20 at 14:13
  • @mdurant column1 is a 64-bit integer and column2 is a 1-bit integer. `dask==2.15.0` `dask-yarn==0.8.1` `distributed==2.15.2` `pyarrow==0.17.0`. `fastparquet` is much, much slower. I gave up timing it in the end. I'm not 100% sure on the typical sizes. Let me get some actual numbers for you – bgrantham May 06 '20 at 14:15
  • @mdurant it was quicker than I thought to get typical sizes. Each `day=*` is typically about 6.4gb on hard disk, with the largest `datasource=*` partition being about 1.5gb. Individual parquet files are between 13MB and 150MB – bgrantham May 06 '20 at 14:21
  • There are several columns, or just the two? It would be good to put this info into the question. I don't know when I can answer, but you may want to start by looking at the information in the dask dashboard, status and profile tabs. – mdurant May 06 '20 at 14:33
  • Furthermore, it would be good, if possible, to specify how the data were created (sorry!) – mdurant May 06 '20 at 14:41
  • @mdurant I will also add all this detail to the question. There are 1000s of columns, but as parquet is column based and I'm only using 2 (by using the `columns` kwarg) I assumed that didn't matter. The data is saved out using Java. Not sure of the exact details as it's generated by a different team – bgrantham May 07 '20 at 09:29

0 Answers