I have used Spark in the past and, honestly, coming from a predominantly Python background, it was quite a big leap. When I saw Dask, I thought it would be a much better fit for distributed computing; however, I seem to be running into an issue with simply reading in parquet files.
I have an S3 bucket, say s3://some-bucket/, and within it my parquet files are saved with Hive partitioning, as shown below.
├── year=2019
│   ├── month=01
│   └── ...
└── year=2020
    └── month=01
        └── day=01
            ├── datasource=browser
            │   └── ...
            └── datasource=iphone
                ├── part1.parquet
                ├── part2.parquet
                └── ...
There is no _metadata or _common_metadata file.
To demonstrate how much slower reading these files is in Dask compared to Spark, here is an example.
In Spark, the following takes ~ 90 seconds:
df = spark.read.parquet("s3://some-bucket/year=2020/month=04/")
df.createOrReplaceTempView("df")
res = spark.sql("SELECT column1, SUM(column2) FROM df GROUP BY column1")
res = res.toPandas() # convert to pandas to be fair to dask, but we know this isn't fast in spark
The exact same thing in Dask takes ~800 seconds:
from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.dataframe as dd

# Start a YARN cluster and connect to it
cluster = YarnCluster()
client = Client(cluster)
cluster.scale(10)

df = dd.read_parquet("s3://some-bucket/year=2020/month=04/*/*/*.parquet",
                     columns=["column1", "column2"],
                     engine="pyarrow",
                     gather_statistics=False,
                     )
res = df.groupby('column1').column2.sum().compute()
Using "s3://some-bucket/year=2020/month=04/"
rather than "s3://some-bucket/year=2020/month=04/*/*/*.parquet"
, takes ~2100 seconds.
One thing I have already tried is reading the metadata of just one small parquet part, extracting the pyarrow schema, and passing this through as a kwarg along with validate_schema=False, like so:
import pyarrow.parquet as pq
import s3fs
import dask.dataframe as dd

s3 = s3fs.S3FileSystem()

# we pick a small part of the dataset
ds = pq.ParquetDataset(
    "s3://some-bucket/year=2020/month=04/day=01/datasource=iphone/part1.parquet",
    filesystem=s3)

# read in the table and write it out to a temp parquet file
tbl = ds.read()
pq.write_table(tbl, "tmp.parquet")

# read the metadata back with pyarrow and extract the pyarrow._parquet.Schema
pa_metadata = pq.read_metadata("tmp.parquet")
pa_schema = pa_metadata.schema

df_dask = dd.read_parquet("s3://some-bucket/year=2020/month=04/day=01/*/*.parquet",
                          columns=["column1", "column2"],
                          engine="pyarrow",
                          gather_statistics=False,
                          dataset=dict(validate_schema=False, schema=pa_schema)
                          )
With this method, looking at only one day's partition, I see a significant speed-up (~4x). As soon as I look at a month's worth of data, my workers get killed off (I assume because Dask is trying to read too much data onto one specific node?).
Unfortunately, I am not able to change the structure of my data. Having read this, I realise that if I had a _metadata or _common_metadata file, then I would see a significant speed-up. However, this is not possible for me.
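For context, my understanding is that a _metadata file is essentially the dataset schema plus the row-group footers of all the parts, with each row group pointing at its file, so in principle one could be built after the fact from the existing parts. A rough, untested sketch of what I think that would look like (assuming a pyarrow version that exposes FileMetaData.set_file_path, append_row_groups and write_metadata_file; paths are illustrative, and all parts must share one schema) is shown here, but it isn't something I can actually do in my setup:
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Collect the footer metadata of every parquet part under one month
paths = sorted(s3.glob("some-bucket/year=2020/month=04/*/*/*.parquet"))

merged = None
for path in paths:
    with s3.open(path, "rb") as f:
        md = pq.read_metadata(f)
    # file paths stored in _metadata should be relative to the dataset root
    md.set_file_path(path.split("month=04/", 1)[1])
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

# write the combined footer locally, then upload it next to the data
merged.write_metadata_file("_metadata")
s3.put("_metadata", "some-bucket/year=2020/month=04/_metadata")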
So, why is Dask so much slower than Spark in this particular use case? And more specifically, is there anything I can do to speed up reading the parquet files in Dask?
Extra details
* There are a total of over 1000 columns.
* Data is saved out using Java (we can't change this)
* Package versions: dask==2.15.0, dask-yarn==0.8.1, distributed==2.15.2, pyarrow==0.17.0
* Each day=* partition is typically about 6.4 GB on disk, with the largest datasource=* partition being about 1.5 GB. Individual parquet files are between 13 MB and 150 MB.
* We tried fastparquet as an alternative to the pyarrow engine, but it was slower than pyarrow (see the sketch below).
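For completeness, the fastparquet attempt was essentially the same read as above with the engine swapped; a minimal sketch (same illustrative paths and columns) is:
import dask.dataframe as dd

# Same read as the pyarrow example above, but with the fastparquet engine.
# This was slower than pyarrow for us.
df = dd.read_parquet("s3://some-bucket/year=2020/month=04/*/*/*.parquet",
                     columns=["column1", "column2"],
                     engine="fastparquet",
                     gather_statistics=False,
                     )
res = df.groupby('column1').column2.sum().compute()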