
I have a dask dataframe, backed by parquet. It's 131 million rows; when I do some basic operations on the whole frame they take a couple of minutes.

df = dd.read_parquet('data_*.pqt')
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)
n_https = https.sum().compute()
n_http = http.sum().compute()

Time:

CPU times: user 2min 49s, sys: 23.9 s, total: 3min 13s
Wall time: 1min 53s

I naively thought that taking a sample of the data would bring this time down, so I did:

df = dd.read_parquet('data_*.pqt')
df = df.sample(frac=0.05)
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)
n_https = https.sum().compute()
n_http = http.sum().compute()

Time:

Unknown; I stopped it after 45 minutes.

I'm guessing that my sample can't be accessed efficiently for all my follow-on computations, but I don't know how to fix it.

I'm interested in the best way to sample data from a dask dataframe and then work with that sample.

birdsarah
  • If I do `df = df.get_partition(0)` then the computation runs quickly, but that is not a random sample of my data. My dataframe has the same number of partitions as there are files on disk, so I could just read in one file with pandas, but that's not the sampling I'd like to do. – birdsarah Mar 09 '18 at 04:49
  • No clue. I recommend reading [Understanding Performance](http://dask.pydata.org/en/latest/understanding-performance.html) documentation, and in particular trying the dask.distributed diagnostic dashboard to get a sense of what's taking up time. I recommend looking at the profile plot [youtube link](https://youtu.be/N_GqzcuGLCY) – MRocklin Mar 09 '18 at 13:21
  • Also, you might want to consider fusing the three compute calls into one with `a, b, c = dask.compute(http.sum(), https.sum(), unique_locations.size)` (see the sketch after these comments) – MRocklin Mar 09 '18 at 13:22
  • I've read Understanding Performance. I have some nice task graphs, but I don't have an understanding of what a fast vs. a slow task graph would look like. My slow computation has lots of parallel components that all come together at the end, which I thought would make it quick. – birdsarah Mar 09 '18 at 16:18
  • I am using dask on my local machine, not with distributed. I have used the profilers (ResourceProfiler, CacheProfiler, etc.) and visualized them after a task is complete, as documented here: http://dask.pydata.org/en/latest/diagnostics-local.html. But I don't see a way to visualize while the operation is in progress, so I don't understand how to leverage these tools for my slow-running process. (At last count I had waited over an hour with frac=0.01.) – birdsarah Mar 09 '18 at 16:21
  • I recommend using the dask.distributed scheduler, even on your local machine, just to get the dashboards. See http://dask.pydata.org/en/latest/setup/single-distributed.html . This will give you diagnostic information *during* execution. – MRocklin Mar 09 '18 at 16:30
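
Applied to the code in the question, the compute fusion suggested in the comments might look like this sketch (one dask.compute call shares the parquet read and unique() work across all three results, instead of repeating it per compute):

import dask
import dask.dataframe as dd

df = dd.read_parquet('data_*.pqt')
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')

# A single compute call lets the scheduler run the shared
# read_parquet and unique() work once for all three results.
total_locations, n_https, n_http = dask.compute(
    unique_locations.size, https.sum(), http.sum()
)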

2 Answers


I don't have a definitive or simple answer, but I do have a number of things that together solve my problem.

1) My code was inefficient; picking out only the specific columns I need to work on makes everything work. My new code:

import dask.dataframe as dd
from dask.distributed import Client, progress
client = Client()  # Took me a little while to get the settings correct

def get_df(*columns):
    files = '../cache_new/sample_*.pqt'
    # pass the varargs tuple through as a list of column names
    df = dd.read_parquet(files, columns=list(columns), engine='pyarrow')
    return df

# All data - Takes 31s
df_all = get_df('location')
unique_locations = df_all.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
_total_locations = unique_locations.size.persist()  # .size is a lazy scalar; persist() starts it computing in the background
_n_https = https.sum().persist()
_n_http = http.sum().persist()
progress(_total_locations, _n_https, _n_http)  # live progress bar from dask.distributed

# 1% sample data - Takes 21s
df_sample = get_df('location').sample(frac=0.01)
unique_locations = df_sample.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
_total_locations = unique_locations.size.persist()
_n_https = https.sum().persist()
_n_http = http.sum().persist()
progress(_total_locations, _n_https, _n_http)

This turns out not to be a big speed-up. The time taken for the whole computation is dominated by reading in the data. If the computation were very expensive I imagine I would see more of a speed-up.

2) I switched to using the distributed scheduler locally so I could see what was happening. But this was not without problems:

  1. I was experiencing some kind of bug with fastparquet that caused my processes to die, so I needed to use pyarrow (this was not a problem when not using the distributed client)
  2. I had to manually set the number of threads and memory_limit (see the sketch below)
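
A minimal sketch of what that manual configuration can look like; the worker, thread, and memory values here are illustrative, not the exact settings I used:

from dask.distributed import Client

client = Client(
    n_workers=4,           # e.g. one worker process per core
    threads_per_worker=1,  # fewer threads per worker limits GIL contention on string work
    memory_limit='4GB',    # per-worker memory cap; tune to your machine
)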

3) I discovered a bug in reading the same data in multiple times in a notebook - https://github.com/dask/dask/issues/3268

4) I am also being hit by a memory leak bug in pandas - https://github.com/pandas-dev/pandas/issues/19941#issuecomment-371960712

Between (3), (4), and the fact that my original code inefficiently read in all the columns, I see a number of reasons why my sample never worked, although I never found a definitive answer.

birdsarah

What's happening here is that by adding sample you're stopping an optimization from happening. When you do the following:

df = dd.read_parquet('data_*.pqt')
df.x.sum()

Dask cleverly rearranges this to actually be the following:

df = dd.read_parquet('data_*.pqt', columns=['x'])
df.x.sum()

Dask.dataframe only reads in the one column that you need. This is one of the few optimizations that dask.dataframe provides (it doesn't do much high-level optimization).

However, when you throw a sample (or any other operation) in there:

df = dd.read_parquet('data_*.pqt')
df.sample(...).x.sum()

Then you don't get the optimization, and so everything is slow.

So here it's not that sample is slow, it's that reading the entire dataset from parquet is slow, and that having sample in between the read_parquet and column-access steps blocks the optimization from happening.

Always specify columns in read_parquet

To avoid this, you should always specify the columns you need explicitly in dd.read_parquet.
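
Applied to the question's code, that might look like the following sketch (same glob path as in the question; only the location column is read from disk):

import dask.dataframe as dd

# With an explicit column selection, sample() can no longer hide the
# column access from read_parquet, so only 'location' is read.
df = dd.read_parquet('data_*.pqt', columns=['location'])
sample = df.sample(frac=0.05)
unique_locations = sample.location.unique()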

Eventually it would be nice to see some high-level framework provide query optimization that is more intelligent than what Dask dataframe has today. If you felt like pushing this forward, you would probably raise an issue on Ibis.

MRocklin