I have a Dask dataframe backed by parquet. It's 131 million rows; when I do some basic operations on the whole frame, they take a couple of minutes.
import dask.dataframe as dd

df = dd.read_parquet('data_*.pqt')
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)  # len() forces a compute
n_https = https.sum().compute()
n_http = http.sum().compute()
Time:
CPU times: user 2min 49s, sys: 23.9 s, total: 3min 13s
Wall time: 1min 53s
I naively thought that if I took a sample of the data I could bring this time down, so I did:
import dask.dataframe as dd

df = dd.read_parquet('data_*.pqt')
df = df.sample(frac=0.05)  # take a 5% sample of the rows
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)
n_https = https.sum().compute()
n_http = http.sum().compute()
Time:
Unknown; I stopped it after 45 minutes.
I'm guessing that my sample can't be accessed efficiently for the follow-on computations, but I don't know how to fix it. I'm interested in the best way to sample data from a dask dataframe and then work with that sample (one idea I've had, but not verified, is sketched below).
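One thing I've considered, though I haven't confirmed it solves the problem, is persisting the sample so that the follow-on computations reuse it rather than re-reading and re-sampling the parquet files for each result. A rough sketch of what I mean (the assumption here is that persist() keeps the sampled partitions in memory):

import dask.dataframe as dd

df = dd.read_parquet('data_*.pqt')

# Assumption: persisting materializes the 5% sample once, so the
# computations below operate on the cached sample instead of rebuilding
# it from the parquet files each time.
sample = df.sample(frac=0.05).persist()

unique_locations = sample.location.unique()
n_https = unique_locations.str.startswith('https:').sum().compute()
n_http = unique_locations.str.startswith('http:').sum().compute()

Is something like this the right approach, or is there a better pattern for this?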