
I'd like to get the length of each partition in a number of dataframes. At the moment I retrieve each partition and then take the size of its index, which is very, very slow. Is there a better way?

Here's a simplified snippet of my code:

    import dask.dataframe as dd
    from dask.distributed import wait as dask_wait

    # dask_client is a distributed Client created elsewhere in my code
    temp_dd = dd.read_parquet(read_str, gather_statistics=False)
    temp_dd = dask_client.scatter(temp_dd, broadcast=True)
    dask_wait([temp_dd])
    temp_dd = dask_client.gather(temp_dd)

    row_batch = 0
    while row_batch <= max_row:  # max_row is the index of the last partition
        row_batch_dd = temp_dd.get_partition(row_batch)
        row_batch_dd = row_batch_dd.dropna()
        row_batch_dd_len = row_batch_dd.index.size  # <-- this is the current way I'm determining the length
        row_batch = row_batch + 1

I note that, although I'm reading from Parquet, I can't simply use the Parquet metadata (which would be very fast) because, after reading, I do some partition-by-partition processing and then drop the NaNs. It's the post-processed length of each partition that I'm after.

dan

1 Answer

    import dask.dataframe as dd

    df = dd.read_parquet(fn, gather_statistics=False)
    df = df.dropna()
    df.map_partitions(len).compute()  # one row count per partition, computed in a single pass
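
If you want to see what this returns, here is a small self-contained sketch; the toy data and variable names are invented for illustration, not taken from the question:

    import pandas as pd
    import dask.dataframe as dd

    # toy frame: 6 rows in 2 partitions, with one NaN in each partition
    pdf = pd.DataFrame({"x": [1.0, None, 3.0, 4.0, None, 6.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    lengths = ddf.dropna().map_partitions(len).compute()
    print(lengths.tolist())  # [2, 2] -- rows remaining in each partition after dropna()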
MRocklin
  • explanation please? – eggie5 May 06 '20 at 17:28
  • 1. Read the parquet file into variable `df`. 2. Drop missing values from the data frame, `df`. 3. For each `df` partition, calculate the `len` (this is what `map_partitions` is doing), and return the calculated value to the user (this is what `compute` is doing). Until `compute` is called, everything is "lazy", so no work gets done. This works because in normal pandas `len(df)` gives you the number of rows in a dataframe; when you use `map_partitions`, each partition is passed into the function as a pandas dataframe. – Daniel Chen Jul 16 '20 at 14:50
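
To make the "lazy until `compute`" point concrete, here is a minimal sketch; the toy dataframe is invented for illustration:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=3)

    lazy = ddf.map_partitions(len)  # still lazy: a dask Series describing the work
    counts = lazy.compute()         # the per-partition row counts are only computed here
    print(counts.sum())             # 10 -- the counts add up to the total number of rows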