
I can't provide a reproducible example, but can someone explain why a .head() call becomes much slower after setting an index?

import dask.dataframe as dd
df = dd.read_parquet("Filepath")
df.head() # takes 10 seconds

df = df.set_index('id')

df.head() # takes 10+ minutes
AZhao

1 Answer


As stated in the docs, set_index sorts your data according to the new index, so that the divisions along that index split the data into its logical partitions. The sort is what takes the extra time, but once it is done, operations on that index will be much faster. head() on the raw file fetches from the first data chunk on disk, without regard for any ordering.
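To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Dask is implemented: the nested lists stand in for data chunks, and both function names are made up for illustration.

```python
# Toy model: each inner list stands in for one chunk (partition) on disk.
partitions = [[42, 7, 19], [3, 88, 61], [25, 1, 50]]


def head_unsorted(parts, n=2):
    """Return the first n rows in storage order; only the first chunk is read."""
    chunks_read = 1
    return parts[0][:n], chunks_read


def head_after_sort(parts, n=2):
    """Return the first n rows in sorted order; every chunk has to be read,
    because the smallest values could live in any of them."""
    chunks_read = len(parts)
    all_rows = sorted(row for part in parts for row in part)
    return all_rows[:n], chunks_read


rows_raw, read_raw = head_unsorted(partitions)
rows_sorted, read_sorted = head_after_sort(partitions)
print(rows_raw, read_raw)        # [42, 7] 1
print(rows_sorted, read_sorted)  # [1, 3] 3
```

In the toy model the sorted head needs all three chunks, while the unsorted head needs one. With a real dataset the same effect is multiplied by the number and size of the partitions.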

You can set the index without this sort, either with the index= keyword to read_parquet (maybe the data was already ordered?) or with .map_partitions(lambda df: df.set_index(..)). That raises the obvious question, though: why would you bother, and what are you trying to achieve? If the data were already sorted, you could also have used set_index(.., sorted=True), and maybe even the divisions keyword if you happen to have that information. This would not need the sort and would be correspondingly faster.

mdurant
  • This is specific to the dataset and extra information that may be in the file metadata or service from which the data came. If you use `index=` on a column without appropriate explicit metadata, dask will complain. – mdurant Jun 18 '19 at 13:16
  • I'll add that df = df.sort_values('id') has the same problem with head() – rafine Mar 22 '22 at 10:56