
I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...)

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...)

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
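
To make the question concrete, here is roughly how these parameters show up in code (a minimal sketch on my part; "data.csv" is just a placeholder path):

import dask.dataframe as dd

df = dd.read_csv("data.csv")               # placeholder path

df.head(5)                                 # by default only looks in the first partition
df.head(5, npartitions=-1)                 # search all partitions instead

df_small = df.repartition(npartitions=4)   # per the docs above, must be fewer
                                           # partitions than the input has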

Would the number of partitions be 5 in this case?

(Image source: http://dask.pydata.org/en/latest/dataframe-overview.html )

Martin Thoma

2 Answers


The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.

  1. If you don't have enough partitions then you may not be able to use all of your cores effectively. For example, if your dask.dataframe has only one partition, then only one core can operate at a time.
  2. If you have too many partitions then the scheduler may incur a lot of overhead deciding where to compute each task.

Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
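
For example, a minimal sketch (using dd.from_pandas on synthetic data, purely for illustration) of how you might inspect npartitions and compare it against the number of cores:

import multiprocessing

import pandas as pd
import dask.dataframe as dd

# Synthetic data purely for illustration
pdf = pd.DataFrame({"x": range(1_000_000)})
df = dd.from_pandas(pdf, npartitions=8)

print(df.npartitions)               # -> 8 pandas dataframes under the hood
print(multiprocessing.cpu_count())  # rule of thumb: npartitions ~ a few times this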

You can determine the number of partitions either at data-ingestion time, using parameters like blocksize= in read_csv(...), or afterwards with the .repartition(...) method.
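
A rough sketch of both approaches (the file pattern and block size here are assumptions, not anything from the question):

import dask.dataframe as dd

# At ingestion time: blocksize controls how much of the CSV goes into each partition
df = dd.read_csv("data-*.csv", blocksize="64MB")   # hypothetical file pattern
print(df.npartitions)

# Afterwards: change the number of partitions explicitly
df = df.repartition(npartitions=100)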

MRocklin

I tried to check what the optimal number is for my case. I work on a laptop with 8 cores. I have a 100 GB CSV file with 250M rows and 25 columns. I ran describe() with 1, 5, 30, and 1000 partitions:

# Time describe() at different partition counts
df = df.repartition(npartitions=1)
a1 = df['age'].describe().compute()

df = df.repartition(npartitions=5)
a2 = df['age'].describe().compute()

df = df.repartition(npartitions=30)
a3 = df['age'].describe().compute()

df = df.repartition(npartitions=1000)
a4 = df['age'].describe().compute()

About speed:

5 and 30 partitions: around 3 minutes

1 and 1000 partitions: around 9 minutes

However, I found that order-based statistics like the median or percentiles give the wrong number when I use more than one partition (with multiple partitions, Dask computes these quantiles only approximately).

With 1 partition I get the right number (I checked it with small data using both pandas and Dask).
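
One way to reproduce this on small synthetic data (a sketch; quantile(0.5) stands in for the median, and the column name "age" matches my data):

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"age": np.random.randint(0, 100, size=100_000)})

exact = pdf["age"].quantile(0.5)                                            # pandas, exact
one   = dd.from_pandas(pdf, npartitions=1)["age"].quantile(0.5).compute()   # matches pandas
many  = dd.from_pandas(pdf, npartitions=30)["age"].quantile(0.5).compute()  # may differ slightly

print(exact, one, many)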

rafine