
When running the following code, the result of `head()` on a dask DataFrame depends on `npartitions`:

import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=3)
print(ddf.head())

This yields the following result:

   A  B
0  1  2

However, when I set npartitions to 1 or 2, I get the expected result:

   A  B
0  1  2
1  2  3
2  3  4

It seems to matter that npartitions is lower than the length of the dataframe. Is this intended?

Arco Bast
  • All your data (rows) is still there, though it won't be shown completely by `.head()`, `.tail()`, etc. But if you save it using `to_hdf()`, `to_csv()`, etc. then *all* rows will be written. – MaxU - stand with Ukraine Jul 09 '16 at 08:43

1 Answer


According to the documentation, `head()` only checks the first partition:

head(n=5, compute=True)

First n rows of the dataset

Caveat, this only checks the first n rows of the first partition.

So the answer is yes, `head()` is influenced by how many partitions there are in your dask dataframe.

However, the number of rows in the first partition is usually expected to be larger than the number of rows you want to show with `head()`; otherwise using dask wouldn't pay off. The only common case where this might not hold is when taking the first n rows/elements after filtering, as explained in this question.

dukebody