7

I have a dask dataframe with an index on one of the columns. The issue is that df.head() always returns an empty dataframe, whereas df.tail() always returns the correct result. I checked, and df.head() only looks for the first n entries in the first partition. So if I call df.reset_index(), it should work, but that's not the case.

Below is the code to reproduce this:

import numpy as np
import dask.dataframe as dd
import pandas as pd

data = pd.DataFrame({
     'i64': np.arange(1000, dtype=np.int64),
     'Ii32': np.arange(1000, dtype=np.int32),
     'bhello': np.random.choice(['hello', 'Yo', 'people'], size=1000).astype("O")
})

daskDf = dd.from_pandas(data, chunksize=3)
daskDf = daskDf.set_index('bhello')
print(daskDf.head())
pranav kohli
  • Please can you change this example to be runnable? I'm trying to recreate on my computer and having to work through multiple steps that needn't exist i.e. making a class, working out what `dd` is (I assumed `import dask as dd` but I'm getting errors) – roganjosh May 25 '18 at 07:54
  • It gives a warning, aligned with the answer by @coldspeed. `UserWarning: Insufficient elements for head. 5 elements requested, only 0 elements available. Try passing larger npartitions to head.` – roganjosh May 25 '18 at 07:57

2 Answers

15

Try calling head with npartitions=-1, to use all partitions (by default, only the first is used, and there may not be enough elements to return the head).

daskDf.head(npartitions=-1)
cs95
0

This works as expected for me

In [1]: import numpy as np

In [2]: import dask.dataframe as dd
   ...: import pandas as pd
   ...: 
   ...: data = pd.DataFrame({
   ...:      'i64': np.arange(1000, dtype=np.int64),
   ...:      'Ii32': np.arange(1000, dtype=np.int32),
   ...:      'bhello': np.random.choice(['hello', 'Yo', 'people'], size=1000).astype("O")
   ...: })
   ...: 

In [3]: daskDf = dd.from_pandas(data, chunksize=3)

In [4]: daskDf
Out[4]: 
Dask DataFrame Structure:
                  Ii32  bhello    i64
npartitions=333                      
0                int32  object  int64
3                  ...     ...    ...
...                ...     ...    ...
996                ...     ...    ...
999                ...     ...    ...
Dask Name: from_pandas, 333 tasks

In [5]: daskDf.head()
/home/mrocklin/workspace/dask/dask/dataframe/core.py:4221: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[5]: 
   Ii32 bhello  i64
0     0     Yo    0
1     1     Yo    1
2     2  hello    2
MRocklin
  • Yeah, actually I had an index on the 'bhello' column. Because of that, my first partition was coming up empty, and hence df.head() was returning an empty df. Doing df.head(npartitions=-1) worked for me – pranav kohli May 28 '18 at 07:06