
When working with a large dask.DataFrame, it would often be useful to grab only a few rows on which to test all subsequent operations.

Currently, according to Slicing a Dask Dataframe, this is unsupported.

  • I was hoping to use head to achieve the same thing (since that command is supported), but it returns a regular pandas DataFrame.
  • I also tried df[:1000], which executes, but produces output different from what you'd expect from Pandas.

Is there any way to grab the first 1000 rows from a dask.DataFrame?

Stefan van der Walt

2 Answers


If your dataframe has a sensibly partitioned index, then I recommend using .loc:

small = big.loc['2000':'2005']

If you want to maintain the same number of partitions, you might consider sample:

small = big.sample(frac=0.01)

If you just want a single partition, you might try get_partition:

small = big.get_partition(0)

You can also always use to_delayed and from_delayed to build your own custom solution: http://dask.pydata.org/en/latest/dataframe-create.html#dask-delayed
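
For instance, a minimal sketch of such a custom solution (untested, and assuming the first partition holds at least 1000 rows) might look like:

import dask
import dask.dataframe as dd

# One dask.delayed object per partition, each wrapping a pandas DataFrame
parts = df.to_delayed()

# Lazily truncate the first partition to 1000 rows
first = dask.delayed(lambda part: part.head(1000))(parts[0])

# Rebuild a single-partition dask DataFrame from the delayed object
small = dd.from_delayed([first])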

More generally, Dask.dataframe doesn't keep row counts per partition, so the specific question of "give me 1000 rows" ends up being surprisingly hard to answer. It's a lot easier to answer questions like "give me all the data in January" or "give me the first partition".
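
For example, assuming a frame with a datetime index and known divisions, the "data in January" question is just another .loc slice:

january = big.loc['2000-01-01':'2000-01-31']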

MRocklin
  • Thanks, grabbing the partition is good enough for my purpose. Any comment on whether `df[:1000]` does the right thing or not, compared to Pandas? – Stefan van der Walt Mar 06 '18 at 21:01
  • Pandas devs would recommend not using `df[:1000]` and using `.loc` or `.iloc` explicitly. You were probably referring to `.iloc`, which is not supported in dask.dataframe. – MRocklin Mar 06 '18 at 22:26
  • This behavior is described both in the docs (http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges) and is mentioned by Wes in at least one S/O answer. So I suspect that silently doing something different from Pandas here is going to trip users up. – Stefan van der Walt Mar 07 '18 at 01:18

You may repartition your initial DataFrame into an arbitrary number of partitions. If you want slices of 1000 rows:

# len() on a dask DataFrame triggers a full row-count computation
npart = round(len(df) / 1000)
parted_df = df.repartition(npartitions=npart)

Then just select the partition you want:

first_1000_rows = parted_df.partitions[0]

Note that unless the number of rows in your initial DataFrame is a multiple of 1000, you won't get exactly 1000 rows.
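
As a quick sanity check, you can inspect the actual per-partition row counts (this is plain dask API, though it does trigger a computation):

sizes = parted_df.map_partitions(len).compute()  # one row count per partition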

Skippy le Grand Gourou