Is it possible to get the partition ID in Dask after splitting a pandas DataFrame?

For example:

import dask.dataframe as dd
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,2), columns=["A","B"])
df_parts = dd.from_pandas(df, npartitions=2)
part1 = df_parts.get_partition(0)

Of the two partitions, part1 is the first. Is it possible to do something like the following?

part1.get_partition_id() => which will return 0 or 1

Or is it possible to get the partition ID by iterating through df_parts?

data_person

1 Answer

I'm not sure whether there is a built-in function for this, but you can get what you want with enumerate(df_parts.to_delayed()).

to_delayed returns a list of delayed objects, one per partition and in partition order, so you can iterate over them and use the index from enumerate as the partition ID.

SultanOrazbayev