3

When I load data from Parquet or CSV files, the resulting Dask DataFrame has its divisions set to None. The Dask docs have no information about how to set or calculate them.

How do I correctly set and calculate the divisions of a Dask DataFrame?

VadimCh
  • Have you read this [doc](http://docs.dask.org/en/latest/dataframe-design.html)? – rpanai Jun 05 '19 at 15:04
  • Yes, I read it. That doc shows how to call set_index with precalculated divisions, but by what rule should I calculate the divisions? – VadimCh Jun 05 '19 at 20:10

2 Answers

1

If you read from Parquet, you can use infer_divisions=True, as in this example:

import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)

If needed, you can also set the index directly while reading:

df = dd.read_parquet("file.parq", index="my_col",
                     infer_divisions=True)
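
On the rule behind divisions: Dask describes them as a sequence of npartitions + 1 values, holding the minimum index value of each partition plus the maximum of the last one. The following is only a plain-Python sketch of that rule, not Dask's implementation; the helpers compute_divisions and partition_of and the sample keys are hypothetical, and the keys are assumed to be sorted and unique.

```python
from bisect import bisect_right


def compute_divisions(sorted_keys, npartitions):
    """Pick npartitions + 1 boundaries from an already sorted list of index values."""
    n = len(sorted_keys)
    # Start position of each partition, spaced as evenly as possible.
    starts = [i * n // npartitions for i in range(npartitions)]
    # Lower bound of every partition, plus the overall maximum as the final boundary.
    return [sorted_keys[s] for s in starts] + [sorted_keys[-1]]


def partition_of(key, divisions):
    """Return the number of the partition that holds `key`."""
    last = len(divisions) - 2
    # Partition i covers divisions[i] <= key < divisions[i + 1];
    # the last partition also includes its upper bound.
    return min(bisect_right(divisions, key) - 1, last)


keys = sorted([7, 1, 9, 3, 5, 2, 8, 4])  # hypothetical index values
divisions = compute_divisions(keys, npartitions=4)
print(divisions)                   # [1, 3, 5, 8, 9]
print(partition_of(5, divisions))  # 2
```

This is why an unsorted index cannot have divisions inferred from the files alone: without sorted data there are no boundaries that cleanly separate one partition from the next.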
rpanai
  • 'df = dd.read_parquet("file.parq", index="my_col", infer_divisions=True)' This works only if the index of the underlying dataset is sorted across the individual Parquet files. What is the solution if the index is not sorted? – VadimCh Jun 05 '19 at 20:01
  • No, that's correct. But the Dask engine cannot infer divisions with an unsorted index. Maybe there is another solution? – VadimCh Jun 05 '19 at 20:13
  • But with dd.read_parquet("file.parq", infer_divisions=True) the index needs to be specified. – VadimCh Jun 05 '19 at 20:16
  • You can use it without specifying the index. In that case it's going to set divisions over your index 0,..,n – rpanai Jun 05 '19 at 20:19
  • No, it doesn't, with either pyarrow or fastparquet. – VadimCh Jun 05 '19 at 20:31
  • ValueError: Unable to infer divisions for because no index column was discovered – VadimCh Jun 05 '19 at 20:35
  • Would you mind sharing a sample of your df, or creating an [mcve](/help/mcve)? – rpanai Jun 05 '19 at 21:11
0

OK, I did this:

divisions = list(range(f.npartitions))
f = f.set_index(f.index, divisions=divisions).persist()

Then I ran:

f.groupby('userId').first().compute()

But the last operation is dramatically slow!

VadimCh