I have several parquet files, each with a column called idx,
which I would like to use as the index. The resulting dataframe has about 13M rows. I know that I can read the files and assign the index this way (which is slow, ~40 s):
df = dd.read_parquet("file-*.parq")
df = df.set_index("idx")
or this way (which is quick, ~40 ms):
df = dd.read_parquet("file-*.parq", index="idx")
A simple operation such as calculating the length is ~4x faster with the second method. What I don't understand is:
- In the first case
df.known_divisions
returns True, while in the second it returns False. I expected the opposite behaviour. I then ran several operations on top of df, and without known divisions I always get better performance. I'm scratching my head trying to figure out whether this is happening on purpose or not.
- The number of partitions is the number of files. How can I set a different number of partitions?
UPDATE
It is not just calculating len
that is faster. In my calculation I create 4 new dataframes using groupby, apply and join several times, and these are the timings:
| |Load and reindex (s)|Load with index (s)|
|:-----------------|-------------------:|------------------:|
| load | 12.5000 | 0.0124 |
| grp, apply, join | 11.4000 | 6.2700 |
| compute() | 146.0000 | 125.0000 |
| TOTAL | 169.9000 | 131.2820 |