
Say I have a Dask dataframe. I'd like to do some operations on it, then save it to CSV and print its length.

As I understand it, the following code will make Dask compute `df` twice, am I right?

df = dd.read_csv('path/to/file', dtype=some_dtypes)
#some operations...
df.to_csv("path/to/out/*")
print(len(df))

Is it possible to avoid computing twice?

upd. That's what happens when I use the solution by @mdurant: [screenshot of the computed result]

but there are actually almost 6 times fewer rows


elfinorr
  • Calculating the length of a lazy dataframe efficiently requires some work. See [this question](https://stackoverflow.com/questions/41902069/slow-len-function-on-dask-distributed-dataframe). Or try bringing into memory once, i.e. `df = df.compute(), df.to_csv(...), print(len(df))`. – jpp Jul 30 '18 at 12:54
  • Thank you for your answer. I've read that topic, it's about a slightly different thing. Well, I get it, I either bring it into memory or compute it a second time. – elfinorr Jul 30 '18 at 13:15
  • Yeh I'm not sure if `df.to_csv` is lazy or not when used with `dask`. If *not*, then you might as well read into memory before you use `to_csv`. – jpp Jul 30 '18 at 13:18
  • `dd.to_csv` starts computations, as far as I know. I thought about it, but I'm not sure. Say I have data split into 20 partitions and I'd like to save it as 1 CSV file. If I do `df.compute().to_csv(...)`, it will give me 1 file, but wouldn't that be inefficient for this purpose, as Dask has to 'group up' data from the partitions? – elfinorr Jul 30 '18 at 13:31
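(A side note on the single-file question in the last comment: more recent Dask releases accept a `single_file=True` keyword on `to_csv`, which writes all partitions into one CSV without an explicit `df.compute()` first. A minimal sketch, assuming a newer Dask version:)

import dask.dataframe as dd

df = dd.read_csv('path/to/file', dtype=some_dtypes)
# some operations...
# single_file=True concatenates all partitions into a single output file
df.to_csv("path/to/out.csv", single_file=True)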

1 Answer


Yes, you can achieve this. Use the optional keyword `compute=False` to `to_csv` to make a lazy version of the write-to-disk process, together with `df.size`, which is like `len()` but is also computed lazily.

import dask

# build the write tasks lazily instead of triggering computation now
futs = df.to_csv("path/to/out/*", compute=False)
# run the writes and the lazy size in a single pass over the data
_, l = dask.compute(futs, df.size)

This lets Dask notice the work shared between writing and computing the length, so it does not have to read the data twice.
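Note that in Dask, as in pandas, `df.size` is rows × columns rather than the row count, which would explain the roughly six-fold difference mentioned in the update above. To get the row count itself while still sharing the work, `df.shape[0]` is another lazily evaluated option; a minimal sketch along the same lines, assuming the same `df`:

import dask

# df.shape[0] is a lazy row count, so it can be computed in the same
# pass as the delayed to_csv writes
futs = df.to_csv("path/to/out/*", compute=False)
_, n_rows = dask.compute(futs, df.shape[0])
print(n_rows)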

mdurant