
Say I have a Dask dataframe. I'd like to do some operations on it, then save it to CSV and print its length.

As I understand it, the following code will make Dask compute `df` twice, am I right?

df = dd.read_csv('path/to/file', dtype=some_dtypes)
#some operations...
df.to_csv("path/to/out/*")
print(len(df))

Is it possible to avoid computing twice?

upd. That's what happens when I use the solution by @mdurant: [screenshot of the computed result]

but there are actually almost 6 times fewer rows


elfinorr
  • Calculating the length of a lazy dataframe efficiently requires some work. See [this question](https://stackoverflow.com/questions/41902069/slow-len-function-on-dask-distributed-dataframe). Or try bringing into memory once, i.e. `df = df.compute(), df.to_csv(...), print(len(df))`. – jpp Jul 30 '18 at 12:54
  • Thank you for your answer. I've read that topic, it's about a slightly different thing. Well, I get it, I either bring it into memory or compute it a second time. – elfinorr Jul 30 '18 at 13:15
  • Yeh I'm not sure if `df.to_csv` is lazy or not when used with `dask`. If *not*, then you might as well read into memory before you use `to_csv`. – jpp Jul 30 '18 at 13:18
  • `dd.to_csv` starts computations, as far as I know. I thought about it, but I'm not sure. Say I have data split into 20 partitions and I'd like to save it as 1 CSV file. If I do `df.compute().to_csv(...)`, it will give me 1 file, but wouldn't that be inefficient for this purpose, as Dask has to 'group up' data from the partitions? – elfinorr Jul 30 '18 at 13:31
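(A side note on the single-file question in the last comment: more recent Dask releases accept a `single_file=True` keyword on `to_csv`, which writes all partitions into one CSV without an explicit `df.compute()` first. A minimal sketch, assuming a newer Dask version:)

import dask.dataframe as dd

df = dd.read_csv('path/to/file', dtype=some_dtypes)
# some operations...
# single_file=True concatenates all partitions into a single output file
df.to_csv("path/to/out.csv", single_file=True)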

1 Answer


Yes, you can achieve this. Use the optional keyword `compute=False` to `to_csv` to make a lazy version of the write-to-disk process, together with `df.size`, which is like `len()` but is also computed lazily.

import dask

# build the write tasks lazily instead of triggering computation now
futs = df.to_csv("path/to/out/*", compute=False)
# run the writes and the lazy size in a single pass over the data
_, l = dask.compute(futs, df.size)

This lets Dask notice the work shared between writing and computing the length, so it does not have to read the data twice.
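Note that in Dask, as in pandas, `df.size` is rows × columns rather than the row count, which would explain the roughly six-fold difference mentioned in the update above. To get the row count itself while still sharing the work, `df.shape[0]` is another lazily evaluated option; a minimal sketch along the same lines, assuming the same `df`:

import dask

# df.shape[0] is a lazy row count, so it can be computed in the same
# pass as the delayed to_csv writes
futs = df.to_csv("path/to/out/*", compute=False)
_, n_rows = dask.compute(futs, df.shape[0])
print(n_rows)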

mdurant