
I'll try to rephrase my question:

How do I combine a dask.dataframe along with a function like zip?

Assume we have a file named "accounts.0.csv" with the following data:

id,names,amount
352,Dan,4837
387,Tim,208
42,Jerry,21
129,Patricia,284

I wrote this code:

import dask.dataframe as dd
import itertools
from dask.threaded import get


df = dd.read_csv('accounts.0.csv')

dsk = {'a': (dd.read_csv,('accounts.0.csv')),       
       'b': (itertools.repeat,(True)),       
       'res': (zip, 'a'[id],'b')       
       }

get(dsk, 'res')

This code should generate something like this:

352, True
387, True
42 , True
129, True

How can I do this?

sami
  • What are you trying to do here? What should the zip be operating over, and why do you want to do it in parallel? – Sam Mason Oct 21 '15 at 11:33
  • This is part of a bigger graph; this example just shows the problem, it doesn't have a real meaning. My question is how do you join/zip together a dask.Series and an iterator – sami Oct 21 '15 at 14:17

3 Answers


You need to "lift" (to borrow terminology from monads in Haskell) the iterator out from inside the computation. Dask builds the list of tasks before starting any computation, so you need to get at the iterator from "outside" any computation. Your call to compute gets you "outside" dask, which is why that worked.
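To see why the graph in the question can't work as written, it helps to remember that a dask-style graph is just a dict whose values are `(function, *arguments)` tuples, and nothing is called until a scheduler walks the dict. The following is a toy, stdlib-only sketch of that idea (not dask's real scheduler; the key names and lambdas are illustrative), producing the output the question asks for:

```python
# Toy model of a dask-style graph: values are (function, *arguments)
# tuples. Nothing runs until a "get" walks the graph, which is why a
# live iterator can't be threaded through from inside the dict.
def toy_get(dsk, key):
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that are themselves graph keys get resolved recursively.
        resolved = [toy_get(dsk, a) if isinstance(a, str) and a in dsk else a
                    for a in args]
        return func(*resolved)
    return task

dsk = {
    'ids': [352, 387, 42, 129],
    'flags': (lambda n: [True] * n, 4),
    'res': (lambda a, b: list(zip(a, b)), 'ids', 'flags'),
}

print(toy_get(dsk, 'res'))  # [(352, True), (387, True), (42, True), (129, True)]
```

Note that `'flags'` is a task that builds a finite list rather than an infinite `itertools.repeat` iterator: the whole graph is data until evaluation, so every task has to produce a concrete value.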

I'm not sure of a good example because what you'd do depends on what other tasks come next, but as a not very nice but minimal example:

import dask.imperative as di

arr = []
for col in df:
    # map over the values of each column, pairing each with True
    arr.append(df[col].map(lambda x: (x, True)))
task = di.value([]) + arr

This creates a list of tasks that map over the values within each series, then uses the imperative module to wrap everything in a single task; I couldn't find a nicer way of doing this, sorry!

You can then compute the task to get a list of series back, or use it in something else.

Sam Mason
  • Thanks @Sam Mason, I'm sorry for not being clear; English is not my first language. My problem is iterating over a dask.dataframe like I would with a pandas DataFrame. I edited my question, I hope it is clearer now – sami Oct 21 '15 at 20:38

Rephrase Question

I'll try to rephrase your question as the following:

How do I combine a dask.dataframe along with a custom dask graph?

df = dd.read_csv('myfile.csv')
dsk = {'x': (add, 1, 2)}

The dataframe is a high-level collection; the dask graph is more low-level. We'll have to bring one to the other's level.

Use dask imperative

We could use dask.imperative to turn the custom function call into a high-level dask object:

# dsk = {'x': (add, 1, 2)}
x = dask.do(add)(1, 2)

Then you can use dask.compute on either one or both of the objects.

x_result = dask.compute(x)
# or
df_result = dask.compute(df)
# or
x_result, df_result = dask.compute(x, df)

Use low-level dask graphs everywhere

The low-level graph and final keys for any DataFrame object are accessible from the .dask and ._keys() attributes.

from toolz import merge
graph = merge(dsk, df.dask)  # merge both graphs together
keys = ['x', df._keys()]     # final keys to compute

x_results, df_results = get(graph, keys)

df_result = df._finalize(df_results)  # turn graph outputs back to pandas dataframe
MRocklin
  • Thanks @MRocklin, I'm sorry for not being clear; English is not my first language. My problem is using a dask.dataframe with zip like I would with a pandas DataFrame. I edited my question, I hope it is clearer now – sami Oct 21 '15 at 20:34

Zip is intended for Python iterators, not Pandas or Dask DataFrames.

To implement your example above you could use the assign method:

pandas

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3]})

In [3]: df
Out[3]: 
   x
0  1
1  2
2  3

In [4]: df.assign(y=True)
Out[4]: 
   x     y
0  1  True
1  2  True
2  3  True

dask.dataframe

In [5]: import dask.dataframe as dd

In [6]: ddf = dd.from_pandas(df, npartitions=1)

In [7]: ddf.assign(y=True).compute()
Out[7]: 
   x     y
0  1  True
1  2  True
2  3  True
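If the goal really is an iterator of tuples, as in the question's expected output, note that zip works fine once you have an in-memory pandas object, because a pandas Series is iterable; with dask you would call .compute() first. A small sketch using only pandas (the column names follow the question's accounts.0.csv; the data is inlined here rather than read from the file):

```python
import itertools
import pandas as pd

# In-memory stand-in for pd.read_csv('accounts.0.csv')
df = pd.DataFrame({'id': [352, 387, 42, 129],
                   'names': ['Dan', 'Tim', 'Jerry', 'Patricia'],
                   'amount': [4837, 208, 21, 284]})

# zip consumes the Series as a plain iterator; repeat(True) is infinite,
# but zip stops at the shorter input.
pairs = list(zip(df['id'], itertools.repeat(True)))
print(pairs)  # [(352, True), (387, True), (42, True), (129, True)]
```

With a dask.dataframe you would write `list(zip(ddf['id'].compute(), itertools.repeat(True)))`, paying the cost of materializing the column first.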

Generally don't mix graphs with dataframes

Dictionary-style graphs like dsk = {...} should not be mixed with dask.dataframe objects. The dask.dataframe objects use such graphs internally, but dataframe objects should not be placed inside hand-written graphs.
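What you can put in a hand-written graph are tasks that produce and consume plain pandas objects, in the load/clean/analyze style of the custom-graphs docs the asker mentions. A minimal sketch (function and key names are illustrative, and the data is inlined instead of loaded from disk; the tiny `run` helper stands in for dask.threaded.get so the example is self-contained):

```python
import pandas as pd

# Each graph value is a (function, *args) tuple whose string arguments
# name other keys: the same convention dask's schedulers use.
def load():
    return pd.DataFrame({'id': [352, 387, 42, 129],
                         'amount': [4837, 208, 21, 284]})

def clean(df):
    # keep only rows with amount over 100
    return df[df['amount'] > 100]

def analyze(df):
    return df['amount'].sum()

dsk = {'data': (load,),
       'cleaned': (clean, 'data'),
       'total': (analyze, 'cleaned')}

# Evaluate by hand; with dask installed, dask.threaded.get(dsk, 'total')
# would compute the same graph, potentially in parallel.
def run(graph, key):
    func, *deps = graph[key]
    return func(*(run(graph, d) for d in deps))

print(run(dsk, 'total'))  # 4837 + 208 + 284 = 5329
```

The graph holds pandas.DataFrames as task results, which is exactly how the graphs inside a dask.dataframe work under the hood.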

MRocklin
  • Thanks @MRocklin, I had hoped I could use dataframes with code like that found in Dask's documentation http://dask.pydata.org/en/latest/custom-graphs.html . If I want to use this template (of load, clean, analyze, store), could I use it with a pandas.DataFrame? – sami Oct 21 '15 at 21:44
  • Does the following documentation commit answer your question? https://github.com/blaze/dask/commit/0bb7189738134e4f9596ee0142a47e6d2d37e4b3 – MRocklin Oct 21 '15 at 22:09
  • Partially; I see that you can load a pandas DataFrame. Can you use graphs like dsk = {...} to manipulate them? You wrote "don't mix graphs with dataframes". – sami Oct 21 '15 at 22:38
  • Don't mix graphs and dask.dataframes. The graphs that define dask.dataframes contain pandas.dataframes. – MRocklin Oct 21 '15 at 22:40
  • Great, thanks a lot – sami Oct 21 '15 at 22:43