I have a dataframe df that I would like to partition into sub-dataframes and apply the function find_root to each of them. The function only takes the columns id and parent_id as input. Then I would like to concatenate the resulting dataframes. Because my dataframe is huge (over 4 million rows), I would like to use Dask. However, I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: []
Missing: [2]
Could you please elaborate on how to solve this error? Here is my full code:
import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd
client = Client(n_workers=2, threads_per_worker=1, processes=False, memory_limit='4GB')
def find_root(df):
    # Build a directed graph from the parent_id -> id edge list
    g = nx.from_pandas_edgelist(df, source='parent_id', target='id', create_using=nx.DiGraph())
    # Roots are the nodes with no incoming edges
    roots = {n for n, d in g.in_degree() if d == 0}
    tmp = {}
    for r in roots:
        # Collect every node reachable from this root
        tree = nx.dfs_tree(g, r)
        tmp[r] = list(tree.nodes)
    tmp = pd.DataFrame.from_dict(tmp, orient='index').T
    tmp = tmp.melt(value_name='node', var_name='root').dropna()
    return tmp
path = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df.csv'
df = dd.read_csv(path, header=0)
df = df[['id', 'created_utc', 'ups', 'link_id', 'author', 'body', 'parent_id']]
# Keep only the part after the underscore prefix so that parent_id matches id
df['parent_id'] = df['parent_id'].str.split('_', expand=True, n=2)[1]
df['link_id'] = df['link_id'].str.split('_', expand=True, n=2)[1]
result = df.groupby('link_id').apply(find_root, meta=object)
computed_result = result.compute()  # the ValueError above is raised here, when the graph is actually executed
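For reference, find_root itself seems to work fine on a small plain-pandas frame; its output always has the two columns root and node (the ids below are made up just to illustrate):

# Small plain-pandas check of what find_root returns (made-up ids)
sample = pd.DataFrame({
    'id':        ['b', 'c', 'd'],
    'parent_id': ['a', 'a', 'b'],
})
print(find_root(sample))
#   root node
# 0    a    a
# 1    a    b
# 2    a    d
# 3    a    c
# (row order may differ, but the columns are always ['root', 'node'])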
Update: I added dtype to dd.read_csv:
df = dd.read_csv(path, header=0, dtype={'id': 'str', 'parent_id': 'str', 'link_id': 'str'})
but the problem persists.
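From reading the Dask docs, my current guess is that dtype only describes the input columns coming out of read_csv, while meta has to describe the output of the applied function, i.e. a DataFrame with the two object columns root and node rather than meta=object. Something like the sketch below is what I would try next, but I have not verified that it is the right fix:

# Untested guess: describe the *output* of find_root instead of meta=object
meta = {'root': 'object', 'node': 'object'}
result = df.groupby('link_id').apply(find_root, meta=meta)
computed_result = result.compute()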