I have a dataframe df that I would like to partition into sub-dataframes and apply the function find_root to each of them. The function only takes the columns id and parent_id as input. Then I would like to concatenate the resulting dataframes. Because my dataframe is huge (over 4 million rows), I would like to use Dask. However, I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: []
Missing: [2]
Could you please elaborate on how to solve this error? Here is my full code:
import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd
client = Client(n_workers=2, threads_per_worker=1, processes=False, memory_limit='4GB')
def find_root(df):
    # Build a directed graph from the parent_id -> id edge list
    g = nx.from_pandas_edgelist(df, source='parent_id', target='id', create_using=nx.DiGraph())
    # Roots are the nodes with no incoming edges
    roots = {n for n, d in g.in_degree() if d == 0}
    tmp = {}
    for r in roots:
        # Collect every node reachable from this root
        tree = nx.dfs_tree(g, r)
        tmp[r] = list(tree.nodes)
    tmp = pd.DataFrame.from_dict(tmp, orient='index').T
    tmp = tmp.melt(value_name='node', var_name='root').dropna()
    return tmp
path = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df.csv'
df = dd.read_csv(path, header=0)
df = df[['id', 'created_utc', 'ups', 'link_id', 'author', 'body', 'parent_id']]
# Keep only the part after the underscore prefix so that parent_id matches id
df['parent_id'] = df['parent_id'].str.split('_', expand=True, n=2)[1]
df['link_id'] = df['link_id'].str.split('_', expand=True, n=2)[1]
result = df.groupby('link_id').apply(find_root, meta=object)
computed_result = result.compute()  # the ValueError above is raised here, when the graph is actually executed
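For reference, find_root itself seems to work fine on a small plain-pandas frame; its output always has the two columns root and node (the ids below are made up just to illustrate):

# Small plain-pandas check of what find_root returns (made-up ids)
sample = pd.DataFrame({
    'id':        ['b', 'c', 'd'],
    'parent_id': ['a', 'a', 'b'],
})
print(find_root(sample))
#   root node
# 0    a    a
# 1    a    b
# 2    a    d
# 3    a    c
# (row order may differ, but the columns are always ['root', 'node'])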
Update: I added dtype to dd.read_csv:
df = dd.read_csv(path, header=0, dtype={'id': 'str', 'parent_id': 'str', 'link_id': 'str'})
but the problem persists.
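From reading the Dask docs, my current guess is that dtype only describes the input columns coming out of read_csv, while meta has to describe the output of the applied function, i.e. a DataFrame with the two object columns root and node rather than meta=object. Something like the sketch below is what I would try next, but I have not verified that it is the right fix:

# Untested guess: describe the *output* of find_root instead of meta=object
meta = {'root': 'object', 'node': 'object'}
result = df.groupby('link_id').apply(find_root, meta=meta)
computed_result = result.compute()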