Expectation: I would expect that, when I partition a given dataframe, the rows will be distributed roughly evenly across the partitions. I would then expect that, when I write the dataframe to csv, the resulting n csvs (in this case, 10) would similarly be of roughly equal length.
Reality: When I run the code below, instead of a somewhat even distribution of rows, all rows end up in export_results-0.csv and the remaining 9 csvs are empty.
Question: Are there additional configurations that I need to set to ensure that rows are distributed amongst all the partitions?
from dask.distributed import Client
import dask.dataframe as dd
import numpy as np
import pandas as pd

client = Client('tcp://10.0.0.60:8786')

# 1000-row pandas dataframe, converted to a dask dataframe with 100 partitions
df = pd.DataFrame({'geom': np.random.random(1000)}, index=np.arange(1000))
sd = dd.from_pandas(df, npartitions=100)

# cross join: merge the dataframe with itself on a constant key, then drop the key
tall = dd.merge(sd.assign(key=0), sd.assign(key=0), on='key').drop('key', axis=1)

# write one csv per partition; to_csv computes immediately by default, so no .compute() is needed
tall.to_csv('export_results-*.csv')
About the above code: I create a 1000-row dataframe and merge it with itself to produce a 1,000,000-row dataframe (the goal is to eventually generate a thin, tall table that holds the distance from any one geometry to any other geometry in a list of 100k+).
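For reference, here is a minimal diagnostic sketch (assuming the same tall dataframe as above) that counts the rows in each partition, which is one way to confirm whether the merge really concentrated everything into a single partition; the repartition call at the end is only an illustration of one possible workaround I considered, not a confirmed fix.

# minimal sketch, assuming the same `tall` dataframe built above
sizes = tall.map_partitions(len).compute()  # number of rows in each partition
print(sizes)

# illustrative workaround (not a confirmed fix): explicitly repartition
# before writing so the rows are spread over 10 output files
tall = tall.repartition(npartitions=10)
tall.to_csv('export_results-*.csv')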