I've read in around 15 CSV files:
import dask.dataframe as dd

df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
                 storage_options={'token': fs.session.credentials})
Then I persisted the DataFrame (it uses 7.33 GB of memory):
df = df.persist()
I set a new index because I want my group by on that field to be as efficient as possible:
df = df.set_index('column_a').persist()
Now I have 181 divisions and 180 partitions. To see how fast my group by runs, I tried a custom apply function that just prints each group's DataFrame:
grouped_by_index = df.groupby('column_a').apply(lambda n: print(n)).compute()
That printed a DataFrame with the correct columns, but every value is either "1", "foo" or "True". For example:
      column_b  column_c column_d  column_e  column_f  column_g  \
index
a          foo         1      foo         1         1         1
I also get the warning:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  """Entry point for launching an IPython kernel.
What is going on here?