I could not get the dd.DataFrame(...) call below to work, because I was passing dictionary as a plain Python dictionary, which is not the right input for that constructor:
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
ddf = dd.DataFrame(dictionary, divisions=[2],
                   meta=pd.DataFrame(dictionary, index=list(range(2))),
                   name='ddf')
print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right
So, I had to create some dummy data and use that in my approach to creating a dask dataframe.
Generate dummy data in a dictionary
d = {0: [388]*2,
     1: [387]*2,
     2: [386]*2,
     3: [385]*2,
     5: [384]*2,
     '2012-06-13': [389]*2,
     '2012-06-14': [389]*2}
Create Dask dataframe from the dictionary via a dask bag
- this means you must first use pandas to convert the dictionary to a pandas DataFrame and then use .to_dict(..., orient='records') to get the sequence (a list of row-wise dictionaries) you need to create a dask bag
So, here is how I created the required sequence:
d = pd.DataFrame(d, index=list(range(2))).to_dict('records')
print(d)
[{0: 388,
1: 387,
2: 386,
3: 385,
5: 384,
'2012-06-13': 389,
'2012-06-14': 389},
{0: 388,
1: 387,
2: 386,
3: 385,
5: 384,
'2012-06-13': 389,
'2012-06-14': 389}]
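To make the shape of that sequence concrete, here is a pure-Python sketch of the same column-wise to row-wise conversion. It is only an illustration: columns_to_records is a hypothetical helper (not part of pandas or dask), it assumes all columns have the same length, and it uses a shortened version of the dummy data.

```python
def columns_to_records(columns):
    """Turn {column: [values, ...]} into [{column: value, ...}, ...], one dict per row."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    # Build one dictionary per row by picking the i-th value of every column.
    return [{name: values[i] for name, values in columns.items()}
            for i in range(n_rows)]


# Shortened version of the dummy data (mixed integer and string column labels).
columns = {0: [388] * 2, 1: [387] * 2, '2012-06-13': [389] * 2}
records = columns_to_records(columns)
print(records)
```

Each element of the resulting list is one row, which is the form db.from_sequence expects when the bag is later turned into a dataframe.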
Now I use the list of dictionaries to create a dask bag
dask_bag = db.from_sequence(d, npartitions=2)
print(dask_bag)
dask.bag<from_se..., npartitions=2>
Convert dask bag to dask dataframe
df = dask_bag.to_dataframe()
Rename columns in dask dataframe
cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)
print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks
Compute the dask dataframe (will not get output of () this time!)
print(df.compute())  # df is the dask dataframe built from the bag and renamed
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389
Notes:
- Also from the .rename documentation: inplace is not supported.
- I think your renaming dictionary contained strings '0', '1', etc. for the column names that were integers. It could be the case for your data (as is the case with the dummy data here) that the dictionary keys should just have been the integers 0, 1, etc.
- Per the dask docs, I used this approach based on a 1-1 renaming dictionary, and column names not included in the renaming dict will be left unchanged
  - this means you don't need to pass in column names that you do not need to be renamed
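The two renaming points above can be checked quickly in plain pandas. This is a sketch that assumes dask's rename follows the pandas behaviour for a renaming dictionary; the frame below is a shortened, hypothetical version of the dummy data.

```python
import pandas as pd

# Shortened dummy data with integer labels 0 and 1 plus one string label.
pdf = pd.DataFrame({0: [388, 388], 1: [387, 387], '2012-06-13': [389, 389]})

# String keys '0' and '1' do not match the integer labels, so nothing is renamed.
unchanged = pdf.rename(columns={'0': 'Datetime', '1': 'col1'})

# Integer keys match; '2012-06-13' is not in the mapping and is left as it is.
renamed = pdf.rename(columns={0: 'Datetime', 1: 'col1'})

print(list(unchanged.columns))
print(list(renamed.columns))
```

No error is raised for the unmatched string keys, which is why a mismatch between '0' and 0 can go unnoticed.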