I'm trying to use get_dummies via dask, but it does not transform my variable, nor does it error out:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical)
>>> daskDataDummies.head()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
>>> daskDataDummies.compute()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
5      F
6      M
7      F
8      M
9      F
>>>
The pandas equivalent (run in a new terminal just in case) is:
>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
   gender_F  gender_M
0       0.0       1.0
1       0.0       0.0
2       0.0       0.0
3       1.0       0.0
4       0.0       0.0
>>>
My understanding of this resolved issue is that dd.get_dummies should work on a dask dataframe directly. Or does the data have to be pulled into pandas first? If so, that defeats the purpose of using dask, since my datasets (~500GB) won't fit into a pandas dataframe. Am I misreading this? TIA.