I am working on a dataset with 5.5 million rows for a Kaggle competition. Reading the .csv and processing it in Pandas takes hours, so I switched to Dask. Dask is fast, but I keep running into errors.
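For context, this is roughly how the DataFrame is created. It is only a sketch; the file name and the plain read_csv call with default options are placeholders, not my exact code:

import dask.dataframe as dd

# read the competition csv lazily as a Dask DataFrame (file name is a placeholder)
df = dd.read_csv('train.csv')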
This is a snippet of the processing code:
# drop the coordinate and datetime columns
df = df.drop(['dropoff_latitude', 'dropoff_longitude', 'pickup_latitude',
              'pickup_longitude', 'pickup_datetime'], axis=1)
# one-hot-encode the categorical columns
df = dd.get_dummies(df.categorize())
# split into train and test and export each as csv
test_df = df[df['fare_amount'] == -9999]
train_df = df[df['fare_amount'] != -9999]
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
When I run the lines
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
I get the error:
ValueError: The columns in the computed data do not match the columns
in the provided metadata
What could cause this, and how can I fix it?

N.B. This is my first time using Dask.