I am trying to merge a number of large data sets using Dask in Python to avoid memory issues when loading them, and I want to save the merged result as a .csv file. The task proves harder than imagined:
I put together a toy example with just two data sets. The code I then use is the following:
import dask.dataframe as dd
import glob
import os

os.chdir('C:/Users/Me/Working directory')

# collect all the .txt files in the working directory
file_list = glob.glob("*.txt")

# lazily read each semicolon-separated file into a Dask DataFrame
dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)

# concatenate everything into a single Dask DataFrame
dd_all = dd.concat(dfs)
If I use dd_all.to_csv('*.csv'), it simply writes the two original data sets back out as separate files.
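From the docs, I gather that the * in the filename is replaced by each partition's number, so every partition lands in its own file. A minimal sketch of what I mean (the small two-partition frame here is just a stand-in for my two inputs):

import dask.dataframe as dd
import pandas as pd

# hypothetical two-partition frame standing in for the two input files
ddf = dd.from_pandas(pd.DataFrame({'a': range(4)}), npartitions=2)

# the '*' is substituted with the partition index,
# producing out-0.csv and out-1.csv -- one file per partition
ddf.to_csv('out-*.csv')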
If I use dd_all.to_csv('name.csv'), I get an error saying the file does not exist:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part'
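If it matters, newer Dask releases have a single_file=True flag on to_csv that writes all partitions into one file on a local filesystem; a minimal sketch, assuming the dd_all frame from above:

# write all partitions into a single CSV file
# (single_file requires a reasonably recent Dask version)
dd_all.to_csv('name.csv', single_file=True, index=False)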
Using dd_all.compute(), I can check that the merged data set has been successfully created.
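Since dd_all.compute() already returns the merged result as a pandas DataFrame, I suppose one workaround (assuming the result fits in memory) is to do the final write with pandas itself; 'merged.csv' below is just a placeholder name:

# materialise the merged frame in memory, then write one file with pandas
dd_all.compute().to_csv('merged.csv', sep=';', index=False)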