Is Dask suitable for reading large CSV files in parallel and splitting them into multiple smaller files?
2 Answers
Hi Nutsa Nazgaide, and welcome to SO. First of all, I'd suggest you read about how-to-ask and MCVE. Your question is good enough, but it would be great to provide a sample of your original dataframe. I'm going to generate a basic dataframe, but the logic shouldn't be too different in your case, as you just need to consider location.
Generate dataframe
import dask.dataframe as dd
import numpy as np
import pandas as pd
import string

# Build a sample dataframe with a "member" column drawn from the alphabet
# and a column of random values, then write it to CSV.
letters = list(string.ascii_lowercase)
N = int(1e6)
df = pd.DataFrame({"member": np.random.choice(letters, N),
                   "values": np.random.rand(N)})
df.to_csv("file.csv", index=False)
One parquet file (folder) per member
If you're happy to have the output as parquet, you can just use the option partition_on:
df = dd.read_csv("file.csv")
df.to_parquet("output", partition_on="member")
If you then really need csv, you can convert back to that format, as shown in the sketch below. Still, I strongly suggest you move your data to parquet.
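A minimal sketch of the conversion back to csv, assuming you want one CSV per member and that the "output" folder comes from the to_parquet call above (reading everything back through dask is just one possible approach):

import dask.dataframe as dd

# Read the partitioned parquet back; "member" is restored from the partition folders.
ddf = dd.read_parquet("output")

# Write one CSV per member value (fine for a modest number of distinct members).
for member in ddf["member"].unique().compute():
    ddf[ddf["member"] == member].compute().to_csv(f"{member}.csv", index=False)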

rpanai
Yes, Dask can read large CSV files. It will split them into chunks:
df = dd.read_csv("/path/to/myfile.csv")
Then, when saving, Dask always saves CSV data to multiple files:
df.to_csv("/output/path/*.csv")
See the read_csv and to_csv docstrings for much more information about this.
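As a hedged illustration of two commonly used options (the 64 MB figure and the "part-" prefix are just examples), read_csv's blocksize controls how large each chunk is, and a "*" in the to_csv path is replaced by the partition number:

import dask.dataframe as dd

# Roughly 64 MB of CSV text per partition (and hence per output file).
df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")

# "*" is replaced by each partition's number, e.g. part-0.csv, part-1.csv, ...
df.to_csv("/output/path/part-*.csv", index=False)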

MRocklin