
Is Dask proper to read large csv files in parallel and split them into multiple smaller files?


2 Answers


Hi Nutsa Nazgaide, and welcome to SO. First of all, I'd suggest you read about how-to-ask and mcve. Your question is good, but it would be even better with a sample of your original dataframe. I'm going to generate a basic dataframe here, but the logic shouldn't be too different in your case, as you just need to consider location instead.

Generate a sample dataframe

import dask.dataframe as dd
import numpy as np
import pandas as pd
import string

letters = list(string.ascii_lowercase)

# One million rows: a random letter as "member" and a random float as "values"
N = int(1e6)
df = pd.DataFrame({"member": np.random.choice(letters, N),
                   "values": np.random.rand(N)})

df.to_csv("file.csv", index=False)

One parquet file (folder) per member

If you're happy to have the output as parquet, you can just use the partition_on option:

df = dd.read_csv("file.csv")
# Writes one folder per distinct value of "member" under output/
df.to_parquet("output", partition_on="member")
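If it helps, you can read a single member's data back from the partitioned dataset; the filters argument below assumes a parquet engine such as pyarrow is installed, and the value "a" is just one of the letters from the toy dataframe above.

import dask.dataframe as dd

# Read back only the rows for member "a"; with partition_on,
# "member" typically comes back as a categorical column.
df_a = dd.read_parquet("output", filters=[("member", "==", "a")])
print(df_a.compute().head())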

If you then really need csv, you can convert back to that format. I strongly suggest you move your data to parquet, though.
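As a rough sketch of that conversion (assuming the toy letters above as member values and pyarrow as the parquet engine), you could write one CSV per member like this:

import dask.dataframe as dd
import string

# Write one CSV file per member by filtering the partitioned parquet dataset.
# part.compute() pulls a single member into pandas, which is fine here
# because each member is only a small slice of the data.
for letter in string.ascii_lowercase:
    part = dd.read_parquet("output", filters=[("member", "==", letter)])
    part.compute().to_csv(f"member_{letter}.csv", index=False)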

rpanai

Yes, Dask can read large CSV files. It will split them into chunks:

df = dd.read_csv("/path/to/myfile.csv")
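If the default partitioning is too coarse or too fine for your machine, you can tune it with the blocksize argument; the 64 MB value below is just an example.

# Each ~64 MB block of the CSV becomes one partition of the Dask dataframe
df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")
print(df.npartitions)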

Then, when saving, Dask always saves CSV data to multiple files:

df.to_csv("/output/path/*.csv")
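If you want more control over the output file names, to_csv also accepts a name_function that replaces the asterisk in the path; the zero-padded pattern below is just an illustration.

# Produces part-000.csv, part-001.csv, ... instead of 0.csv, 1.csv, ...
df.to_csv("/output/path/part-*.csv", name_function=lambda i: f"{i:03d}")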

See the read_csv and to_csv docstrings for much more information about this.

MRocklin