
Is Dask proper to read large csv files in parallel and split them into multiple smaller files?


2 Answers


Hi Nutsa Nazgaide, and welcome to SO. First of all, I'd suggest you read about how-to-ask and mcve. Your question is good, but it would be even better with a sample of your original dataframe. I'm going to generate a basic dataframe here, but the logic shouldn't be too different in your case, as you just need to consider location instead.

Generate a sample dataframe

import dask.dataframe as dd
import numpy as np
import pandas as pd
import string

letters = list(string.ascii_lowercase)

# One million rows: a random letter as "member" and a random float as "values"
N = int(1e6)
df = pd.DataFrame({"member": np.random.choice(letters, N),
                   "values": np.random.rand(N)})

df.to_csv("file.csv", index=False)

One parquet file (folder) per member

If you're happy to have the output as parquet, you can just use the partition_on option:

df = dd.read_csv("file.csv")
# Writes one folder per distinct value of "member" under output/
df.to_parquet("output", partition_on="member")
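If it helps, you can read a single member's data back from the partitioned dataset; the filters argument below assumes a parquet engine such as pyarrow is installed, and the value "a" is just one of the letters from the toy dataframe above.

import dask.dataframe as dd

# Read back only the rows for member "a"; with partition_on,
# "member" typically comes back as a categorical column.
df_a = dd.read_parquet("output", filters=[("member", "==", "a")])
print(df_a.compute().head())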

If you then really need csv, you can convert back to that format. I strongly suggest you move your data to parquet, though.
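As a rough sketch of that conversion (assuming the toy letters above as member values and pyarrow as the parquet engine), you could write one CSV per member like this:

import dask.dataframe as dd
import string

# Write one CSV file per member by filtering the partitioned parquet dataset.
# part.compute() pulls a single member into pandas, which is fine here
# because each member is only a small slice of the data.
for letter in string.ascii_lowercase:
    part = dd.read_parquet("output", filters=[("member", "==", letter)])
    part.compute().to_csv(f"member_{letter}.csv", index=False)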

rpanai

Yes, Dask can read large CSV files. It will split them into chunks:

df = dd.read_csv("/path/to/myfile.csv")
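If the default partitioning is too coarse or too fine for your machine, you can tune it with the blocksize argument; the 64 MB value below is just an example.

# Each ~64 MB block of the CSV becomes one partition of the Dask dataframe
df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")
print(df.npartitions)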

Then, when saving, Dask always saves CSV data to multiple files:

df.to_csv("/output/path/*.csv")
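If you want more control over the output file names, to_csv also accepts a name_function that replaces the asterisk in the path; the zero-padded pattern below is just an illustration.

# Produces part-000.csv, part-001.csv, ... instead of 0.csv, 1.csv, ...
df.to_csv("/output/path/part-*.csv", name_function=lambda i: f"{i:03d}")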

See the read_csv and to_csv docstrings for much more information about this.

MRocklin