2

I want to concatenate 10,000 single-line CSV files into one 10,000-row CSV file, ideally using Python. The directory contains only the CSV files. I have tried doing this with Pandas, however it is way too slow; it seems like it'll take about an hour. Is there a faster way of doing this? My current code is:

import glob
import pandas as pd

files = glob.glob("PATH/*.csv")
df = pd.concat(map(pd.read_csv, files))
  • 2
    You could try using python polars. There is a massive boost in reading and writing csv files. – Ethan Small Mar 16 '23 at 12:17
  • 2
    Isn't this doable from command line using a simple `for` loop and `type >> outfile` ? (even in Windows). – MyICQ Mar 16 '23 at 12:18
  • You can try to use Modin, which is a drop-in replacement for Pandas (https://github.com/modin-project/modin) – Louis-Justin Tallot Mar 16 '23 at 12:35
  • Do you necessarily have to / want to use python for that? Otherwise, a simple bash or cmd script would do the job a lot quicker. What OS are you using? Even with python, as long as the files are the same format, consider just reading and appending files line by line, using something like open(), f.read() and f.write(). – Torge Rosendahl Mar 16 '23 at 12:38
  • 3
    Have you tried reading it as a line of text and simply appending it to the output file? – Thomas Weller Mar 16 '23 at 12:41
  • 1
    @physicist1911 please provide one example csv file. Does it have a header line/column names? Additionally, do you need the combined file as a pandas frame or is your goal to just combine the files into one? – Torge Rosendahl Mar 16 '23 at 13:13
  • 1
    @TorgeRosendahl seems that the csv files do have headers. But there are a few potential elephants in the room: 1) what to do if file number 9876 has a different number of columns? 2) detecting the column data types. – darked89 Mar 16 '23 at 17:22
  • 1
    @physicist1911 just in case there is something "wrong" with a small subset of your input CSVs, you may consider pre-processing them in batches of, say, 100 or 1000 to check that the CSV structure/data types in the columns match. – darked89 Mar 16 '23 at 17:30
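
Picking up on the last comment, a quick pre-check that all files share the same header could look roughly like this (a sketch only; it compares header lines, not column data types, and reuses the PATH pattern from the question):

import glob

files = glob.glob("PATH/*.csv")
with open(files[0]) as f:
    reference_header = f.readline()

# Report any file whose first line differs from the first file's header
for name in files[1:]:
    with open(name) as f:
        if f.readline() != reference_header:
            print(f"Header mismatch: {name}")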

6 Answers

3

Here is a simple, run-of-the-mill solution which reads one line at a time into memory.

import glob

inputs = glob.glob("*.csv")
with open("output.csv", "w") as output:
  first = True
  for file in inputs:
    with open(file, "r") as inputfile:
      for no, line in enumerate(inputfile, 1):
        if no == 1 and not first:
          continue
        first = False
        output.write(line)

This takes care to drop the first line from each input file except the first, on the assumption that you have CSV files with a single-line header.

(Do yourself a favor and stop using headers in CSV files, though. Then the code can be simplified somewhat.)

This makes no attempt to verify that the CSV files have identical structure. You should make sure that they have the same number of columns and the same kind of data in the same columns.

There could also be problems if your data mixes Windows and Unix line feeds. A simple tweak is to open the files in binary mode, even though they are technically text files.
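
For illustration, without headers and with the files opened in binary mode as suggested above, the whole loop reduces to roughly this (a sketch, assuming every file ends with a newline):

import glob
import shutil

# Concatenate headerless CSV files byte for byte; binary mode passes
# CRLF/LF line endings through untouched. Assumes each file ends with a newline.
with open("output.csv", "wb") as output:
    for file in glob.glob("*.csv"):
        with open(file, "rb") as inputfile:
            shutil.copyfileobj(inputfile, output)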

tripleee
1

Use the csv module to skip the headers and create a list of rows.

import csv
import glob
import pandas as pd

rows = []
for filename in glob.glob("*.csv"):
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)  # read (and skip) this file's header row
        rows.extend(reader)     # collect the remaining data rows

df = pd.DataFrame(rows, columns=columns)
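
If the goal is just one combined CSV file on disk rather than a DataFrame, the same rows can be written straight back out with csv.writer (a sketch along the same lines, keeping only the first file's header):

import csv
import glob

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for filename in glob.glob("*.csv"):
        with open(filename, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)          # header of this file
            if not header_written:
                writer.writerow(header)    # keep the first header only
                header_written = True
            writer.writerows(reader)       # append the data rows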
jqurious
1

Try using polars, which describes itself as a "Lightning-fast DataFrame library for Rust and Python". For my use case, with 110 files of ~1 MB each, polars is 21x faster:

%%timeit
#Polars
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pl = pl.read_csv(files[0], sep=";", skip_rows=1, infer_schema_length=0)
for file in files[1:]:
    df_pl = pl.concat([df_pl, pl.read_csv(file, sep=";", skip_rows=1, 
                       infer_schema_length=0)], how="diagonal")

3.2 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit 
#Pandas
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pd = pd.read_csv(files[0], delimiter=";")
for file in files[1:]:
    df_pd = pd.concat([df_pd, pd.read_csv(file, delimiter=";")])

1min 8s ± 784 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars has even more ways to optimize that I didn't use here, for example the lazy API (which lets the library parallelize the work) and inferring the data types correctly.

Also, try invoking the polars (or pandas) concat fewer times. It's best to put all the read files in a list and then concat them all at once. If your memory can't handle that, concatenate every few files instead. In my test, it's 9x faster:

%%timeit
#Polars adv
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pl2 = []
for file in files:
    df_pl2.append(pl.read_csv(file, sep=";", skip_rows=1, infer_schema_length=0))
df_pl2 = pl.concat(df_pl2, how="diagonal")

362 ms ± 8.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Just a little explanation of the code: pl.read_csv(file, sep=";", skip_rows=1, infer_schema_length=0)

  1. I skipped the first row because, in my files, the first line was useless and the headers were on the 2nd row. Pandas recognized that on its own.
  2. infer_schema_length=0 is bad practice; it forces polars to treat all columns as strings. I was just too lazy to supply the correct dtypes. You probably won't need it.

pl.concat(df_pl2, how="diagonal"): diagonal is important in polars when the dataframes don't have exactly the same columns (some of my files were missing some columns).
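
For reference, the lazy API mentioned above could look roughly like this; treat it as a sketch, since the exact argument name (sep vs. separator) depends on the polars version:

import glob
import polars as pl

files = glob.glob(r"U:\Negocios CETIP\*.CSV")

# Build lazy scans so polars can plan and parallelize the reads itself,
# then concatenate and materialize once at the end.
lazy_frames = [
    pl.scan_csv(file, separator=";", skip_rows=1, infer_schema_length=0)
    for file in files
]
df_lazy = pl.concat(lazy_frames, how="diagonal").collect()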

0

You can use multiprocessing:

import pandas as pd
import pathlib
import multiprocessing as mp


def read_csv(files):
    # Read one batch of files and stack their rows (default axis=0)
    return pd.concat([pd.read_csv(filename) for filename in files])


if __name__ == '__main__':
    batch = 10
    files = sorted(pathlib.Path('PATH').glob('*.csv'))
    with mp.Pool(mp.cpu_count()) as pool:
        dfs = pool.map(read_csv, (files[i:i+batch]
                                  for i in range(0, len(files), batch)))
    df = pd.concat(dfs)

Note: you can replace (files[i:i+batch] for i in range(0, len(files), batch)) with itertools.batched(files, batch) if you use Python >= 3.12
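
On older Python versions, a small helper can produce the same fixed-size batches (a minimal sketch, not part of the original code):

def batched(seq, n):
    # Yield successive chunks of n items from a sequence
    # (simplified stand-in for itertools.batched, Python >= 3.12)
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

# e.g. pool.map(read_csv, batched(files, batch))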

Corralien
0

This assumes that there is no header and that the CSV files look something like this:

<start of file>
1,test,12.5,hello there,0
<end of file>

If you specifically want a Python solution, here is a plain-text version.

import glob

files = glob.glob('*.csv')

with open('output.csv', 'a') as output:
    for fname in files:
        with open(fname, 'r') as infile:
            line = infile.readline()
            # readline() does not add a trailing newline, so append one if missing
            if not line.endswith('\n'):
                line += '\n'
            output.write(line)

Keep in mind that this really only works for single-line CSV files, because we only call readline() once per input file, i.e., we read just the first line. Note that readline() does not add a line break if the line doesn't end with one, which is why the code above appends '\n' when it is missing.

Torge Rosendahl
  • 1
    Your answer is wrong (and I don't downvote your answer...) – Corralien Mar 16 '23 at 12:54
  • 1
    @Corralien How exactly is it wrong? As long as the CSV files are literally one line each, it should actually work. – tripleee Mar 16 '23 at 13:01
  • 1
    @tripleee. You know the problem is the header... The OP uses `pd.read_csv` without `header=None`, which means there are headers in the files, right? – Corralien Mar 16 '23 at 13:03
0

EDIT: I first posted an answer using Dask. But then I saw this answer.

I think you should check it out.

KVN
  • The performance when the files are huge might be different (like, 10,000 fields per file?) – tripleee Mar 16 '23 at 12:59
  • @tripleee For merging 12,000 CSV files: Pandas took 118.99971747398376 seconds, Python built-in methods took 0.796720027923584 seconds – KVN Mar 16 '23 at 14:43