
I have some Python (3.8) code that does the following:

  1. Walks directory and subdirectories of a given path
  2. Finds all .csv files
  3. Finds all .csv files with 'Pct' in filename
  4. Joins path and file
  5. Reads CSV
  6. Adds filename to df
  7. Concatenates all dfs together

The code below works, but takes a long time (15 minutes) to ingest all the CSVs; there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.

My current working code is below:

import os
import fnmatch

import pandas as pd

start_directory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this
df_result = None

for path, dirs, files in os.walk(start_directory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        if 'Pct' in file:  # keep only filenames containing 'Pct'
            full_name = os.path.join(path, file)  # build the full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file into df_tmp
            df_tmp['file'] = os.path.basename(file)  # add the file name as a column
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)

Inspired by this post (glob to find files recursively) and this post (how to speed up importing CSVs), I have tried to reduce the time it takes to ingest all the data, but I can't figure out how to integrate a filter for only filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.

What I have tried below (incomplete):

%%time

import glob
import pandas as pd

df = pd.concat(
    [
        pd.read_csv(f, header=None)
        for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)
    ],
    axis='index', ignore_index=True
)
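
For reference, a minimal sketch (not part of the original attempt) of how the 'Pct' filter and the filename column could be folded into this pattern, using os.path.basename and pandas' assign:

import glob
import os

import pandas as pd

pattern = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv'

df = pd.concat(
    [
        # add the file name as a 'file' column while reading
        pd.read_csv(f, header=None).assign(file=os.path.basename(f))
        for f in glob.glob(pattern, recursive=True)
        if 'Pct' in os.path.basename(f)  # keep only 'Pct' files
    ],
    axis='index', ignore_index=True
)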

Question

Is there any way that I can reduce the time to read and ingest the CSVs in my code above?

Thanks!

  • If the bottleneck is the reading of the CSVs, and you are not reading from a single spinning HDD, then you could parallelize reading with [threads](https://docs.python.org/3/library/threading.html). Or [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) if the bottleneck is the parsing of the CSVs (unlikely). – C14L Nov 25 '21 at 08:41
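
A rough sketch of the threaded variant suggested in the comment above (not from the original post; the worker count and paths are assumptions) that reads the filtered files in parallel and then concatenates the results:

import glob
import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_one(path):
    # read a single CSV and tag it with its file name
    return pd.read_csv(path, header=None).assign(file=os.path.basename(path))

paths = [
    p
    for p in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)
    if 'Pct' in os.path.basename(p)
]

with ThreadPoolExecutor(max_workers=8) as pool:  # max_workers is a guess; tune it
    frames = list(pool.map(read_one, paths))

df = pd.concat(frames, axis='index', ignore_index=True)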

1 Answer


Check out the following solution. It assumes the system's open-file limit is high enough, because it streams every file one by one but has to open each of them to read the headers. In cases where the files have different columns, you will get the superset of them in the resulting file:

import os

from convtools import conversion as c
from convtools.contrib.tables import Table

files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_directory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)

table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's header
    if table is None:
        table = table_
    else:
        table.chain(table_)

# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict)  # or list, or tuple

# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* can only be used once, because Table
# cannot assume the incoming data stream can be read twice

If you are sure that all the files have the same columns (only one file is opened at a time):

Edited to add the file column:

def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)

# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
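
For example (a usage sketch, not part of the original answer), the resulting iterable of dicts could be loaded into pandas like this:

import pandas as pd

# materialize the generator of dicts and build a single DataFrame
df = pd.DataFrame(list(concat_files(files)))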

P.S. Of course you can replace Table.from_csv with a standard/other reader, but this one adapts to the file, so it is generally faster on large files.
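
As a rough illustration of that "standard reader" alternative (an assumption, not the answer's code; it uses the csv module and yields plain lists, since the question's files have no header row):

import csv
import os

def concat_files_stdlib(files):
    # same idea as concat_files, but with the standard csv module
    for file in files:
        with open(file, newline='') as fh:
            for row in csv.reader(fh):
                yield row + [os.path.basename(file)]  # append the file name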

westandskif
  • Thank you! All files have exactly the same columns (there are no column headers). How can I best add the full filename (it contains the date) to the dataframe as the CSV is read, and will this work with your second suggestion? – Cairan Van Rooyen Nov 24 '21 at 16:28
  • @CairanVanRooyen, please see above -- I've updated the second one. For more info: https://convtools.readthedocs.io/en/latest/tables.html – westandskif Nov 24 '21 at 16:47
  • @CairanVanRooyen let me know if it worked any better for you than the initial solution, just curious – westandskif Nov 24 '21 at 17:48
  • I cannot get your suggested code to run. I am getting "AssertionError: move_rows called the 2nd time" at the line "table.chain(table_)" – Cairan Van Rooyen Nov 25 '21 at 09:11
  • @CairanVanRooyen got it! Things like `table.into_iter_rows(dict)` or `table.into_csv("concatenated.csv")` can only be done once, because the table works with a data stream and cannot assume it is allowed to read it twice, so choose one of them – westandskif Nov 25 '21 at 10:04