Try using polars; as its authors describe it: "Lightning-fast DataFrame library for Rust and Python". For my use case, I have 110 files of ~1 MB each, and polars is 21x faster (glob, polars as pl, and pandas as pd are imported beforehand):
%%timeit
#Polars
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pl = pl.read_csv(files[0], sep=";", skip_rows=1, infer_schema_length=0)
for file in files[1:]:
    df_pl = pl.concat([df_pl, pl.read_csv(file, sep=";", skip_rows=1,
                                           infer_schema_length=0)], how="diagonal")
3.2 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
#Pandas
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pd = pd.read_csv(files[0], delimiter=";")
for file in files[1:]:
    df_pd = pd.concat([df_pd, pd.read_csv(file, delimiter=";")])
1min 8s ± 784 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars has other ways to optimize that I didn't use. For example, the lazy API (which lets the library parallelize the work) and inferring the data types correctly.
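As a rough, untimed sketch of what the lazy route could look like (I'm assuming the same sep/skip_rows parameter names as the eager calls above; newer polars versions spell sep as separator):

import glob
import polars as pl

files = glob.glob(r"U:\Negocios CETIP\*.CSV")
# scan_csv only builds a lazy query plan; nothing is read yet
lazy_frames = [pl.scan_csv(file, sep=";", skip_rows=1, infer_schema_length=0)
               for file in files]
# collect_all executes all the plans together, so polars can parallelize the reads
df_lazy = pl.concat(pl.collect_all(lazy_frames), how="diagonal")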
Also, try invoking polars (or pandas) concat fewer times. It's best to put all the read DataFrames in a list and then concat them all at once. If your memory can't handle that, try concatenating after every few files. In my test, this is 9x faster:
%%timeit
#Polars adv
files = glob.glob(r"U:\Negocios CETIP\*.CSV")
df_pl2 = []
for file in files:
    df_pl2.append(pl.read_csv(file, sep=";", skip_rows=1, infer_schema_length=0))
df_pl2 = pl.concat(df_pl2, how="diagonal")
362 ms ± 8.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
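The same list-then-concat idea works for pandas too (a sketch I did not benchmark; repeated pd.concat in a loop re-copies the growing DataFrame on every iteration):

import glob
import pandas as pd

files = glob.glob(r"U:\Negocios CETIP\*.CSV")
# read everything first, then concatenate once
frames = [pd.read_csv(file, delimiter=";") for file in files]
df_pd = pd.concat(frames, ignore_index=True)  # ignore_index=True just resets the row index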
Just a little explanation of the code:
pl.read_csv(file, sep=";", skip_rows=1, infer_schema_length=0)
- I skipped the first row because, in my files, the first line was useless and the headers were on the 2nd row. Pandas recognized that automatically.
- infer_schema_length=0 is bad practice: it forces polars to read all columns as strings. I was just too lazy to specify the correct dtypes. You probably won't need it.
- pl.concat(df_pl2, how="diagonal"): diagonal is important in polars when the dataframes don't all have exactly the same columns (some of my files were missing some columns); see the small sketch below.
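A tiny illustration of what how="diagonal" does (the column names here are made up):

import polars as pl

# two frames whose columns only partially overlap
df_a = pl.DataFrame({"id": [1, 2], "price": [10.0, 20.0]})
df_b = pl.DataFrame({"id": [3], "volume": [500]})
# how="diagonal" aligns on column names and fills the missing cells with nulls;
# a plain vertical concat would fail because the schemas differ
combined = pl.concat([df_a, df_b], how="diagonal")
print(combined)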