I noticed a performance change in one of my scripts that uses pandas' to_csv
to write a dataset to file. Specifically, when writing a long CSV (at ~1 million rows the difference is very pronounced) with a column of type Period[D], performance is extremely poor. Removing the column, or converting it to strings with apply, restores the previous performance.
Two things changed recently in my environment: Windows 7 to Windows 10, and pandas 0.23.x (I believe) to pandas 0.24.2.
A simple example of my problem is below:
import os
import numpy as np
import pandas as pd

file_path = r'<file path here>'

# DataFrame with an int column and a Period[D] column
df_period = pd.DataFrame(data={
    "ints": np.random.randint(0, 100000000, 1000000),
    "days": pd.period_range("2012-01-01", periods=1000000, freq="D"),
})

# Same data, but with the Period column converted to strings
df_strings = df_period.copy()
df_strings["days"] = df_strings["days"].apply(lambda x: str(x))
%timeit df_strings.to_csv(os.path.join(file_path, "strings_test.csv"))
1 loop, best of 3: 1.55 s per loop
%timeit df_period.to_csv(os.path.join(file_path, "period_test.csv"))
1 loop, best of 3: 33.1 s per loop
Writing is roughly 20x faster for 1 million rows of this example dataset when the Period column is either removed or converted to strings first. The difference is less pronounced, but still present, at 100k rows. Why is this happening?
When I manually interrupt the slow to_csv call, I commonly see the script executing the following function:
return lambda x: Period._from_ordinal(ordinal=x, freq=self.freq)
Has something changed in pandas 0.24.x that caused this performance regression?
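For now, my workaround is to convert the Period column to strings before writing. A vectorised astype(str) should do the same conversion as the row-wise apply above (a minimal sketch on a small frame; column names are just illustrative):

```python
import pandas as pd

# Small illustrative frame with a Period[D] column
df = pd.DataFrame({"days": pd.period_range("2012-01-01", periods=5, freq="D")})

# Vectorised conversion to strings, instead of .apply(lambda x: str(x)),
# so to_csv only has to format plain Python strings
df["days"] = df["days"].astype(str)

print(df["days"].tolist())
```

After this conversion the column is a plain object column of strings, and to_csv runs at the speed shown for df_strings above.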