I noticed a performance change in one of my scripts that uses pandas' to_csv
to write a dataset to file. Specifically, when writing a long CSV (at ~1 million rows the difference is very pronounced) with a column of type Period[D], performance is extremely poor. Removing the column, or converting it to strings with apply, restores the previous performance.
Two things changed recently in my environment: Windows 7 to Windows 10, and pandas 0.23.x (I believe) to pandas 0.24.2.
A simple example of my problem is below:
import os
import numpy as np
import pandas as pd

file_path = r'<file path here>'

# DataFrame with an int column and a Period[D] column
df_period = pd.DataFrame(data={
    "ints": np.random.randint(0, 100000000, 1000000),
    "days": pd.period_range("2012-01-01", periods=1000000, freq="D"),
})

# Same data, but with the Period column converted to strings
df_strings = df_period.copy()
df_strings["days"] = df_strings["days"].apply(lambda x: str(x))
%timeit df_strings.to_csv(os.path.join(file_path, "strings_test.csv"))
1 loop, best of 3: 1.55 s per loop
%timeit df_period.to_csv(os.path.join(file_path, "period_test.csv"))
1 loop, best of 3: 33.1 s per loop
Writing is roughly 20x faster for 1 million rows of this example dataset when the Period column is either removed or converted to strings first. The difference is less pronounced, but still present, at 100k rows. Why is this happening?
When I manually interrupt the slow to_csv call, I commonly see the script executing the following function:
return lambda x: Period._from_ordinal(ordinal=x, freq=self.freq)
Has something changed in pandas 0.24.x that caused this performance regression?
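For now, my workaround is to convert the Period column to strings before writing. A vectorised astype(str) should do the same conversion as the row-wise apply above (a minimal sketch on a small frame; column names are just illustrative):

```python
import pandas as pd

# Small illustrative frame with a Period[D] column
df = pd.DataFrame({"days": pd.period_range("2012-01-01", periods=5, freq="D")})

# Vectorised conversion to strings, instead of .apply(lambda x: str(x)),
# so to_csv only has to format plain Python strings
df["days"] = df["days"].astype(str)

print(df["days"].tolist())
```

After this conversion the column is a plain object column of strings, and to_csv runs at the speed shown for df_strings above.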