So, I'm working with a relatively big dataset and I feel like it's taking too much time to convert the columns to their proper dtypes.
So far, I'm using apply with to_datetime and to_numeric, like so:
df.iloc[:, [0, 1, 9]] = df.iloc[:, [0, 1, 9]].apply(pd.to_datetime, errors='coerce')
df.iloc[:, 2:8] = df.iloc[:, 2:8].apply(pd.to_numeric, errors='coerce')
I was able to convert the columns, but it took ~20 minutes. Is there a quicker way?
If not, are my only choices to cut down the dataset for data exploration or get a faster computer?
EDIT: The problem was mainly due to calling to_datetime without specifying the date/time format. Performance also improved when I removed iloc and apply, though the gain was not as significant as providing the format (see the sketch at the end).
Here's the time each scenario took:
- No format string, using iloc and apply: 1027.11 s
- No format string, without iloc and apply: 789.15 s
- to_datetime with an explicit format: 19.47 s
Huge improvement. This was on a dataset with 2,049,280 rows. Thanks @ScottBoston and @DiegoAgher!
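For reference, here's a minimal sketch of the faster version. The file name, column names, and format string are assumptions; adapt them to your own data:

import pandas as pd

df = pd.read_csv('data.csv')  # assumed source file

# Passing an explicit format lets to_datetime skip per-row format inference,
# which is where most of the time was going.
date_cols = ['start_time', 'end_time', 'logged_at']  # assumed column names
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Converting columns directly (no iloc + apply) also shaves off some overhead.
for col in df.columns[2:8]:
    df[col] = pd.to_numeric(df[col], errors='coerce')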