Fastest way to convert dtypes for big datasets in Python?

Question

So, I'm working with a relatively big dataset and I feel like it's taking too much time to convert the columns into their proper dtypes.

So far, I'm using apply with to_datetime and to_numeric like so:

df.iloc[:,[0,1,9]] = df.iloc[:,[0,1,9]].apply(pd.to_datetime, errors='coerce')
df.iloc[:,2:8] = df.iloc[:,2:8].apply(pd.to_numeric, errors='coerce')

I was able to convert the columns, but it took ~20 minutes. There must be a quicker way?

If not, are my only choices to cut down the dataset for data exploration or get a faster computer?

EDIT: The problem was mainly that I was calling to_datetime without passing a format string for the date and time. Removing iloc and apply also improved performance, though not nearly as much as specifying the format.

Here's the time each scenario took:

No formatting using iloc took 1027.11 s to run
No formatting without using iloc took 789.15 s to run
datetime with formatting took 19.47 s to run

Huge improvement. This was on a dataset with 2,049,280 rows. Thanks @ScottBoston and @DiegoAgher!
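
A minimal sketch of how such a comparison could be timed. This is not the code behind the numbers above: the sample strings, the format string FMT, and the helper timed_to_datetime are assumptions made for illustration. Recent pandas versions try to infer a single format from the first value, so the gap may be much smaller than the one reported here.

```python
import time

import pandas as pd

# Hypothetical sample: day-first date strings, repeated to get a measurable size.
raw = pd.Series(["22/05/2017 15:43:00", "23/05/2017 16:08:30"] * 10_000)

# Assumed format for this sample; it has to match your own data.
FMT = "%d/%m/%Y %H:%M:%S"


def timed_to_datetime(values, **kwargs):
    """Return (converted, seconds) for a single pd.to_datetime call."""
    start = time.perf_counter()
    converted = pd.to_datetime(values, errors="coerce", **kwargs)
    return converted, time.perf_counter() - start


# Let pandas work out the format on its own.
inferred, t_inferred = timed_to_datetime(raw, dayfirst=True)
# Tell pandas the format up front.
explicit, t_explicit = timed_to_datetime(raw, format=FMT)

print(f"inferred format: {t_inferred:.3f} s")
print(f"explicit format: {t_explicit:.3f} s")
```

Both calls should produce the same values; only the time spent parsing differs.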

How was this dataset created in the first place? If it was read from a file it'd be better to pass hints for the dtypes. — EdChum, May 22 '17 at 15:43
I used pd.read_csv('dataset.txt', sep=';', low_memory=False). I tried using dtypes, but I kept getting an error telling me that I couldn't convert the columns into floats. — Jacques Thibodeau, May 22 '17 at 15:57
See this SO post: https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31 — Scott Boston, May 22 '17 at 15:58
I used dtypes = {'Date': 'str', 'Everything_else': 'float64'} and then inserted dtypes=dtypes inside read_csv. — Jacques Thibodeau, May 22 '17 at 16:01
@ScottBoston Thanks, Scott. I just realized that it is definitely to_datetime which is causing the problem. to_numeric took 5 seconds on its own. I will add the format string to to_datetime and see if it helps. — Jacques Thibodeau, May 22 '17 at 16:08
@JacquesThibodeau Please post back the results. I am curious — Scott Boston, May 22 '17 at 16:09
@ScottBoston It worked! It took ~20 seconds. I'm gonna run it without formatting to see the actual time difference. — Jacques Thibodeau, May 22 '17 at 16:23
@JacquesThibodeau Awesome. What sort of time difference did you get on how many rows? — Scott Boston, May 22 '17 at 17:03
@ScottBoston I just edited the OP a second ago with the answer to your question, have a look! — Jacques Thibodeau, May 22 '17 at 18:10
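
On the dtype hints discussed in the comments above: one possible cause of a "couldn't convert to float" error is a non-numeric placeholder in the numeric columns. That is only a guess about this dataset. The sketch below assumes a hypothetical "?" marker and made-up column names A and B, and shows that listing the marker in na_values lets the float64 hint go through.

```python
import io

import pandas as pd

# Hypothetical file contents; "?" stands in for whatever marker the real file uses.
csv_text = "Date;A;B\n22/05/2017;1.5;2\n23/05/2017;?;3\n"

# Dates stay as strings for now, everything else is read as float64.
dtypes = {"Date": "str", "A": "float64", "B": "float64"}

# na_values turns the marker into NaN before the float conversion happens.
df = pd.read_csv(io.StringIO(csv_text), sep=";", dtype=dtypes, na_values=["?"])

# Parse the dates afterwards with an explicit (assumed) format string.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y", errors="coerce")
```

With the real file, io.StringIO(csv_text) would be replaced by the file path.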

Diego Aguado · Accepted Answer · 2017-05-22T15:51:54.880


The apply function usually adds quite a bit of execution time. Column-based operations are faster, so you could do:

df['column0'] = pd.to_datetime(df['column0'], errors='coerce')

and so on for the rest of the columns.

Also, if the column has a known, fixed format, you could try specifying it explicitly to speed things up (format below stands for your own strptime-style format string, e.g. '%Y-%m-%d'):

df['column0'] = pd.to_datetime(df['column0'], format=format, errors='coerce')
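
If writing one line per column gets tedious, the same idea can be wrapped in a loop over column positions. This is only a sketch: convert_columns, the sample frame, and the '%d/%m/%Y' format are hypothetical and need to be adapted to the real data.

```python
import pandas as pd


def convert_columns(df, datetime_cols, numeric_cols, fmt):
    """Convert columns selected by position, one column at a time."""
    out = df.copy()
    # Datetime columns: pass the format explicitly instead of letting pandas guess.
    for name in out.columns[datetime_cols]:
        out[name] = pd.to_datetime(out[name], format=fmt, errors="coerce")
    # Numeric columns: values that cannot be parsed become NaN.
    for name in out.columns[numeric_cols]:
        out[name] = pd.to_numeric(out[name], errors="coerce")
    return out


# Hypothetical sample frame with two date columns and one numeric column.
sample = pd.DataFrame({
    "start": ["22/05/2017", "not a date"],
    "value": ["1.5", "?"],
    "end": ["23/05/2017", "24/05/2017"],
})

converted = convert_columns(sample, [0, 2], [1], "%d/%m/%Y")
print(converted.dtypes)
```

For the layout in the question, the call would look something like convert_columns(df, [0, 1, 9], slice(2, 8), fmt), assuming all three datetime columns share one format.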

edited May 22 '17 at 15:51

answered May 22 '17 at 15:44


The OP is iterating over the cols of interest so what they're trying already takes this into account – EdChum May 22 '17 at 15:46
I think I read a while back on Stack Overflow that adding a format string helps with speed. Explicitly pass the time format string to to_datetime. – Scott Boston May 22 '17 at 15:48
@EdChum I just wanted to take away the `.iloc[:, ...]` operation since I thought this could be adding up execution time. – Diego Aguado May 22 '17 at 15:50
@ScottBoston updated the answer with your contribution, which I think is also very relevant ;) – Diego Aguado May 22 '17 at 15:52
@DiegoAgher In this case it's not a big deal since I only have 10 columns, but I want to keep iloc in case I use a dataset with many more columns, so I don't have to write a new line for each column... haha – Jacques Thibodeau May 22 '17 at 16:10
