
So, I'm working with a relatively big dataset and I feel like it's taking too much time to convert the columns into their proper dtypes.

So far, I'm using apply with to_datetime and to_numeric like so:

df.iloc[:,[0,1,9]] = df.iloc[:,[0,1,9]].apply(pd.to_datetime, errors='coerce')
df.iloc[:,2:8] = df.iloc[:,2:8].apply(pd.to_numeric, errors='coerce')

I was able to convert the columns, but it took ~20 minutes. There must be a quicker way?

If not, are my only choices to cut down the dataset for data exploration or get a faster computer?

EDIT: The problem was mainly due to calling to_datetime without specifying a date and time format. Performance also improved when I removed iloc and apply, though not nearly as much as it did from specifying the format.

Here's the time each scenario took:

  • No formatting using iloc took 1027.11 s to run
  • No formatting without using iloc took 789.15 s to run
  • datetime with formatting took 19.47 s to run

Huge improvement. This was on a dataset with 2,049,280 rows. Thanks @ScottBoston and @DiegoAgher!
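
For anyone who wants to reproduce this kind of comparison on a smaller scale, here is a rough sketch on synthetic data. The row count, the day-first timestamp layout, and the variable names are all assumptions made for illustration, not my actual dataset. On recent pandas versions the gap may be much smaller than what I measured, since the format appears to be inferred once from the first value rather than per row.

```python
import time

import pandas as pd

# Synthetic data: 50,000 timestamp strings in a made-up day-first layout.
stamps = pd.date_range("2017-01-01", periods=50_000, freq="min")
raw = pd.Series(stamps.strftime("%d/%m/%Y %H:%M:%S"))

# 1) Let pandas work out the format on its own.
start = time.perf_counter()
inferred = pd.to_datetime(raw, errors="coerce", dayfirst=True)
t_inferred = time.perf_counter() - start

# 2) Tell pandas the exact format up front.
start = time.perf_counter()
explicit = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S", errors="coerce")
t_explicit = time.perf_counter() - start

print(f"no format:   {t_inferred:.2f} s")
print(f"with format: {t_explicit:.2f} s")
```

Both calls should produce the same values; only the parsing time differs.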

Jacques Thibodeau

1 Answer


The apply function usually adds noticeable execution time. Column-based operations are faster; you could do:

df['column0'] = pd.to_datetime(df['column0'], errors='coerce')

and so on for the rest of the columns.
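
If writing one line per column gets tedious, a loop over column labels picked by position is one way to keep the per-column calls. This is only a sketch: the sample frame, its column names, and the positions used are made up for illustration.

```python
import pandas as pd

# Hypothetical sample frame; names and values are invented for this example.
df = pd.DataFrame({
    "start": ["2017-05-22 15:46:00", "not a date"],
    "end": ["2017-05-22 16:10:00", "2017-05-23 08:00:00"],
    "reading": ["1.5", "oops"],
    "count": ["3", "4"],
})

# Pick columns by position (as with iloc), but keep their labels.
datetime_cols = df.columns[[0, 1]]
numeric_cols = df.columns[2:4]

# Convert one column at a time; unparseable values become NaT / NaN.
for col in datetime_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

Each column is assigned back by label, so no block assignment through iloc is needed.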

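Also, if the column has a specific, known format, you could try specifying it as a strftime-style string to speed things up (the pattern below is only an example; use the one that matches your data):

df['column0'] = pd.to_datetime(df['column0'], format='%Y-%m-%d %H:%M:%S', errors='coerce')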
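
To make the format argument concrete, here is a minimal sketch with a made-up day-first layout. The sample strings and the `DATE_FORMAT` name are assumptions for illustration; the real pattern depends on how your column is written.

```python
import pandas as pd

# Hypothetical raw values, written as day/month/year hour:minute.
raw = pd.Series(["22/05/2017 15:46", "23/05/2017 08:00", "garbage"])

# strftime-style pattern describing the layout above (an assumption).
DATE_FORMAT = "%d/%m/%Y %H:%M"

# With an explicit format pandas does not need to guess the layout;
# values that do not match are turned into NaT by errors="coerce".
parsed = pd.to_datetime(raw, format=DATE_FORMAT, errors="coerce")

print(parsed)
```

Note that with errors='coerce', a wrong format string silently produces NaT for every row, so it is worth checking how many values came back missing.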
Diego Aguado
  • The OP is iterating over the cols of interest so what they're trying already takes this into account – EdChum May 22 '17 at 15:46
  • I think I read a while back on Stack Overflow someone mentioned that adding formatting to time will help with speed. Explicitly add time format string to the to_datetime. – Scott Boston May 22 '17 at 15:48
  • @EdChum I just wanted to take away the `.iloc[:, ...]` operation since I thought this could be adding up execution time. – Diego Aguado May 22 '17 at 15:50
  • @ScottBoston updated the answer with your contribution, which I think is also very relevant ;) – Diego Aguado May 22 '17 at 15:52
  • @DiegoAgher In this case it's not a big deal since I only have 10 columns, but I want to keep iloc in case I use a dataset with many more columns, so I don't have to write a new line for each column... haha – Jacques Thibodeau May 22 '17 at 16:10