
I'm using pandas `.astype()` with a dict that maps column names to their intended dtypes. It works for str, int, datetime64[ns], and float, but fails on timedelta64[ns]. When I run the code below I get `ValueError: Could not convert object to NumPy timedelta`.

import pandas as pd
import numpy as np

sample_row = pd.DataFrame([['g1', 
                            3912841, 
                            '2018-09-29 16:03:49', 
                            4.040196e+09, 
                            '1 days 15:49:38']], 
                          columns=['group',
                                   'job_number', 
                                   'submission_time', 
                                   'maxvmem', 
                                   'wait_time'])

sample_row = (sample_row.astype(dtype={'group':'str', 
                                       'job_number':'int', 
                                       'submission_time':'datetime64[ns]', 
                                       'maxvmem':'float', 
                                       'wait_time':'timedelta64[ns]'}))

I found this answer to a similar question but it seems to suggest I'm using the correct dtype format.


Update: Here's the same code with the suggested change from @hpaulj:

import pandas as pd
import numpy as np

sample_row = pd.DataFrame([['g1', 
                            3912841, 
                            '2018-09-29 16:03:49', 
                            4.040196e+09, 
                            pd.Timedelta('1 days 15:49:38')]],
                          columns=['group',
                                   'job_number', 
                                   'submission_time', 
                                   'maxvmem', 
                                   'wait_time'])

sample_row = (sample_row.astype(dtype={'group':'str', 
                                       'job_number':'int', 
                                       'submission_time':'datetime64[ns]', 
                                       'maxvmem':'float', 
                                       'wait_time':'timedelta64[ns]'}))

To confirm that the dtypes are set correctly:

for i in sample_row.loc[0, sample_row.columns]:
    print(type(i))

Output:

<class 'str'>
<class 'numpy.int32'>
<class 'pandas._libs.tslib.Timestamp'>
<class 'numpy.float64'>
<class 'pandas._libs.tslib.Timedelta'>
Karl Baker
  • Your `astype` doesn't convert the `str` Series. – hpaulj Feb 06 '19 at 18:17
  • It doesn't look like `pandas` has converted the `'1 days 15:49:38'` string to anything. `sample_row['wait_time'].item()` is just a string. `numpy` can't create a `timedelta64` object from that string. – hpaulj Feb 06 '19 at 18:24
  • Try using `pd.Timedelta('1 days 15:49:38')` when creating the DataFrame. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html – hpaulj Feb 06 '19 at 18:43
  • Thanks, @hpaulj. That works for the single value, but I'm not sure how it will work with my df of 4.5 million rows! I updated my post with your solution in case anyone else can benefit from it. – Karl Baker Feb 07 '19 at 08:00
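
For a frame with millions of rows, one option is to convert the whole string column in a single vectorized call with `pd.to_timedelta`, rather than wrapping each value in `pd.Timedelta` by hand. This is only a sketch: it assumes every `wait_time` string has the same '1 days 15:49:38' shape as the sample, and the second row is made-up data added for illustration.

```python
import pandas as pd

# Two-row stand-in for the real frame; the second row is invented sample data.
df = pd.DataFrame({
    'group': ['g1', 'g2'],
    'job_number': [3912841, 3912842],
    'submission_time': ['2018-09-29 16:03:49', '2018-09-30 08:00:00'],
    'maxvmem': [4.040196e+09, 1.5e+09],
    'wait_time': ['1 days 15:49:38', '0 days 00:05:00'],
})

# Cast the columns that astype already handles in the question.
df = df.astype({'group': 'str', 'job_number': 'int', 'maxvmem': 'float'})

# Parse the date and duration strings column-wise instead of via astype.
df['submission_time'] = pd.to_datetime(df['submission_time'])
df['wait_time'] = pd.to_timedelta(df['wait_time'])

# Column-level check of the resulting dtypes.
print(df.dtypes)
```

If some rows might hold malformed durations, `pd.to_timedelta(..., errors='coerce')` turns them into `NaT` instead of raising.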
