
I have an issue with Python + NumPy + Pandas.

I have a list of timestamps, with millisecond accuracy, encoded as strings. I round them to 10 ms resolution, and that goes well. The bug comes when I add the rounded timestamps to a DataFrame as a new column - the values of the datetime64 objects get totally destroyed.

Am I doing something wrong? Or is that a Pandas/NumPy error?

BTW, I suspect that this bug comes up only on Windows - I did not notice it when I tried the same code on a Mac yesterday (though I have not verified this).

import numpy
import pandas as pd

# We create a list of strings. 
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored, 
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time 

# Then we create a Pandas DataFrame and assign the time array as a 
# column to it. The datetime64 is destroyed.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df

The output I get:

   one  two              wrong_time
a  1.0  1.0 1974-10-01 18:11:07.585
b  2.0  2.0 1974-10-01 18:11:07.625
c  3.0  3.0 1974-10-01 18:11:07.645
d  NaN  4.0 1974-10-01 18:11:07.665

Output of pd.show_versions():

INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
  • You can use `pd.to_datetime(time_str_arr)` – cs95 Aug 06 '17 at 18:30
  • I tried pd.to_datetime(time_str_arr). It did not change anything. The bug is not in converting the strings to datetime - that step works fine. The bug is that the datetime64 array gets destroyed (or not imported correctly) when I try to add it to the DataFrame. – Girts Strazdins Aug 06 '17 at 18:33

2 Answers


In my opinion it is a bug, because the numpy.datetime64 values are evidently mis-cast to Timestamps internally.
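
A minimal sketch of where the bogus 1974 dates appear to come from (an inference from the output, not verified against the pandas source): the array's underlying int64 values count ticks of 10 ms, but they seem to get reinterpreted as plain milliseconds, with the 10x multiplier dropped:

import numpy as np
import pandas as pd

arr = np.array(['2017-06-30T13:51:15.854'], dtype='datetime64[10ms]')
# The underlying int64 counts ticks of 10 ms since the epoch:
print(arr.view('int64'))                      # [149883067585]
# Reading that same integer as plain milliseconds reproduces the
# bogus date from the question:
print(pd.Timestamp(149883067585, unit='ms'))  # 1974-10-01 18:11:07.585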

What works for me is using to_datetime, which handles the unit conversion correctly:

df = df.assign(wrong_time=pd.to_datetime(rounded_time))
print (df)
   one  two              wrong_time
a  1.0  1.0 2017-06-30 13:51:15.850
b  2.0  2.0 2017-06-30 13:51:16.250
c  3.0  3.0 2017-06-30 13:51:16.450
d  NaN  4.0 2017-06-30 13:51:16.650

Another solution is to cast to ns - astype between datetime64 units performs a proper unit conversion, so the values come through intact:

df = df.assign(wrong_time=rounded_time.astype('datetime64[ns]'))
print (df)
   one  two              wrong_time
a  1.0  1.0 2017-06-30 13:51:15.850
b  2.0  2.0 2017-06-30 13:51:16.250
c  3.0  3.0 2017-06-30 13:51:16.450
d  NaN  4.0 2017-06-30 13:51:16.650

I opened an issue in the pandas GitHub repository and got a suggested solution from Jeff Reback: instead of creating a weird 10 ms datetime64 array, we simply floor the timestamps with the floor() method:

In [16]: # We create a list of strings. 
...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
...:                 '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]:
DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000',
               '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'],
              dtype='datetime64[ns]', freq=None)

Solution from https://github.com/pandas-dev/pandas/issues/17183
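
For completeness, a sketch of the full fix applied to the DataFrame from the question (good_time is just an illustrative column name):

import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Parse straight to ns-resolution datetimes, then floor to 10 ms;
# no intermediate datetime64[10ms] array is ever created.
df = df.assign(good_time=pd.to_datetime(time_str_arr).floor('10ms'))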
