2

Here is my problem in short: I am trying to write my data (containing, among other, np.datetime64 values) to csv and then read them back, and want my times not to change...

As discussed in many places, np.datetime64 keeps everything binary and UTC in mem, but reads strings from local time.

Here is a trivial example of my problem, here pd.read_csv("foo") saved from df.to_csv("foo") results on altering the times:

In[184]: num = np.datetime64(datetime.datetime.now())
In[185]: num
Out[181]: numpy.datetime64('2015-10-28T19:19:42.408000+0100')
In[186]: df = pd.DataFrame({"Time":[num]})
In[187]: df
Out[183]: 
                        Time
0 2015-10-28 18:19:42.408000
In[188]: df.to_csv("foo")
In[189]: df2=pd.read_csv("foo")
In[190]: df2
Out[186]: 
   Unnamed: 0                        Time
0           0  2015-10-28 18:19:42.408000
In[191]: np.datetime64(df2.Time[0])
Out[187]: numpy.datetime64('2015-10-28T18:19:42.408000+0100')
In[192]: num == np.datetime64(df2.Time[0])
Out[188]: False

(as usual:)

import numpy as np
improt pandas as pd

There is a very large number of questions, and lots of info on the web, but i've been googling for a while now and have not been able to find an answer on how to overcome this. There should be some way to save the data in Zulu, or read them supposing UTC, but have not found any directions on which would be the best (or even good?) way to do it. I can do

In[193]: num == np.datetime64(df2.Time[0]+"Z")
Out[189]: True

but that seems to me really bad, in terms of practice, portability and efficiency... (plus its annoying when using the default save and read messes things up)

ntg
  • 12,950
  • 7
  • 74
  • 95

1 Answers1

3

The numpy constructor is simply broken and will rarely do what you want. I would simply avoid. Use instead:

pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time'])

np.datetime64 is merely display in local timezone. It is already stored in UTC.

In [42]: num = np.datetime64(datetime.datetime.now())

In [43]: num
Out[43]: numpy.datetime64('2015-10-28T10:02:22.298130-0400')

In [44]: df = pd.DataFrame({"Time":[num]})

In [45]: df
Out[45]: 
                        Time
0 2015-10-28 14:02:22.298130

In [46]: pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time'])            
Out[46]: 
                        Time
0 2015-10-28 14:02:22.298130

In [47]: pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time']).Time.values
Out[47]: array(['2015-10-28T10:02:22.298130000-0400'], dtype='datetime64[ns]')

[47] is the just a local display. The time is as above.

Internally datetimes are kept as an int64 of ns since epoch.

In [7]: Timestamp('2015-10-28 14:02:22.298130')       
Out[7]: Timestamp('2015-10-28 14:02:22.298130')

In [8]: Timestamp('2015-10-28 14:02:22.298130').value
Out[8]: 1446040942298130000

In [9]: np.array([1446040942298130000],dtype='M8[ns]')
Out[9]: array(['2015-10-28T10:02:22.298130000-0400'], dtype='datetime64[ns]')

In [10]: Timestamp(np.array([1446040942298130000],dtype='M8[ns]').view('i8').item())
Out[10]: Timestamp('2015-10-28 14:02:22.298130')
ntg
  • 12,950
  • 7
  • 74
  • 95
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • I get what numpy is doing, which is exactly as you describe. My problem was that pandas parse_dates=['Time'] does not work the same as the numpy.datetime64 constructor, e.g.: – ntg Oct 29 '15 at 14:57
  • In[46]: pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time']).Time[0]==np.datetime64(pd.read_csv(StringIO(df.to_csv(index=False))).Time[0]) Out[46]: False – ntg Oct 29 '15 at 15:04
  • So, as you write, the answer would be to use parse_dates=['Time']). Without it the resulting string through the numpy constructor can cause problems. – ntg Oct 29 '15 at 15:06
  • the numpy constructor is simply broken and will rarely do what you want. I would simply avoid. – Jeff Oct 29 '15 at 20:10