I'm trying to determine how best to store my data AND work with it, since my application deals with lots of data from different sources, in different time zones, formats, etc.
For example, should I store everything as UTC? That means when I fetch data I need to determine what time zone it is currently in and, if it's NOT UTC, do the necessary conversion to make it so. (Note: I'm in EST.)
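To illustrate what I mean by the conversion step, something roughly like this on ingest (just a sketch; `to_utc` is my own name, and I'm assuming naive timestamps are in my local zone, which may not hold for every source):

    import dateutil.parser
    from dateutil import tz

    def to_utc(timestamp_string):
        # Parse the incoming string; if it carries no timezone info,
        # assume my local zone (a big assumption for multi-source data).
        parsed = dateutil.parser.parse(timestamp_string)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=tz.gettz('America/New_York'))
        return parsed.astimezone(tz.tzutc())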
Then, when performing computations on the data, should I convert it (say it's stored as UTC) into MY time zone (EST), so it makes sense when I'm looking at it? Or should I keep it in UTC and do all my calculations there?
A lot of this data is time series and will be graphed, and the graph will be in EST.
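To make that concrete, at display time I'd do a conversion along these lines (just a sketch; I'm using America/New_York rather than a fixed EST offset, since that handles daylight saving):

    from dateutil import tz

    def utc_to_local(utc_dt):
        # Convert a timezone-aware UTC datetime to my local zone
        # for graphing; America/New_York covers both EST and EDT.
        return utc_dt.astimezone(tz.gettz('America/New_York'))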
This is a Python project, so let's say I have a data structure like this:
"id1": {
"interval": 60, <-- seconds, subDict['interval']
"last": "2013-01-29 02:11:11.151996+00:00" <-- UTC, subDict['last']
},
And I need to operate on this by determining whether the current time (now()) is > last + interval (i.e., have the 60 seconds elapsed)? So in code:
    import datetime
    import dateutil.parser
    from dateutil import tz

    lastTime = dateutil.parser.parse(subDict['last'])
    utcNow = datetime.datetime.utcnow().replace(tzinfo=tz.tzutc())
    if lastTime + datetime.timedelta(seconds=subDict['interval']) < utcNow:
        print "Time elapsed, do something!"
Does that make sense? I'm working with UTC everywhere, both in storage and in computation...
Also, if anyone has links to good write-ups on how to work with timestamps in software, I'd love to read them. Something like a Joel on Software piece for timestamp usage in applications?