
I want to convert a huge file containing datetime strings to seconds since the UNIX epoch (January 1, 1970) in C++. The computation needs to be very fast because I have to handle a large number of datetimes.

So far I've tried two options. The first was to use mktime, defined in time.h. The second option I tried was Howard Hinnant's date library with time zone extension.

Here is the code I used to compare the performance between mktime and Howard Hinnant's tz:

for( int i=0; i<RUNS; i++){
    genrandomdate(&time_str);

    time_t t = mktime(&time_str);

}

auto tz = current_zone();
for( int i=0; i<RUNS; i++){

    genrandomdate(&time_str);
    auto ymd = year{time_str.tm_year+1900}/(time_str.tm_mon+1)/time_str.tm_mday;
    auto tcurr = make_zoned(tz, local_days{ymd} + 
            seconds{time_str.tm_hour*3600 + time_str.tm_min*60 + time_str.tm_sec}, choose::earliest);
    auto tbase = make_zoned("UTC", local_days{January/1/1970});
    auto dp = tcurr.get_sys_time() - tbase.get_sys_time() + 0s;

}

The results of the comparison:

time for mktime : 0.000142s
time for tz : 0.018748s

The performance of tz is poor compared to mktime. I want something faster than mktime, because mktime is also very slow when called repeatedly for a large number of iterations. Java's Calendar provides a very fast way to do this, but I don't know of any C++ alternative when time zones are also in play.

Note: Howard Hinnant's date works very fast (even surpassing Java) when used without time zones. But that is not enough for my requirements.

Nicol Bolas
charitha22
  • Which part of this computation is the one causing the problem? That is, you spend some operations converting the time into a YMD plus some time adding in the hours/minutes/seconds. And then you do a seemingly pointless bit at the end, where you repeatedly compute a *constant value* (a value which is *zero*, since that's how UNIX time works, and `get_sys_time` is always UNIX time). – Nicol Bolas May 17 '19 at 18:16
  • Thanks for the comment. The bottleneck here is `make_zoned`. What I want to do is compute the seconds from epoch for some time point in a specific time zone. – charitha22 May 17 '19 at 18:26
  • OK, but why do you need to call `make_zoned` *twice* to do that? `get_sys_time` already returns the duration since the UNIX time epoch. – Nicol Bolas May 17 '19 at 18:27
  • Yeah, sure. The second `make_zoned` call can be ignored. I think `make_zoned` accesses a time zone database every time. Is there any workaround to avoid that? Thanks – charitha22 May 17 '19 at 18:40
  • `make_zoned` only accesses the database if you pass in the _name_ of a time zone (which it has to look up). If you pass in a pointer to a `time_zone`, all it does is store that pointer. – Howard Hinnant May 17 '19 at 18:44
  • It is still very slow, even if I pass the time zone from `current_zone()` to `make_zoned` (i.e. ignore the last 2 statements above) – charitha22 May 17 '19 at 21:01
  • I presume you're compiling with full optimization on? – Howard Hinnant May 17 '19 at 21:31
  • I tried with -O3, but it is still no better than mktime. Note that Java Calendar is much faster than mktime. Looks like there's no way to reach that speed, or maybe I am missing something. What I know is that Java does not look up the time zone every time; maybe that's the reason. Thanks! – charitha22 May 18 '19 at 00:16
  • [CCTZ](https://github.com/google/cctz) can do this much faster. I think it would be great if @HowardHinnant's library could avoid the binary search and allow presetting the time zone to do a constant-time conversion. date has a much cleaner interface than CCTZ, IMO – charitha22 May 18 '19 at 02:49
  • Did you try `-DUSE_OS_TZDB=1` with optimizations on? – Howard Hinnant May 18 '19 at 02:50
  • Also, if by any chance your time points are not random, but likely to be close to each other, you could extract the `offset`, `begin` and `end` for the first time point, and then use that until `tp-offset >= end`, before you make another call to `time_zone->get_info()`. – Howard Hinnant May 18 '19 at 02:52

2 Answers


There are some things you can do to optimize your use of Howard Hinnant's date library:

auto tbase = make_zoned("UTC", local_days{January/1/1970});

The lookup of a timezone (even "UTC") involves doing a binary search of the database for a timezone with that name. It is quicker to do a lookup once, and reuse the result:

// outside of loop:
auto utc_tz = locate_zone("UTC");

// inside of loop:
auto tbase = make_zoned(utc_tz, local_days{January/1/1970});

Moreover, I note that tbase is loop-independent, so the whole thing could be moved outside of the loop:

// outside of loop:
auto tbase = make_zoned("UTC", local_days{January/1/1970});

Here's a further minor optimization to be made. Change:

auto dp = tcurr.get_sys_time() - tbase.get_sys_time() + 0s;

To:

auto dp = tcurr.get_sys_time().time_since_epoch();

This gets rid of the need for tbase altogether. tcurr.get_sys_time().time_since_epoch() is the duration of time since 1970-01-01 00:00:00 UTC, in seconds. The precision of seconds is just for this example, since the input has seconds precision.
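As a minimal sketch of why tbase is unnecessary, the following uses only the standard `<chrono>` header rather than the date library. It assumes that `std::chrono::system_clock` counts from the UNIX epoch, which C++20 guarantees and mainstream implementations already did before that. The helper name `seconds_since_epoch` is made up for this illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: seconds since 1970-01-01 00:00:00 UTC
// for a system_clock time_point.
std::chrono::seconds
seconds_since_epoch(std::chrono::system_clock::time_point tp)
{
    using namespace std::chrono;
    // Direct form: no separate epoch object is needed.
    auto direct = duration_cast<seconds>(tp.time_since_epoch());
    // Assumption check: subtracting an explicitly constructed epoch
    // (time_t 0) yields the same duration.
    assert(direct == duration_cast<seconds>(tp - system_clock::from_time_t(0)));
    return direct;
}
```

Calling it with `std::chrono::system_clock::from_time_t(t)` should give back `t` seconds under the stated assumption.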

Style nit: Try to avoid putting conversion factors in your code. This means changing:

auto tcurr = make_zoned(tz, local_days{ymd} + 
        seconds{time_str.tm_hour*3600 + time_str.tm_min*60 + time_str.tm_sec}, choose::earliest);

to:

auto tcurr = make_zoned(tz, local_days{ymd} + hours{time_str.tm_hour} + 
                        minutes{time_str.tm_min} + seconds{time_str.tm_sec},
                        choose::earliest);
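To illustrate the style nit with standard `<chrono>` types only, here is a small sketch. The helper name `time_of_day` is hypothetical; the point is that adding `hours`, `minutes` and `seconds` lets the library perform the unit conversions that the hand-written factors 3600 and 60 would otherwise do.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper: time of day of a broken-down time as a duration.
std::chrono::seconds
time_of_day(const std::tm& t)
{
    using namespace std::chrono;
    assert(0 <= t.tm_hour && t.tm_hour < 24);
    // The sum has seconds precision; no conversion factors appear here.
    return hours{t.tm_hour} + minutes{t.tm_min} + seconds{t.tm_sec};
}
```

The result compares equal to `seconds{tm_hour*3600 + tm_min*60 + tm_sec}`, so the change is purely stylistic.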

> Is there a way to avoid this binary search if this time zone is also fixed? I mean, can we get the time zone offset and DST offset and manually adjust the time point?

If you are not on Windows, try compiling with -DUSE_OS_TZDB=1. This uses a compiled form of the database, which can perform better.

There is a way to get the offset and apply it manually (https://howardhinnant.github.io/date/tz.html#local_info), however unless you know that your offset doesn't change with the value of the time_point, you're going to end up reinventing the logic under the hood of make_zoned.

But if you are confident that your UTC offset is constant, here's how you can do it:

auto tz = current_zone();
// Use a sample time_point to get the utc_offset:
auto info = tz->get_info(
    local_days{year{time_str.tm_year+1900}/(time_str.tm_mon+1)/time_str.tm_mday}
      + hours{time_str.tm_hour} + minutes{time_str.tm_min}
      + seconds{time_str.tm_sec});
seconds utc_offset = info.first.offset;
for( int i=0; i<RUNS; i++){

    genrandomdate(&time_str);
    // Apply the offset manually:
    auto ymd = year{time_str.tm_year+1900}/(time_str.tm_mon+1)/time_str.tm_mday;
    auto tp = sys_days{ymd} + hours{time_str.tm_hour} +
              minutes{time_str.tm_min} + seconds{time_str.tm_sec} - utc_offset;
    auto dp = tp.time_since_epoch();
}
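For comparison, the same constant-offset idea can be sketched without the date library at all. This is not the library's API: it is a hand-rolled version built on the public-domain `days_from_civil` algorithm, and the names `day_count` and `to_unix_fixed_offset` are made up for this example. It is only valid under the same assumption as above, namely that the UTC offset never changes.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>
#include <ratio>

// Days since 1970-01-01 for a proleptic Gregorian date
// (the days_from_civil algorithm).
std::int64_t days_from_civil(std::int64_t y, unsigned m, unsigned d)
{
    assert(1 <= m && m <= 12 && 1 <= d && d <= 31);
    y -= m <= 2;  // treat Jan/Feb as months 11/12 of the previous year
    const std::int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);            // [0, 399]
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // [0, 146096]
    return era * 146097 + static_cast<std::int64_t>(doe) - 719468;
}

// Hypothetical duration type for whole days.
using day_count = std::chrono::duration<std::int64_t, std::ratio<86400>>;

// Hypothetical helper: local broken-down time -> seconds since the UNIX
// epoch, ASSUMING the UTC offset (positive east of Greenwich) is constant.
std::chrono::seconds
to_unix_fixed_offset(const std::tm& t, std::chrono::seconds utc_offset)
{
    using namespace std::chrono;
    const day_count d{days_from_civil(t.tm_year + 1900,
                                      static_cast<unsigned>(t.tm_mon + 1),
                                      static_cast<unsigned>(t.tm_mday))};
    return d + hours{t.tm_hour} + minutes{t.tm_min} + seconds{t.tm_sec}
             - utc_offset;
}
```

For a zone that observes daylight saving time this gives wrong results for part of the year, which is why the offset has to be known to be constant.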

Update -- My own timing tests

I'm running macOS 10.14.4 with Xcode 10.2.1. I've created a relatively quiet machine: Time machine backup is not running. Mail is not running. iTunes is not running.

I have the following application, which implements the desired conversion using several different techniques, depending upon preprocessor settings:

#include "date/tz.h"
#include <cassert>
#include <ctime>     // tm, time_t, mktime
#include <iostream>
#include <vector>

constexpr int RUNS = 1'000'000;
using namespace date;
using namespace std;
using namespace std::chrono;

vector<tm>
gendata()
{
    vector<tm> v;
    v.reserve(RUNS);
    auto tz = current_zone();
    auto tp = floor<seconds>(system_clock::now());
    for (auto i = 0; i < RUNS; ++i, tp += 1s)
    {
        zoned_seconds zt{tz, tp};
        auto lt = zt.get_local_time();
        auto d = floor<days>(lt);
        year_month_day ymd{d};
        auto s = lt - d;
        auto h = floor<hours>(s);
        s -= h;
        auto m = floor<minutes>(s);
        s -= m;
        tm x{};
        x.tm_year = int{ymd.year()} - 1900;
        x.tm_mon = unsigned{ymd.month()} - 1;
        x.tm_mday = unsigned{ymd.day()};
        x.tm_hour = h.count();
        x.tm_min = m.count();
        x.tm_sec = s.count();
        x.tm_isdst = -1;
        v.push_back(x);
    }
    return v;
}


int
main()
{

    auto v = gendata();
    vector<time_t> vr;
    vr.reserve(v.size());
    auto tz = current_zone();  // Using date
    sys_seconds begin;         // Using date, optimized
    sys_seconds end;           // Using date, optimized
    seconds offset{};          // Using date, optimized

    auto t0 = steady_clock::now();
    for(auto const& time_str : v)
    {
#if 0  // Using mktime
        auto t = mktime(const_cast<tm*>(&time_str));
        vr.push_back(t);
#elif 1  // Using date, easy
        auto ymd = year{time_str.tm_year+1900}/(time_str.tm_mon+1)/time_str.tm_mday;
        auto tp = local_days{ymd} + hours{time_str.tm_hour} +
                  minutes{time_str.tm_min} + seconds{time_str.tm_sec};
        zoned_seconds zt{tz, tp};
        vr.push_back(zt.get_sys_time().time_since_epoch().count());
#elif 0  // Using date, optimized
        auto ymd = year{time_str.tm_year+1900}/(time_str.tm_mon+1)/time_str.tm_mday;
        auto tp = local_days{ymd} + hours{time_str.tm_hour} +
                  minutes{time_str.tm_min} + seconds{time_str.tm_sec};
        sys_seconds zt{(tp - offset).time_since_epoch()};
        if (!(begin <= zt && zt < end))
        {
            auto info = tz->get_info(tp);
            offset = info.first.offset;
            begin = info.first.begin;
            end = info.first.end;
            zt = sys_seconds{(tp - offset).time_since_epoch()};
        }
        vr.push_back(zt.time_since_epoch().count());
#endif
    }
    auto t1 = steady_clock::now();

    cout << (t1-t0)/v.size() << " per conversion\n";
    auto i = vr.begin();
    for(auto const& time_str : v)
    {
        auto t = mktime(const_cast<tm*>(&time_str));
        assert(t == *i);
        ++i;
    }
}

Each solution is timed, and then checked for correctness against a baseline solution. Each solution converts 1,000,000 timestamps, all relatively close together temporally, and outputs the average time per conversion.

I present four solutions, and their timings in my environment:

1. Use mktime.

Output:

3849ns per conversion

2. Use tz.h in the easiest way with USE_OS_TZDB=0

Output:

3976ns per conversion

This is slightly slower than the mktime solution.

3. Use tz.h in the easiest way with USE_OS_TZDB=1

Output:

55ns per conversion

This is much faster than the above two solutions. However this solution is not available on Windows (at this time), and on macOS does not support the leap seconds part of the library (not used in this test). Both of these limitations are caused by how the OS ships their time zone databases.

4. Use tz.h in an optimized way, taking advantage of the a-priori knowledge of temporally grouped time stamps. If the assumption is false, performance suffers, but correctness is not compromised.

Output:

15ns per conversion

This result is roughly independent of the USE_OS_TZDB setting. But the performance relies on the fact that the input data does not change UTC offsets very often. This solution is also careless with local time points that are ambiguous or non-existent. Such local time points don't have a unique mapping to UTC. Solutions 2 and 3 throw exceptions if such local time points are encountered.
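The caching idea behind solution 4 can be shown in isolation with a sketch that does not depend on the date library. Everything here is hypothetical: `slow_lookup` stands in for `time_zone::get_info`, the toy table hard-codes only the 2019 US Eastern transition instants, and `g_lookups` exists only to show how rarely the slow path runs. Like solution 4, it is careless about ambiguous and non-existent local times.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// UTC offset valid for UTC seconds in [begin, end).
struct OffsetInfo { std::int64_t begin, end, offset; };

int g_lookups = 0;  // counts calls to the slow path (illustration only)

// Hypothetical stand-in for time_zone::get_info(local_time).
OffsetInfo slow_lookup(std::int64_t local_s)
{
    ++g_lookups;
    static const OffsetInfo table[] = {
        {std::numeric_limits<std::int64_t>::min(), 1552201200, -18000},
        {1552201200, 1572760800, -14400},
        {1572760800, std::numeric_limits<std::int64_t>::max(), -18000}};
    for (const auto& e : table)
        if (local_s - e.offset < e.end)  // first interval not yet ended
            return e;
    assert(false);  // unreachable: the last entry is open-ended
    return table[0];
}

class CachedConverter
{
    OffsetInfo cached_{0, 0, 0};  // empty interval forces the first lookup
public:
    // Local seconds (counted as if local time were UTC) -> UNIX seconds.
    std::int64_t to_utc(std::int64_t local_s)
    {
        std::int64_t utc = local_s - cached_.offset;
        if (!(cached_.begin <= utc && utc < cached_.end))
        {
            // Cache miss: refresh offset/begin/end, then redo the subtraction.
            cached_ = slow_lookup(local_s);
            utc = local_s - cached_.offset;
        }
        return utc;
    }
};
```

When consecutive inputs fall within the same offset interval, only the first one pays for the lookup; the rest cost a subtraction and two comparisons.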

Run time error with USE_OS_TZDB

The OP got this stack dump when running on Ubuntu. This crash happens on first access to the time zone database. The crash is caused by empty stub functions provided by the OS for the pthread library. The fix is to explicitly link to the pthreads library (include -lpthread on the command line).

==20645== Process terminating with default action of signal 6 (SIGABRT)
==20645==    at 0x5413428: raise (raise.c:54)
==20645==    by 0x5415029: abort (abort.c:89)
==20645==    by 0x4EC68F6: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==20645==    by 0x4ECCA45: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==20645==    by 0x4ECCA80: std::terminate() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==20645==    by 0x4ECCCB3: __cxa_throw (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==20645==    by 0x4EC89B8: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==20645==    by 0x406AF9: void std::call_once<date::time_zone::init() const::{lambda()#1}>(std::once_flag&, date::time_zone::init() const::{lambda()#1}&&) (mutex:698)
==20645==    by 0x40486C: date::time_zone::init() const (tz.cpp:2114)
==20645==    by 0x404C70: date::time_zone::get_info_impl(std::chrono::time_point<date::local_t, std::chrono::duration<long, std::ratio<1l, 1l> > >) const (tz.cpp:2149)
==20645==    by 0x418E5C: date::local_info date::time_zone::get_info<std::chrono::duration<long, std::ratio<1l, 1l> > >(std::chrono::time_point<date::local_t, std::chrono::duration<long, std::ratio<1l, 1l> > >) const (tz.h:904)
==20645==    by 0x418CB2: std::chrono::time_point<std::chrono::_V2::system_clock, std::common_type<std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1l> > >::type> date::time_zone::to_sys_impl<std::chrono::duration<long, std::ratio<1l, 1l> > >(std::chrono::time_point<date::local_t, std::chrono::duration<long, std::ratio<1l, 1l> > >, date::choose, std::integral_constant<bool, false>) const (tz.h:947)
==20645== 
Howard Hinnant
  • Thanks for your reply! yes `tbase` can be moved outside loop. I tested that option too. however `make_zoned(some_zone_here, ...)` call is still expensive. Is there a way to avoid this binary search if this time zone is also fixed. I mean can we get the time zone offset and DST offset and manually adjust the time point. – charitha22 May 17 '19 at 18:46
  • I've added to my answer to respond to the above comment. – Howard Hinnant May 17 '19 at 19:06
  • thanks. Looks like a viable option for me. But does this take care of the day light saving? – charitha22 May 17 '19 at 19:50
  • By definition, a time zone with daylight saving does not have a constant utc offset. – Howard Hinnant May 17 '19 at 19:52
  • Got it. Then this probably would be incorrect for some time points. I wonder how Java Calendar does it so fast! – charitha22 May 17 '19 at 19:54
  • And is it possible to obtain the daylight saving policy for a specific time zone using this library? Say, what time DST starts and ends depending on the time zone? – charitha22 May 17 '19 at 20:10
  • The `local_info` returned from `time_zone->get_info(local_time)` is a simple struct with two `sys_info` in it: `first` and `second`. If the `local_time` has a unique mapping to utc, then `first` is a `sys_info` describing that mapping : https://howardhinnant.github.io/date/tz.html#sys_info `sys_info` has members `begin` and `end` which are the UTC time_points at which `offset` is valid (when it starts and ends). – Howard Hinnant May 17 '19 at 20:13
  • You sure know your way around this "Howard Hinnant's date library" :-) – Barry May 17 '19 at 20:37
  • thanks for your update. I will try this out again with USE_OS_TZDB=1 :) – charitha22 May 20 '19 at 17:24
  • How do I specify this argument if I want to use the library in my own application? Currently I am using something like `g++-8 -I/date/include date/src/tz.cpp -lcurl tz_vs_mktime.cpp`. On a separate note, with the current build instructions cmake complains about "USE_OS_TZDB"; is it "USE_SYSTEM_TZ_DB"? – charitha22 May 20 '19 at 17:34
  • I do not recommend the use of CMake for this library unless you are a CMake expert and are willing to modify the CMakeLists.txt file to meet your needs. I recommend simply: `g++-8 -I/date/include date/src/tz.cpp tz_vs_mktime.cpp -DUSE_OS_TZDB`. Note that it is not necessary to link to curl with this option. – Howard Hinnant May 20 '19 at 18:07
  • Thanks! I tested this with my program above. It is throwing an error : terminate called after throwing an instance of 'std::system_error' what(): Unknown error -1 Aborted (core dumped) – charitha22 May 20 '19 at 18:17
  • I'm not aware that this library throws such an error, though it may call something that throws that error and lets it propagate through. Do you by any chance have a stack dump? – Howard Hinnant May 20 '19 at 18:43
  • I've added the error I get above. Please have a look. I compiled my program as you suggested. How does the library work with this option? Maybe the OS does not have the necessary packages installed. I ran it on Ubuntu 16.04. – charitha22 May 20 '19 at 20:45
  • Yes, it is possible that `current_zone()` is not ported to Ubuntu 16.04 (I have no idea whether it is or not). I don't see your error above. Save changes? – Howard Hinnant May 20 '19 at 21:22
  • Here's another communication channel which might have lower latency: https://gitter.im/HowardHinnant/date – Howard Hinnant May 20 '19 at 21:24

I found that Google's CCTZ can do the same thing.

charitha22
  • Faster than 15ns per conversion? Do you have a benchmark for that? Because Howard does, and even the unoptimized version matches `mktime`. – Nicol Bolas May 19 '19 at 03:02