
Hello, I have a std::vector<std::string> containing datetimes like 2011-03-23T12:23:32.123. From this I'd like to generate two vectors of int, holding 20110323 and 122332123 respectively.

I am using a C++ library called Rcpp (that's not really the problem here I think but you never know so I put the Rcpp tag)

I wrote the following, which does the job but is pretty slow. How can I speed it up?

Rcpp::List datetimeToInt(vector<string> datetimes){

    const int N=datetimes.size();
    Rcpp::IntegerVector date(N);  //please consider those as std::vector<int>
    Rcpp::IntegerVector time(N);

    //this is what I want to speed up
    for(int i=0; i<N; ++i){
        datetimes[i].erase(std::remove_if(datetimes[i].begin(), datetimes[i].end(), not1(ptr_fun(::isdigit))), datetimes[i].end());
        date[i] = atoi(datetimes[i].substr(0,8).c_str());
        time[i] = atoi(datetimes[i].substr(8,12).c_str());
    }

    return Rcpp::List::create(_["date"]=date, _["time"]=time); 
}
statquant

3 Answers


Your code is already close to optimal. The only change you could make is to replace this part

    datetimes[i].erase(std::remove_if(datetimes[i].begin(), datetimes[i].end(), not1(ptr_fun(::isdigit))), datetimes[i].end());
    date[i] = atoi(datetimes[i].substr(0,8).c_str());
    time[i] = atoi(datetimes[i].substr(8,12).c_str());

with something more sophisticated and optimized, for example something like this (but I didn't test it):

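// Buffers hold the digits plus a terminating '\0' so atoi() can read them.
char date_c[9] = {0};    // 8 date digits (YYYYMMDD)
char time_c[10] = {0};   // 9 time digits (HHMMSSmmm)
int dateId = 0;
int timeId = 0;

const std::string& s = datetimes[i];
for (std::size_t strId = 0; strId < s.length(); ++strId) {
    if (isdigit(static_cast<unsigned char>(s[strId]))) {
        if (dateId < 8) {
            date_c[dateId++] = s[strId];   // first 8 digits form the date
        } else if (timeId < 9) {
            time_c[timeId++] = s[strId];   // next 9 digits form the time
        }
    }
}

date[i] = atoi(date_c);
time[i] = atoi(time_c);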

It splits each string into its date and time parts in a single pass, without erasing characters or creating substrings.
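The single-pass idea can be pushed a little further by accumulating the integers directly, which avoids the temporary buffers and the calls to atoi. This is only a minimal sketch in plain C++ (no Rcpp): the helper name splitDatetime is hypothetical, and it assumes the input always has the form YYYY-MM-DDTHH:MM:SS.mmm, i.e. 8 date digits followed by 9 time digits.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not part of Rcpp): parses "YYYY-MM-DDTHH:MM:SS.mmm"
// in one pass and returns {date, time}, e.g. {20110323, 122332123}.
std::pair<int, int> splitDatetime(const std::string& s) {
    int date = 0;
    int time = 0;
    int digits = 0;  // number of digits seen so far
    for (std::size_t k = 0; k < s.size(); ++k) {
        const unsigned char c = static_cast<unsigned char>(s[k]);
        if (!std::isdigit(c)) continue;     // skip '-', 'T', ':' and '.'
        if (digits < 8) {
            date = date * 10 + (c - '0');   // first 8 digits: YYYYMMDD
        } else if (digits < 17) {
            time = time * 10 + (c - '0');   // next 9 digits: HHMMSSmmm
        }
        ++digits;
    }
    return std::make_pair(date, time);
}
```

Inside the loop of the question this would be used as something like `std::pair<int, int> r = splitDatetime(datetimes[i]); date[i] = r.first; time[i] = r.second;`. The largest possible time value, 235959999, still fits in a 32-bit int.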

jtomaszk

With a std::vector<std::string> parameter, every string has to be copied when the R character vector is converted. This is a waste of time. You should use a CharacterVector, which does not need to make copies because you work directly with the data.

// [[Rcpp::export]]
List datetimeToInt2(CharacterVector datetimes){

    const int N=datetimes.size();
    IntegerVector date(N); 
    IntegerVector time(N);
    std::string current ; 

    //this is what I want to speed up
    for(int i=0; i<N; ++i){
        current = datetimes[i] ;
        current.erase(std::remove_if(current.begin(), current.end(), std::not1(std::ptr_fun(::isdigit))), current.end());
        date[i] = atoi(current.substr(0,8).c_str());
        time[i] = atoi(current.substr(8,12).c_str());
    }

    return List::create(_["date"]=date, _["time"]=time); 
}        

Let's measure this:

> dates <- rep("2011-03-23T12:23:32.123", 1e+05)
> system.time(res1 <- datetimeToInt(dates))
    user  system elapsed
   0.081   0.006   0.087
> system.time(res2 <- datetimeToInt2(dates))
    user  system elapsed
   0.044   0.000   0.044
> identical(res1, res2)
[1] TRUE    
Romain Francois
  • Thanks Romain! I am not very comfortable with `CharacterVector`; I did not realize you could apply STL algorithms to its elements. Typically it seems that you cannot compare elements of two `CharacterVector`s with `operator==`... I am a bit puzzled about this. – statquant Jun 19 '13 at 11:14
  • Well the important thing is `current = datetimes[i] ;` it gets the element and assigns it into a `std::string`. – Romain Francois Jun 19 '13 at 11:40
  • Don't keep your frustrations private. If there is something you'd like to see (e.g. support for `operator==`), let us know on the Rcpp mailing list. Sometimes stuff isn't there because we haven't had the need yet. – Romain Francois Jun 19 '13 at 11:41
  • "Well the important thing is current = datetimes[i]" -> "pfff, give me a rope". As far as `operator==` is concerned, it was raised here, so I thought that was what `vector` was for: http://stackoverflow.com/questions/7874697/how-to-test-rcppcharactervector-elements-for-equality/7875918#7875918 – statquant Jun 19 '13 at 11:45
  • Ah. Did not remember this one. We can do better than what the answers propose. I'll have a go at implementing `==`. – Romain Francois Jun 19 '13 at 11:48
  • I added the needed `operator==` – Romain Francois Jun 19 '13 at 13:07

You may want to look at the fasttime package by Simon (available here on rforge.net) which does something very similar.

It splits ISO datetime strings (albeit with the 'T' separator), assumed to be UTC times, using just string operations and no date parsing. I use it all the time at work, as it fits my needs there.

And as a note, you may want to think more carefully about when you use STL containers, and when you use Rcpp containers.

Lastly, do not use string or int for date arithmetic or comparisons when you could use proper date types, which R, C++ and Rcpp all have.
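To illustrate the point about arithmetic: with integer-coded dates, 20110331 + 1 gives 20110332, which is not a date. A minimal sketch of doing the step through the standard calendar type std::tm and std::mktime from <ctime> instead is shown here. The helper name addDays is hypothetical, and the code assumes a valid YYYYMMDD input in a range that mktime can represent.

```cpp
#include <ctime>

// Hypothetical helper: adds `days` to a date stored as YYYYMMDD and returns
// the result in the same encoding, letting std::mktime normalize the fields.
int addDays(int yyyymmdd, int days) {
    std::tm t = std::tm();
    t.tm_year  = yyyymmdd / 10000 - 1900;     // years since 1900
    t.tm_mon   = (yyyymmdd / 100) % 100 - 1;  // months since January
    t.tm_mday  = yyyymmdd % 100 + days;       // may run past the month end
    t.tm_hour  = 12;                          // midday avoids DST edge cases
    t.tm_isdst = -1;                          // let the library decide on DST
    std::mktime(&t);                          // normalizes out-of-range fields
    return (t.tm_year + 1900) * 10000 + (t.tm_mon + 1) * 100 + t.tm_mday;
}
```

With this, `addDays(20110331, 1)` yields 20110401 rather than 20110332. Integer encodings remain fine for storage, sorting and equality tests; it is only arithmetic that needs a real calendar.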

Dirk Eddelbuettel
  • Hello Dirk, thanks for your suggestions. They all hold in general, but not really in my particular use case. I am already a user of `fasttime` and `fastPOSIXct`; this is even why I asked this question, since my solution was slower than `fastPOSIXct` even though it does less. I use `integer` for dates and times (down to the millisecond) because I am a heavy user of `data.table`, which leverages `radix` sort (and more). You gain a lot on specific problems using `integer` instead of `double` (my use case). Of course you're right about `POSIXct` vs `string` or `integer` in general. – statquant Jun 19 '13 at 14:21
  • And I am waiting for your book as far as STL vs Rcpp containers are concerned, as my programming knowledge is poor. BUT I am very interested in what you would suggest here ! – statquant Jun 19 '13 at 14:41
  • I would suggest sticking to Rcpp types unless you specifically need something from other types (STL, ...). Rcpp types use R's own memory and therefore might save you from making too many copies. A typical case for using STL types over Rcpp types is when you grow or shrink data structures. Rcpp, being restricted by R's data model, does not do a good job here. – Romain Francois Jun 21 '13 at 11:17
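As a rough illustration of the grow/shrink case from the last comment, here is a minimal sketch in plain C++ where the size of the result is not known in advance, so a std::vector is grown with push_back. The helper name keepWithDigits is hypothetical. In an Rcpp function the finished vector would typically be converted to an R object once at the end (for example with Rcpp::wrap), which is an assumption about usage and is not shown in the code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the strings that contain at least one digit.
// The output size is unknown up front, so the result is grown with push_back.
std::vector<std::string> keepWithDigits(const std::vector<std::string>& in) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::string& s = in[i];
        for (std::size_t k = 0; k < s.size(); ++k) {
            if (std::isdigit(static_cast<unsigned char>(s[k]))) {
                out.push_back(s);  // amortized constant-time growth
                break;             // one digit is enough; go to the next string
            }
        }
    }
    return out;
}
```

When the output length is known beforehand, as in the question where both results have exactly N elements, preallocated Rcpp vectors avoid the extra copy.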