In my opinion, it makes most sense to preallocate df_final
with NAs and then index-assign the columns from df_data
. The only trick is determining which columns need to be assigned.
I see that you want to right-justify (so-to-speak) the columns within the column sets. So, the requirement amounts to what I would describe as "cumulative matching" of the reversed column types of df_data
in the reversed column types of df_final
. In other words, you need to proceed from right-to-left through the column types of df_data
and df_final
and find the next (from the rightwards direction) match.
I am aware of various non-cumulative/cumulative pairs of functions in R, such as sum()
/cumsum()
, prod()
/cumprod()
, min()
/cummin()
, and max()
/cummax()
(actually I think those are the only ones), however there does not seem to be any kind of "cumulative match" function. So I wrote my own:
cummatch <- function(small,big) {
cur <- 1L;
res <- integer();
biglen <- length(big);
for (s in small) {
if (cur > biglen) break;
rescur <- match(s,big[cur:biglen])+cur-1L;
if (is.na(rescur)) break;
res[length(res)+1L] <- rescur;
cur <- rescur+1L;
};
length(res) <- length(small);
return(res);
};
And now we can use it to get the column indexes to assign:
cis <- ncol(df_final)+1L-rev(cummatch(rev(sapply(df_data,typeof)),rev(sapply(df_final,typeof))));
cis;
## [1] 2 3 6 9
df_final[nrow(df_data),1] <- NA; ## preallocate rows of NA
df_final;
## a b c l m n x y z
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA
df_final[cis] <- df_data;
df_final;
## a b c l m n x y z
## 1 NA 21.3 23.22 NA NA NA NA NA 23.2
## 2 NA 23.1 64.20 NA NA 2 NA NA 90.2
## 3 NA 22.2 46.20 NA NA 3 NA NA 9.1
From a performance perspective, my cummatch()
function probably sucks, considering all the R-level looping and function calling (e.g. repeated invocations of match()
on subvectors of big
). I've been playing around with Rcpp lately, and so decided to try to write a more performant version in Rcpp. I referred to how can I handle vectors without knowing the type in Rcpp to try to figure out how to write a vector-type-agnostic function, and the solution is a bit hacky, involving a C++ template function with a wrapper function that switch
es on the TYPEOF()
the vector, and thus has to basically instantiate a separate call to the function for each case
within the switch
. My function takes two vector arguments, so the RCPP_RETURN_VECTOR()
macro isn't actually sufficient for it, but since both vectors must be the same type (for matching), I was able to massage the macros to work with two arguments rather than one. This involved applying the R type promotion rules manually in one of the macros, which I'm pretty sure I got right. Needless to say, this is probably reaching (or surpassing) the limits of what is reasonable to do with Rcpp. Anyway, here it is:
cppFunction('
using namespace Rcpp;
#define ___RCPP_HANDLE_CASE___2( ___RTYPE___ , ___FUN___ , ___OBJECT___1 , ___OBJECT___2 , ___RCPPTYPE___ ) \\
case ___RTYPE___ : \\
return ___FUN___( ::Rcpp::___RCPPTYPE___< ___RTYPE___ >( ___OBJECT___1 ), ::Rcpp::___RCPPTYPE___< ___RTYPE___ >( ___OBJECT___2 ) ) ;
#define ___RCPP_RETURN___2( __FUN__, __SEXP__1 , __SEXP__2, __RCPPTYPE__ ) \\
SEXP __TMP__1 = __SEXP__1 ; \\
SEXP __TMP__2 = __SEXP__2 ; \\
unsigned int __TMP__1_TYPE = TYPEOF( __TMP__1 ); \\
unsigned int __TMP__2_TYPE = TYPEOF( __TMP__2 ); \\
unsigned int __TMP__TYPE = __TMP__1_TYPE == RAWSXP ? __TMP__2_TYPE : __TMP__2_TYPE == RAWSXP ? __TMP__1_TYPE : std::max(__TMP__1_TYPE,__TMP__2_TYPE); /* note: the SEXPTYPE enumeration order *almost* aligns with the R type promotion rules; only raw is out-of-order, so we can test for that first, then use std::max() */ \\
if (__TMP__1_TYPE < LGLSXP || __TMP__2_TYPE < LGLSXP) __TMP__TYPE = 0; \\
switch( __TMP__TYPE ) { \\
___RCPP_HANDLE_CASE___2( INTSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
___RCPP_HANDLE_CASE___2( REALSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
___RCPP_HANDLE_CASE___2( RAWSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
___RCPP_HANDLE_CASE___2( LGLSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
___RCPP_HANDLE_CASE___2( CPLXSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
___RCPP_HANDLE_CASE___2( STRSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
/* no == for generic ___RCPP_HANDLE_CASE___2( VECSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) */ \\
/* no == for expression ___RCPP_HANDLE_CASE___2( EXPRSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) */ \\
default: \\
throw std::range_error( "not a vector" ) ; \\
}
#define RCPP_RETURN_VECTOR2( _FUN_, _SEXP_1, _SEXP_2 ) ___RCPP_RETURN___2( _FUN_, _SEXP_1, _SEXP_2, Vector )
#define RCPP_RETURN_MATRIX2( _FUN_, _SEXP_1, _SEXP_2 ) ___RCPP_RETURN___2( _FUN_, _SEXP_1, _SEXP_2, Matrix )
template<typename T> IntegerVector cummatch_impl(T small, T big ) {
int smalllen = LENGTH(small);
IntegerVector res(smalllen,NA_INTEGER);
int cur = 0;
int biglen = LENGTH(big);
for (int si = 0; si < smalllen; ++si) {
int rescur = NA_INTEGER;
for (int bi = cur; bi < biglen; ++bi) {
if (small(si) == big(bi)) {
rescur = bi;
break;
}
}
if (rescur == NA_INTEGER) break;
res(si) = rescur+1;
cur = rescur+1;
}
return res;
}
// [[Rcpp::export]]
IntegerVector cummatch(SEXP small, SEXP big ) { RCPP_RETURN_VECTOR2(cummatch_impl,small,big); }
');