2

I'm looking to force one dataframe to fit the structure of another, according to certain criteria

Example data

## to be populated:
df_final <- data.frame("a"=numeric(), "b"=numeric(), "c"=numeric(), 
                       "l"=integer(), "m"=integer(), "n"=integer(), 
                       "x"=numeric(), "y"=numeric(), "z"=numeric())

> df_final
[1] a b c l m n x y z
<0 rows> (or 0-length row.names)

## data to coerce into df_final
df_data <- data.frame(col1=c(21.3,23.1,22.2),
                      col2=c(23.22,64.2,46.2), 
                      col3=c(NA_integer_,2L,3L), 
                      col4=c(23.2, 90.2,9.1))

> df_data
  col1  col2 col3 col4
1 21.3 23.22   NA 23.2
2 23.1 64.20    2 90.2
3 22.2 46.20    3  9.1

df_data has three 'sets' of columns:

  1. set1: up to 3 columns will be a 'decimal number' (the left-most columns)
  2. set2: up to 3 columns will be an integer
  3. set3: up to 3 columns will be a 'decimal number' (the right-most columns)

However, df_data will not always have 9 columns, and there may be some missing data in some of the columns (as in the example). And the column names of df_data will not match those in df_final

I need to 'fit' df_data into df_final, according to the rules:

  1. Columns a, b, c will have 'decimal numbers' from set1
  2. Columns l, m, n will only have integers from set2
  3. Columns x, y, z will have 'decimal numbers' from set3

where df_data has fewer than 3 columns for each set, I would like the missing columns in df_fnal to be NA

So my result will be

> df_final
   a    b     c  l  m  n  x  y    z
1 NA 21.3 23.22 NA NA NA NA NA 23.2
2 NA 23.1 64.20 NA NA  2 NA NA 90.2
3 NA 22.2 46.20 NA NA  3 NA NA  9.1

I'm not sure of the best way of doing this; at the moment I'm considering using regex on each row, finding all 'decimal' nubmers before 'integers', then all the integers, then all the 'decimals' after integers, but at the moment this seems overly complex, and I'm hoping there's an easier method that I've overlooked?

tospig
  • 7,762
  • 14
  • 40
  • 79
  • How do you know which columns are missing? E.g., if there is no integer column between, how will you assign them? Also, you didn't set `l`, `m`, `n` as integer columns. Neither `col3` as an integer column. It should be `col3=c(NA_integer_,2L,3L)` – David Arenburg Aug 02 '15 at 11:23
  • @DavidArenburg fixed the integers. There will always be at least one integer column (but maybe with `NA`s) between the numeric decimal columns. – tospig Aug 02 '15 at 11:33

2 Answers2

2

This solution only relies on R being able to identify the integer columns in df_data. It might fail is one of those columns was not read as integer (e.g. if it's full of NAs).

nr <- nrow(df_data)

# Define rows corresponding to sets 1,2,3
set2 <- which(sapply(df_data, class) == "integer")
set1 <- 1:(min(set2)-1)
set3 <- (max(set2)+1):length(df_data)

# Build the three components of df_final
part1 <- cbind(matrix(NA_real_, nrow=nr, ncol=3-length(set1)), df_data[,set1])
part2 <- cbind(matrix(NA_integer_, nrow=nr, ncol=3-length(set2)), df_data[,set2])
part3 <- cbind(matrix(NA_integer_, nrow=nr, ncol=3-length(set3)), df_data[,set3])

# Put it together and save column names
df_final <- data.frame(part1, part2, part3)
colnames(df_final) <- c("a","b","c","l","m","n","x","y","z")

Result:

> df_final
   a    b     c  l  m  n  x  y    z
1 NA 21.3 23.22 NA NA NA NA NA 23.2
2 NA 23.1 64.20 NA NA  2 NA NA 90.2
3 NA 22.2 46.20 NA NA  3 NA NA  9.1
1

In my opinion, it makes most sense to preallocate df_final with NAs and then index-assign the columns from df_data. The only trick is determining which columns need to be assigned.

I see that you want to right-justify (so-to-speak) the columns within the column sets. So, the requirement amounts to what I would describe as "cumulative matching" of the reversed column types of df_data in the reversed column types of df_final. In other words, you need to proceed from right-to-left through the column types of df_data and df_final and find the next (from the rightwards direction) match.

I am aware of various non-cumulative/cumulative pairs of functions in R, such as sum()/cumsum(), prod()/cumprod(), min()/cummin(), and max()/cummax() (actually I think those are the only ones), however there does not seem to be any kind of "cumulative match" function. So I wrote my own:

cummatch <- function(small,big) {
    cur <- 1L;
    res <- integer();
    biglen <- length(big);
    for (s in small) {
        if (cur > biglen) break;
        rescur <- match(s,big[cur:biglen])+cur-1L;
        if (is.na(rescur)) break;
        res[length(res)+1L] <- rescur;
        cur <- rescur+1L;
    };
    length(res) <- length(small);
    return(res);
};

And now we can use it to get the column indexes to assign:

cis <- ncol(df_final)+1L-rev(cummatch(rev(sapply(df_data,typeof)),rev(sapply(df_final,typeof))));
cis;
## [1] 2 3 6 9
df_final[nrow(df_data),1] <- NA; ## preallocate rows of NA
df_final;
##    a  b  c  l  m  n  x  y  z
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA
df_final[cis] <- df_data;
df_final;
##    a    b     c  l  m  n  x  y    z
## 1 NA 21.3 23.22 NA NA NA NA NA 23.2
## 2 NA 23.1 64.20 NA NA  2 NA NA 90.2
## 3 NA 22.2 46.20 NA NA  3 NA NA  9.1

From a performance perspective, my cummatch() function probably sucks, considering all the R-level looping and function calling (e.g. repeated invocations of match() on subvectors of big). I've been playing around with Rcpp lately, and so decided to try to write a more performant version in Rcpp. I referred to how can I handle vectors without knowing the type in Rcpp to try to figure out how to write a vector-type-agnostic function, and the solution is a bit hacky, involving a C++ template function with a wrapper function that switches on the TYPEOF() the vector, and thus has to basically instantiate a separate call to the function for each case within the switch. My function takes two vector arguments, so the RCPP_RETURN_VECTOR() macro isn't actually sufficient for it, but since both vectors must be the same type (for matching), I was able to massage the macros to work with two arguments rather than one. This involved applying the R type promotion rules manually in one of the macros, which I'm pretty sure I got right. Needless to say, this is probably reaching (or surpassing) the limits of what is reasonable to do with Rcpp. Anyway, here it is:

cppFunction('

    using namespace Rcpp;

    #define ___RCPP_HANDLE_CASE___2( ___RTYPE___ , ___FUN___ , ___OBJECT___1 , ___OBJECT___2 , ___RCPPTYPE___ ) \\
        case ___RTYPE___ : \\
            return ___FUN___( ::Rcpp::___RCPPTYPE___< ___RTYPE___ >( ___OBJECT___1 ), ::Rcpp::___RCPPTYPE___< ___RTYPE___ >( ___OBJECT___2 ) ) ;

    #define ___RCPP_RETURN___2( __FUN__, __SEXP__1 , __SEXP__2, __RCPPTYPE__ ) \\
        SEXP __TMP__1 = __SEXP__1 ; \\
        SEXP __TMP__2 = __SEXP__2 ; \\
        unsigned int __TMP__1_TYPE = TYPEOF( __TMP__1 ); \\
        unsigned int __TMP__2_TYPE = TYPEOF( __TMP__2 ); \\
        unsigned int __TMP__TYPE = __TMP__1_TYPE == RAWSXP ? __TMP__2_TYPE : __TMP__2_TYPE == RAWSXP ? __TMP__1_TYPE : std::max(__TMP__1_TYPE,__TMP__2_TYPE); /* note: the SEXPTYPE enumeration order *almost* aligns with the R type promotion rules; only raw is out-of-order, so we can test for that first, then use std::max() */ \\
        if (__TMP__1_TYPE < LGLSXP || __TMP__2_TYPE < LGLSXP) __TMP__TYPE = 0; \\
        switch( __TMP__TYPE ) { \\
            ___RCPP_HANDLE_CASE___2( INTSXP  , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            ___RCPP_HANDLE_CASE___2( REALSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            ___RCPP_HANDLE_CASE___2( RAWSXP  , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            ___RCPP_HANDLE_CASE___2( LGLSXP  , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            ___RCPP_HANDLE_CASE___2( CPLXSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            ___RCPP_HANDLE_CASE___2( STRSXP  , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) \\
            /* no == for generic ___RCPP_HANDLE_CASE___2( VECSXP  , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) */ \\
            /* no == for expression ___RCPP_HANDLE_CASE___2( EXPRSXP , __FUN__ , __TMP__1 , __TMP__2 , __RCPPTYPE__ ) */ \\
        default: \\
            throw std::range_error( "not a vector" ) ; \\
        }

    #define RCPP_RETURN_VECTOR2( _FUN_, _SEXP_1, _SEXP_2 )  ___RCPP_RETURN___2( _FUN_, _SEXP_1, _SEXP_2, Vector )
    #define RCPP_RETURN_MATRIX2( _FUN_, _SEXP_1, _SEXP_2 )  ___RCPP_RETURN___2( _FUN_, _SEXP_1, _SEXP_2, Matrix )

    template<typename T> IntegerVector cummatch_impl(T small, T big ) {
        int smalllen = LENGTH(small);
        IntegerVector res(smalllen,NA_INTEGER);
        int cur = 0;
        int biglen = LENGTH(big);
        for (int si = 0; si < smalllen; ++si) {
            int rescur = NA_INTEGER;
            for (int bi = cur; bi < biglen; ++bi) {
                if (small(si) == big(bi)) {
                    rescur = bi;
                    break;
                }
            }
            if (rescur == NA_INTEGER) break;
            res(si) = rescur+1;
            cur = rescur+1;
        }
        return res;
    }

    // [[Rcpp::export]]
    IntegerVector cummatch(SEXP small, SEXP big ) { RCPP_RETURN_VECTOR2(cummatch_impl,small,big); }

');
Community
  • 1
  • 1
bgoldst
  • 34,190
  • 6
  • 38
  • 64