0

I know there were a lot of answers already here to shift the non-NA values to the left, rowwise. But all of these will take me forever to do this. Is there a fastest way to perform this task? Example:

#from
X1 X2 X3 X4 X5 X6 X7
NA NA AB NA AD AE AF
NA NA NA AG NA AI AJ
NA AK AL AM NA AO AP
NA NA AQ NA AS AT NA
AV AW AX AY AZ NA BB

#to
X1 X2 X3 X4 X5 X6 X7
AB AD AE AF NA NA NA
AG AI AJ NA NA NA NA 
AK AL AM AO AP NA NA
AQ AS AT AU NA NA NA
AV AW AX AY AZ BB NA

Using apply and/or for loops take a lot of time. For context, I have a dataframe with 340K rows and 67 columns and it will take me 18+ hours to do the job if I ran the following:

    for (i in 1:nrow(df)) {
      Temp <- unlist(df[i,])
      ndf[i,] <- t(c(Temp[!is.na(Temp)],Temp[is.na(Temp)]))
    }

Other suggested solutions in the other posts seems to be similar to this one, so I would also expect to take a long time.

I've also tried following code:

ndf <- na_move(df) #from package: dedupewider

But it seems that the it hasn't done the job for the last 3 columns, as follows:

#to
X1 X2 X3 X4 X5 X6 X7
AB NA NA NA AD AE AF
AG NA NA NA NA AI AJ
AK AL AM NA NA AO AP
AQ NA NA NA AS AT NA
AV AW AX AY AZ NA BB

Hoping for a solution for this. Thank you very much!

mcmalicsi
  • 11
  • 1
  • Can you provide a small matrix and what you expect the output of the operation to be for that matrix? – Mikael Jagan Dec 30 '21 at 04:01
  • Hi Mikael, here. But I don't need to sort the non-NA values either. Thanks! https://stackoverflow.com/questions/26651606/how-to-move-cells-with-a-value-row-wise-to-the-left-in-a-dataframe – mcmalicsi Dec 30 '21 at 04:03
  • Is your matrix also a character matrix? Or is it numeric? – Mikael Jagan Dec 30 '21 at 04:28
  • It's a character matrix. – mcmalicsi Dec 30 '21 at 04:28
  • 1
    Instead of providing an example, you just link to another question that has 5 answers? And that question is marked as a duplicate of another question that has 7 answers? That's frowned upon. Have you tested all 12 available answers? 340k rows and 67 columns isn't **so** big... I would expect it to run in minutes, not 18 hours. If you want to optimize code for data of your class and size, I'd strongly recommend sharing code to simulate appropriately complex sample data for benchmarks to be meaningful. Otherwise I think this question will just get closed as a duplicate. – Gregor Thomas Dec 30 '21 at 04:43
  • 1
    Also, you say you have a `matrix`, but the linked example is a data frame. Which is it? (I would expect row-wise operations on a matrix to be faster...) – Gregor Thomas Dec 30 '21 at 04:46
  • Apologies. I'm relatively just a newbie here in R and the community. I just edited the post for more context. Thank you very much! – mcmalicsi Dec 30 '21 at 05:30
  • @mcmalicsi the question is closed so I can't really add another answer but you were on the right track with na_move if you have a dataframe not a matrix you just forgot the cols argument i.e. `cols = c("X1", "X2", "X3", "X4", "X5", "X6", "X7")` it is very fast and respects datatype I tested on my old Mac 350,000 rows your 7 columns of interest plus some irrelevants and it benched at median = 738 milliseconds – Chuck P Jul 22 '22 at 12:15

1 Answers1

3

Here is an Rcpp implementation of your exact task: given a character matrix x, the function shift_na returns a sorted matrix y such that

identical(y[i, ], x[i, order(is.na(x[i, ]))])

is TRUE for all i. On my machine, it sorts a 340000-by-67 character matrix in around 0.3 seconds. See below.

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void shift_na_in_place(CharacterMatrix x)
{
  int m = x.nrow();
  int n = x.ncol();
  for (int i = 0, k = 0, k0 = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      if (x[k] != NA_STRING) {
        x[k0] = x[k];
        k0 += m;
      }
      k += m;
    }
    while (k0 < k) {
      x[k0] = NA_STRING;
      k0 += m;
    }
    k = (k % m) + 1;
    k0 = k;
  }
  if (x.attr("dimnames") != R_NilValue) {
    List dn = x.attr("dimnames");
    dn[1] = R_NilValue;
    if (dn.attr("names") != R_NilValue) {
      CharacterVector ndn = dn.attr("names");
      ndn[1] = "";
    }
  }
}

// [[Rcpp::export]]
CharacterMatrix shift_na(CharacterMatrix x)
{
  CharacterMatrix y = clone(x);
  shift_na_in_place(y);
  return y;
}
')

Test for correctness with a 6-by-6 matrix:

f <- function(d) {
  x <- sample(c(letters, NA), size = prod(d), replace = TRUE, prob = c(rep(1, 26), 13))
  dim(x) <- d
  x
}
set.seed(1L)
x <- f(c(6L, 6L))
x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA   "z"  "d"  "p"  NA   "h" 
[2,] "p"  "o"  "p"  "t"  "e"  "m" 
[3,] "l"  "n"  "t"  "z"  NA   "i" 
[4,] "y"  NA   "i"  NA   "p"  NA  
[5,] NA   NA   "q"  "o"  "w"  "v" 
[6,] "y"  NA   "a"  NA   "c"  "d"
shift_na(x)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "z"  "d"  "p"  "h"  NA   NA  
[2,] "p"  "o"  "p"  "t"  "e"  "m" 
[3,] "l"  "n"  "t"  "z"  "i"  NA  
[4,] "y"  "i"  "p"  NA   NA   NA  
[5,] "q"  "o"  "w"  "v"  NA   NA  
[6,] "y"  "a"  "c"  "d"  NA   NA 

Benchmark with a 340000-by-67 matrix:

x <- f(c(340000L, 67L))
microbenchmark::microbenchmark(shift_na(x))
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 shift_na(x) 258.4182 263.9208 296.4804 287.7001 318.1688 366.1472   100

You can use shift_na_in_place if you can't afford to allocate memory for a sorted matrix and don't need to preserve the unsorted matrix.

Edit: If you are starting with a data frame data containing character variables, rather than a character matrix, then do this:

x <- as.matrix(data)
shift_na_in_place(x)
newdata <- as.data.frame(x)
Mikael Jagan
  • 9,012
  • 2
  • 17
  • 48