matrix index matching for large raster data

Question

I have a large raster data (X) with a dimension of 32251*51333. The values of X are repetitions of another array (Y), which has a size of 3*10^6. Now I want to change the values of X by matching it against each value of Y, for example I can program like this,

for (i in 1:length(Y)){
 X[X==Y[i]] = Z[i]   #Z is just another array with the same size as Y
}

The problem is that, first the index matching X[X==Y[i]] = Z[i] does not work because X is too large. After a few minutes the program just stops by giving an error "Error: cannot allocate vector of size 6.2 Gb". Second, going over the loops from 1 to length(Y), even though Y is of size 10^6, may take "forever" to complete.

One approach came to my mind is to split X into small chunks and then do the index match for each chunk. But I feel this would still take a lot of time.

Is there a better way to achieve the above goal?

1st Update:

Thanks to the example provided by @Lyngbakr, I will elaborate this question further. Because the raster I'm working with is very large (32251*51333), it seems not possible to upload it. The example given by @Lyngbakr is very similar to what I want, except that the raster created is too small. Now following the example, I ran two tests by generating a much larger raster with dimension of 3000*2700. See code below.

#Method 1: Use subs
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
df <- data.frame(Y, Z)
X <- subs(X, df)
end_time <- Sys.time()
end_time - start_time
#Time difference of 2.248908 mins

#Method 2: Use for loop
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
for (i in 1:length(Y)){
  X[X==Y[i]]=Z[i] #this indexing of R seems not efficient if X becomes large
}
end_time <- Sys.time()
end_time - start_time
#Time difference of 10.22717 secs

As you can see, a simple for loop is even more efficient than the subs function. Remember, the raster shown in the example is still smaller than what I work with (about an order of 100 smaller). Also, the array Y in the example is very small. Now the question could be, how to speed up the Method 2, which is just a simple for loop?

We need a small example of each of these raster objects. First load any needed packages with `library` calls, then make small examples and state what is expected as a result. — IRTFM, Jun 25 '19 at 20:43

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

You're looking for the subs function. I don't know if it works with large rasters, but here's how you'd try.

I load the raster package and create some dummy data. (It would be really helpful if you provide data in your question.) Then, I plot the results.

# Load library
library(raster)
#> Loading required package: sp

# Z holds values that will replace Y
Y <- 1:9
Z <- 91:99

# Create dummy raster
X <- raster(matrix(rep(Y, 3), ncol = 9))

# Examine raster
plot(X)

As you can see, X is just a bunch of Y vectors patched together. Next, I bind Y and Z together into a data frame df.

# Combine y & z into a data frame
df <- data.frame(Y, Z)

Finally, I use subs to replace Y values with Z values.

# Substitute Z for Y in X
X <- subs(X, df)

A quick look at the raster shows that the values have been replaced correctly.

# Examine raster
plot(X)

^{Created on 2019-06-25 by the reprex package (v0.2.1.9000)}

Update

Rcppis really helpful when performance is an issue. Below, I compare three methods:

Looping in R (from the question)
Using subs from the raster package
Looping in C++ using Rcpp

By the way, Sys.time() isn't a great way to examine performance, so I'd recommend microbenchmark instead.

# Load library
library(raster)

# Define vectors and raster
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow = 3000, ncol = 2700))

method_1 is the subs function.

# Using subs function
method_1 <- function(){
  df <- data.frame(Y, Z)
  X <- subs(X, df)
}

method_2 is your original looping approach.

# Using R loop
method_2 <- function(){
  for (i in 1:length(Y)){
    X[X==Y[i]]=Z[i] 
  }
  X
}

method_3 is the looping approach implemented in C++.

# Using Rcpp loops
src <-
"Rcpp::NumericMatrix subs_cpp(Rcpp::NumericMatrix X, Rcpp::NumericVector Y, Rcpp::NumericVector Z){
  for(int i = 0; i < Y.length(); ++i){
    for(int j = 0; j < X.ncol(); ++j){
      for(int k = 0; k < X.nrow(); ++k){
        if(X(k, j) == Y(i)){
          X(k, j) = Z(i);
        }
      }
    }
  }  

  return X;
}"

Rcpp::cppFunction(src)

method_3 <- function(){
  subs_cpp(as.matrix(X), Y, Z)
}

And here I benchmark the approaches.

# Run benchmarking
microbenchmark::microbenchmark(method_1(), method_2(), method_3(), times = 10)

# Unit: milliseconds
#       expr        min         lq       mean     median         uq       max neval
# method_1() 16861.5447 17737.2124 19321.5674 18628.8573 20117.0159 25506.208    10
# method_2()   671.2223   677.6029  1111.3935   738.6216  1657.0542  2163.137    10
# method_3()   316.9810   319.1484   481.3548   320.2337   326.7133  1477.454    10

As you can see, the Rcpp approach is by far the fastest.

You can also compare the output to ensure they produce the same result using a smaller raster.

# Examine all three outputs with smaller raster
X <- raster(matrix(rep(Y, 3), ncol = 9))

plot(method_1(), main = "Method 1")
plot(method_2(), main = "Method 2")
plot(raster(method_3()), main = "Method 3") # Needs to converted into a raster

And they all look alike. Note that for the third method, the result needs to be converted back to a raster from a matrix.

Yes. It was faster. But I was hoping to get an answer with an elegant algorithm that is simple, such as vectorization. I feel this straightforward problem should have some shortcut solutions. — uPhone, Jul 06 '19 at 18:15

matrix index matching for large raster data

1st Update:

1 Answers1

Update