
I am trying to create summary statistics by combining two different spatial datasets: a big raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.

Since the raster is too big to process at once, I create subtasks and process them in parallel, i.e. I process each polygon from the SpatialPolygonsDataFrame separately.

The code works fine, but after around 100 iterations I run into memory problems. Here is my code and what I intend to do:

# session setup
library("raster")
library("rgdal")

# multicore processing
library("foreach")
library("doSNOW")
# create a cluster with three workers for the current R session
cluster <- makeCluster(3, type = "SOCK", outfile = "")
registerDoSNOW(cluster)
getDoParWorkers() # check if it worked

# load base data
r.terra.2008 <- raster("~/terra.tif")
spodf.malha.2007 <- readOGR("~/", "composed")

# bring both data-sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007<-spTransform(spodf.malha.2007,CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008)==proj4string(spodf.malha.2007) # should be TRUE

# create a function to extract areas
function.landcover.sum <- function(r.landuse, spodf.pol) {
  return(table(extract(r.landuse, spodf.pol)))
}

# apply it to one subset to see if it is working
function.landcover.sum(r.terra.2008,spodf.malha.2007[1,])

## parallel loop
# define package(s) to be use in the parallel loop
l.packages<-c("raster","sp")

# try a parallel loop for the first 6 polygons
l.results<-foreach(i=1:6,
                   .packages = l.packages) %dopar% {
                     print(paste("Processing Polygon ",i, ".",sep=""))
                     return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
                     }

Here the output is a list that looks like this:

l.results

[[1]]

     9     10 
193159   2567 

[[2]]

   7    9   10   12   14   16 
  17  256 1084  494   67   15 

[[3]]

     3      5      6      7      9     10     11     12 
  2199   1327   8840   8579 194437   1061   1073   1834 
    14     16 
   222   1395 

[[4]]

     3      6      7      9     10     12     16 
   287    102    728 329057   1004   1057     31 

[[5]]

     3      5      6      7      9     12     16 
    21      6     20    495 184261   4765     28 

[[6]]

   6    7    9   10   12   14 
 161  161  386  943  205 1515 

So the result is rather small and should not be the source of the memory allocation problem. The following loop over the whole polygon dataset, which has >32,000 rows, then produces memory allocation that exceeds 8 GB after around 100 iterations.

# apply the parallel loop on the whole dataset
l.results<-foreach(i=1:nrow(spodf.malha.2007),
                   .packages = l.packages) %dopar% {
                     print(paste("Processing Polygon ",i, ".",sep=""))
                     return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
                     # gc(reset=TRUE) # does not resolve the problem
                     # closeAllConnections()  # does not resolve the problem
                   }

What am I doing wrong?

edit: As suggested in the comments, I tried to remove the per-iteration object inside the loop and to garbage collect on each worker, but it did not resolve the problem. The attempt looked roughly like this sketch (illustrative only; t.pol and t.result are just local names):
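# sketch of the cleanup attempt: keep only the result, drop the
# per-iteration object and garbage collect before returning
l.results <- foreach(i = 1:nrow(spodf.malha.2007),
                     .packages = l.packages) %dopar% {
                       t.pol <- spodf.malha.2007[i,]
                       t.result <- function.landcover.sum(r.terra.2008, t.pol)
                       rm(t.pol)          # remove the temporary object ...
                       gc(reset = TRUE)   # ... and free memory on the worker
                       t.result
                     }

I furthermore tried to resolve possible problems of repeated data imports by exporting the objects to the workers up front: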

clusterExport(cl = cluster,
              varlist = c("r.terra.2008","function.landcover.sum","spodf.malha.2007"))

without major changes. My R version is 3.4 on a Linux platform, so supposedly the patch from the link in the first comment should already be included in this version. I also tried the parallel package as suggested in the first comment, but no difference appeared.
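For reference, the attempt with the parallel package looked roughly like this (a sketch; the exact calls I used may have differed):

library("parallel")

cluster <- makeCluster(3)
clusterExport(cl = cluster,
              varlist = c("r.terra.2008", "function.landcover.sum", "spodf.malha.2007"))
# load the required packages on each worker
invisible(clusterEvalQ(cluster, { library("raster"); library("sp") }))
l.results <- parLapply(cluster, 1:nrow(spodf.malha.2007), function(i) {
  function.landcover.sum(r.terra.2008, spodf.malha.2007[i,])
})
stopCluster(cluster)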

  • I can only provide some guesses/suggestions. Guess 1: `foreach` exports the data `r.terra.2008` and `spodf.malha.2007[i,]` to each core on every iteration. Suggestion 1: Try removing objects with `rm()` on each core and then garbage collecting with `gc()`. (I'm not sure if `return`ing a value will exit you from your `%dopar%` code block.) Suggestion 2: It looks like you're still using `doSNOW`, which I've heard is not well supported (hearsay). I would try a recent version of `library(parallel)`. Look at this link `https://github.com/HenrikBengtsson/Wishlist-for-R/issues/27` as well. – CPak Jun 12 '17 at 22:19
  • None of the above worked for me (unfortunately). See my edits above. – joaoal Jun 13 '17 at 09:58
  • This is actually related to the raster package. Here is some help if anyone has a related problem: https://stackoverflow.com/questions/25426405/raster-package-taking-all-hard-drive – joaoal Jul 25 '19 at 09:13
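Following up on that last link: since the raster package writes intermediate results to temporary files on disk, a possible mitigation (a sketch, not taken from the original thread; "~/raster_tmp" is an arbitrary path) is to point raster at a dedicated temp directory and clear it regularly:

library("raster")

# send raster's intermediate files to a dedicated directory
rasterOptions(tmpdir = "~/raster_tmp")

# remove temporary raster files older than 0 hours, i.e. all of them
removeTmpFiles(h = 0)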

1 Answer


You can try exact_extract from the exactextractr package. It is among the fastest and most memory-safe functions for extracting values from a raster. The main function is implemented in C++ and usually does not need parallelization. Since you do not provide any example data, I post an example with real data:

library(raster)
library(sf)
library(exactextractr)


# Pull municipal boundaries for Brazil
brazil <- st_as_sf(getData('GADM', country='BRA', level=2))

# Pull gridded precipitation data
prec <- getData('worldclim', var = 'prec', res = 10)

# transform the precipitation data into a dummy land use map
lu <- prec[[1]]
values(lu) <- sample(1:10, ncell(lu), replace = TRUE)
plot(lu)

# extract the land use class of each pixel inside each polygon
ex <- exact_extract(lu, brazil)

# apply table() to the resulting list; just the first 5 elements to avoid long output.
# Note the x[, 1]: by default exact_extract returns, in the second column,
# the coverage fraction of each pixel within each polygon
lapply(ex[1:5], function(x) {
  table(x[, 1])
})

Here is the example output:

[[1]]

 1  2  4  6  7  9 10 
 1  1  1  2  3  1  1 

[[2]]

 2  3  4  5  6  7  8 10 
 2  4  3  2  1  2  2  2 

[[3]]

 1  2  4  6  7  8  9 10 
 4  5  1  1  4  2  5  5 

[[4]]

 1  2  3  4  5  6  7  8  9 10 
 2  2  4  2  2  4  1  4  1  2 

[[5]]

 3  4  5  6  8 10 
 2  3  1  1  2  3 
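
If keeping all the per-pixel data frames in memory at once is a concern, exact_extract also accepts a summary function so each polygon is reduced on the fly; a minimal sketch (for a single-layer raster the function receives the cell values and their coverage fractions):

# summarize on the fly: only the resulting tables are kept in memory
ex.tables <- exact_extract(lu, brazil, function(values, coverage) table(values))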