I am trying to create summary statistics by combining two different spatial datasets: a large raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.
Since the raster is too big to process at once, I create subtasks and process them in parallel, i.e. one task per polygon of the SpatialPolygonsDataFrame.
The code works fine, but after around 100 iterations I run into memory problems. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing.
library("foreach")
library("doSNOW")
# create a SOCK cluster with three workers for the current R session
cluster = makeCluster(3, type = "SOCK",outfile="")
registerDoSNOW(cluster)
getDoParWorkers()# check if it worked
# load base data
r.terra.2008<-raster("~/terra.tif")
spodf.malha.2007<-readOGR("~/","composed")
# bring both datasets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007<-spTransform(spodf.malha.2007,CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008)==proj4string(spodf.malha.2007) # should be TRUE
# create a function that extracts the raster values within a polygon and tabulates them
function.landcover.sum<-function(r.landuse,spodf.pol){
  return(table(extract(r.landuse,spodf.pol)))
}
# apply it to one polygon to see if it works
function.landcover.sum(r.terra.2008,spodf.malha.2007[1,])
## parallel loop
# define package(s) to be used in the parallel loop
l.packages<-c("raster","sp")
# try a parallel loop for the first 6 polygons
l.results<-foreach(i=1:6,
.packages = l.packages) %dopar% {
print(paste("Processing Polygon ",i, ".",sep=""))
return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
}
Here the output is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
3 5 6 7 9 10 11 12
2199 1327 8840 8579 194437 1061 1073 1834
14 16
222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
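A rough check of the combined size of these six results (just a sketch; it assumes l.results from the test run above is still in the workspace):
# print the size of the list of per-polygon tables
print(object.size(l.results), units = "Kb")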
So the result is rather small and should not be the source of the memory-allocation problem. However, the following loop over the whole polygon dataset, which has more than 32,000 rows, exceeds 8 GB of memory after around 100 iterations.
# apply the parallel loop on the whole dataset
l.results<-foreach(i=1:nrow(spodf.malha.2007),
.packages = l.packages) %dopar% {
print(paste("Processing Polygon ",i, ".",sep=""))
return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
# gc(reset=TRUE) # does not resolve the problem
# closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
Edit: I tried (as suggested in the comments) to remove the objects after each iteration inside the loop, but it did not resolve the problem. I furthermore tried to rule out possible problems from repeated data imports by exporting the objects to the workers up front:
clusterExport(cl = cluster,
varlist = c("r.terra.2008","function.landcover.sum","spodf.malha.2007"))
without major changes. My R version is 3.4 on a Linux platform, so the patch from the link in the first comment should already be included in this version. I also tried the parallel package as suggested in the first comment, but it made no difference.
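For completeness, the per-iteration cleanup mentioned above looked roughly like this (a sketch; tab.i is just an illustrative name for the temporary result, and the cleanup has to run before the value is returned):
l.results<-foreach(i=1:nrow(spodf.malha.2007),
                   .packages = l.packages) %dopar% {
  # compute the per-polygon table, clean up, then return it
  tab.i<-function.landcover.sum(r.terra.2008,spodf.malha.2007[i,])
  # drop other temporaries in the worker frame and force a garbage collection
  rm(list=setdiff(ls(),"tab.i"))
  gc(reset=TRUE)
  tab.i
}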