I'm trying to parse JSON from a huge dataset with 20: As I append data over and over, the process takes a lot of time, and the time seems to increase exponentially with the number of rows. Hence, I thought of dividing the data into chunks and processing it chunk by chunk. The inner loop works fine, but I can't append across the chunks.

In addition, ideally I'd like to move the subsetting into chunks out of the inner foreach, but once I do that I get another error.

 chunk <- 1000
 n <- nrow(daily.db)
 chunkn<-ceiling(n/chunk)

db<-rbindlist(foreach(i = 1:length(chunkn)) %:%
        rbindlist(foreach(j=1:nrow(subset.db)) %dopar% {
            subset.db<-daily.db[((i-1)*1000+1):min(((i-1)*1000+1)+999,length(daily.db$filter))]
            json1<-jsonlite::fromJSON(txt =subset.db$filter[j])
            .db<-as.data.table(t(unlist(json1)))
            .db},fill=TRUE)
        ,fill = TRUE)
letmetype

1 Answer

It seems that a better practice is to wrap the inner loop in a function:

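library(data.table)
library(foreach)

# Parse each JSON string in subset.db$filter into a one-row data.table and bind the rows.
parralelparsing <- function(subset.db) {
  rbindlist(foreach(j = seq_len(nrow(subset.db)), .packages = "data.table") %dopar% {
    json1 <- jsonlite::fromJSON(txt = subset.db$filter[j])
    as.data.table(t(unlist(json1)))  # flatten nested fields into columns
  }, fill = TRUE)
}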
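
Note that %dopar% only runs in parallel once a backend has been registered; without one, foreach falls back to sequential execution and issues a warning. A minimal sketch of registering a backend, assuming the doParallel package is installed (the worker count of 2 and the toy loop are arbitrary choices for illustration, not part of the original code):

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)        # two workers; an arbitrary choice for this sketch
registerDoParallel(cl)      # %dopar% now dispatches to this cluster

# Toy loop standing in for the JSON parsing: square each number on a worker.
squares <- foreach(j = 1:4, .combine = c) %dopar% j^2

stopCluster(cl)             # release the workers when done
```

With a cluster backend like this one, each worker is a fresh R session, so packages used inside the loop body have to be made available there, for example through foreach's `.packages` argument.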

Then loop over the data in chunks, applying this function to each chunk:

chunk <- 10000                # rows per chunk
n <- nrow(daily.db)
chunkn <- ceiling(n / chunk)  # number of chunks
db <- NULL

for (i in seq_len(chunkn)) {
  # Rows of the i-th chunk; the last chunk may be shorter.
  .subset.db <- daily.db[((i - 1) * chunk + 1):min(i * chunk, n)]
  .db <- parralelparsing(.subset.db)
  db <- rbindlist(list(db, .db), fill = TRUE)
}
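
The loop above still appends to db on every iteration, which recopies the accumulated table each time and may get slow as db grows. One possible variation is to keep the per-chunk results in a list and bind them once at the end. The sketch below is only an illustration under assumptions: the toy daily.db, the helper name parse_one, and the chunk size of 2 are made up for the demo, and a serial lapply stands in for the parallel foreach so that it runs without a registered backend.

```r
library(data.table)
library(jsonlite)

# Toy stand-in for daily.db: one JSON string per row in the `filter` column.
daily.db <- data.table(filter = c(
  '{"a": "1", "b": {"c": "x"}}',
  '{"a": "2"}',
  '{"b": {"c": "y"}, "d": "yes"}',
  '{"a": "4", "d": "no"}',
  '{"a": "5"}'
))

# Hypothetical helper: parse one JSON string into a one-row data.table,
# using the same flattening as the answer's inner loop.
parse_one <- function(txt) {
  as.data.table(t(unlist(jsonlite::fromJSON(txt))))
}

chunk <- 2L                  # deliberately tiny for the demo
n <- nrow(daily.db)

# Group row indices into consecutive chunks of at most `chunk` rows.
idx <- split(seq_len(n), ceiling(seq_len(n) / chunk))

# Parse each chunk, keeping the results in a list instead of appending to db.
parts <- lapply(idx, function(rows) {
  rbindlist(lapply(daily.db$filter[rows], parse_one), fill = TRUE)
})

# Bind all chunk results in a single call.
db <- rbindlist(parts, fill = TRUE)
```

In the real setting, the inner lapply could be replaced by a call to parralelparsing on the corresponding rows of daily.db; the only change relative to the loop above is where the final rbindlist happens.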
letmetype