
I have thousands of very small json files in a directory.

Right now, I am using the following code to load them:

library(dplyr)
library(jsonlite)
library(purrr)

# list every file in the directory tree, then parse each one
filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)
filelistjson %>% map(~ fromJSON(file(.x)))

Unfortunately, this is extremely slow (I also tried furrr::future_map). I wonder if there is a better approach here; the individual files are barely 25 KB in size...

The files look like the following, with a couple of nested variables but nothing too complicated:

{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}

Thanks!

ℕʘʘḆḽḘ

1 Answer


There are several JSON libraries in R. Here are benchmarks for three of the libraries:

txt <- '{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}'

microbenchmark::microbenchmark(
  jsonlite={
    jsonlite::fromJSON(txt)
  },
  RJSONIO={
    RJSONIO::fromJSON(txt)
  },
  rjson={
    rjson::fromJSON(txt)
  }
)

# Unit: microseconds
#     expr     min       lq      mean  median      uq     max neval cld
# jsonlite 144.047 153.3455 173.92028 167.021 172.491 456.935   100   c
#  RJSONIO 113.049 120.3420 134.94045 128.365 132.742 287.727   100  b 
#    rjson  10.211  12.4000  17.10741  17.140  18.234  59.807   100 a 

As you can see, rjson appears to be the fastest here (though treat these results with caution: they measure parsing a single small string already in memory). Personally, I like working with RJSONIO because, in my experience, it is the library that best preserves the original format when reading, modifying, and serializing JSON.
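
If you go with rjson, note that rjson::fromJSON also accepts a file argument, so it can read each file directly from disk. A minimal sketch over your file list (assuming DATA_DIRECTORY is defined as in the question; filtering on the .json extension is optional):

library(rjson)

filelistjson <- list.files(DATA_DIRECTORY, pattern = "\\.json$",
                           full.names = TRUE, recursive = TRUE)
# parse each file straight from disk
parsed <- lapply(filelistjson, function(f) rjson::fromJSON(file = f))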

Finally, if you know the (invariant) structure of your files, you can always build a custom JSON reader and perhaps gain more speed. But as @Gregor points out, you should first make sure the latency is really due to parsing rather than to reading thousands of small files from disk.
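
A quick way to check, sketched below: slurp every file into memory first, then time the parsing step by itself. If the first step dominates, the parser is not your problem (variable names are illustrative):

# time the disk reads alone
system.time(
  raw <- vapply(filelistjson, function(f) readChar(f, file.size(f)),
                character(1))
)
# time the parsing alone, on the in-memory strings
system.time(parsed <- lapply(raw, rjson::fromJSON))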

niko
  • Very interesting. Would they all work with a `file` argument? Here you are loading a string. – ℕʘʘḆḽḘ Apr 02 '19 at 13:07
  • I'm surprised by the magnitude of the difference, but when the times are in microseconds and OP has "thousands" of these files (not millions), that still doesn't add up to very much time. Very likely reading from disk is the bottleneck, not parsing the JSON. – Gregor Thomas Apr 02 '19 at 13:08
  • The different methods act similarly with the `txt` argument: it can be a local file, a file on the web, etc. However, the results are stored differently: `jsonlite` works a lot with data frames, whereas `RJSONIO` works more with lists. @Gregor Yes that makes sense. – niko Apr 02 '19 at 13:12
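
A quick illustration of the structural difference niko describes, inspecting the nested field of the txt example from the answer:

# jsonlite simplifies the array of objects into a data frame by default
str(jsonlite::fromJSON(txt)$othernested)

# RJSONIO keeps it as a list of lists
str(RJSONIO::fromJSON(txt)$othernested)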