
I have thousands of very small json files in a directory.

Right now, I am using the following code to load them:

library(dplyr)
library(jsonlite)
library(purrr)

# list every file in the directory tree, then parse each one
filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)
filelistjson %>% map(~ fromJSON(file(.x)))

Unfortunately, this is extremely slow (I also tried furrr::future_map). I wonder if there is a better approach here; the individual files are barely 25 KB in size...

The files look like the following, with a couple of nested variables but nothing too complicated:

{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}

Thanks!

ℕʘʘḆḽḘ

1 Answer


There are several JSON libraries in R. Here are benchmarks for three of the libraries:

txt <- '{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}'

microbenchmark::microbenchmark(
  jsonlite={
    jsonlite::fromJSON(txt)
  },
  RJSONIO={
    RJSONIO::fromJSON(txt)
  },
  rjson={
    rjson::fromJSON(txt)
  }
)

# Unit: microseconds
#     expr     min       lq      mean  median      uq     max neval cld
# jsonlite 144.047 153.3455 173.92028 167.021 172.491 456.935   100   c
#  RJSONIO 113.049 120.3420 134.94045 128.365 132.742 287.727   100  b 
#    rjson  10.211  12.4000  17.10741  17.140  18.234  59.807   100 a 

As you can see, rjson appears to be the fastest here (though treat these results with caution: they measure parsing a single small string already in memory). Personally, I like working with RJSONIO because, in my experience, it is the library that best preserves the original format when reading, modifying, and serializing JSON.
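
If you go with rjson, note that rjson::fromJSON also accepts a file argument, so it can read each file directly from disk. A minimal sketch over your file list (assuming DATA_DIRECTORY is defined as in the question; filtering on the .json extension is optional):

library(rjson)

filelistjson <- list.files(DATA_DIRECTORY, pattern = "\\.json$",
                           full.names = TRUE, recursive = TRUE)
# parse each file straight from disk
parsed <- lapply(filelistjson, function(f) rjson::fromJSON(file = f))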

Finally, if you know the (invariant) structure of your files, you can always build a custom JSON reader and perhaps gain more speed. But as @Gregor points out, you should first make sure the latency is really due to parsing rather than to reading thousands of small files from disk.
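
A quick way to check, sketched below: slurp every file into memory first, then time the parsing step by itself. If the first step dominates, the parser is not your problem (variable names are illustrative):

# time the disk reads alone
system.time(
  raw <- vapply(filelistjson, function(f) readChar(f, file.size(f)),
                character(1))
)
# time the parsing alone, on the in-memory strings
system.time(parsed <- lapply(raw, rjson::fromJSON))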

niko
  • Very interesting. Would they all work with a `file` argument? Here you are loading a string. – ℕʘʘḆḽḘ Apr 02 '19 at 13:07
  • I'm surprised by the magnitude of the difference, but when the times are in microseconds and OP has "thousands" of these files (not millions), that still doesn't add up to very much time. Very likely reading from disk is the bottleneck, not parsing the JSON. – Gregor Thomas Apr 02 '19 at 13:08
  • The different methods act similarly with the `txt` argument: it can be a local file, a file on the web, etc. However, the results are stored differently: `jsonlite` works a lot with data frames, whereas `RJSONIO` works more with lists. @Gregor Yes that makes sense. – niko Apr 02 '19 at 13:12
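
A quick illustration of the structural difference niko describes, inspecting the nested field of the txt example from the answer:

# jsonlite simplifies the array of objects into a data frame by default
str(jsonlite::fromJSON(txt)$othernested)

# RJSONIO keeps it as a list of lists
str(RJSONIO::fromJSON(txt)$othernested)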