R is very slow reading in .jsonl files

Question

I need to read .jsonl files into R, and it's going very slowly. A file with 67,000 lines took over 10 minutes to load. Here's my code:

library(dplyr)
library(tidyr)
library(rjson)

f<-data.frame(Reduce(rbind, lapply(readLines("filename.jsonl"),fromJSON)))
f2<-f%>%
  unnest(cols = names(f))

Here's a sample of the .jsonl file

{"UID": "a1", "str1": "Who should win?", "str2": "Who should we win?", "length1": 3, "length2": 4, "prob1": -110.5, "prob2": -108.7}
{"UID": "a2", "str1": "What had she walked through?", "str2": "What had it walked through?", "length1": 5, "length2": 5, "prob1": -154.6, "prob2": -154.8}

So my questions are: (1) Why is this taking so long to run, and (2) How do I fix it?

score 3 · Accepted Answer · answered Oct 21 '19 at 15:19

I think the most efficient way to read JSON Lines files is the stream_in() function from the jsonlite package. stream_in() requires a connection as input, but you can wrap it in a small function that opens a normal text file, reads it, and closes the connection again:

read_json_lines <- function(file){
  con <- file(file, open = "r")
  on.exit(close(con))
  jsonlite::stream_in(con, verbose = FALSE)
}
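
As a quick check of the helper above, here is a minimal sketch that writes the two sample records from the question to a temporary file and reads them back. The temporary file and the variable names tmp and df are only for illustration; the helper is repeated so the snippet runs on its own.

```r
library(jsonlite)

# Same helper as in the answer: open the file, stream it in, close the connection.
read_json_lines <- function(file){
  con <- file(file, open = "r")
  on.exit(close(con))
  jsonlite::stream_in(con, verbose = FALSE)
}

# Hypothetical temporary file holding the two sample records from the question.
tmp <- tempfile(fileext = ".jsonl")
writeLines(c(
  '{"UID": "a1", "str1": "Who should win?", "str2": "Who should we win?", "length1": 3, "length2": 4, "prob1": -110.5, "prob2": -108.7}',
  '{"UID": "a2", "str1": "What had she walked through?", "str2": "What had it walked through?", "length1": 5, "length2": 5, "prob1": -154.6, "prob2": -154.8}'
), tmp)

# One row per line of the file, one column per JSON key.
df <- read_json_lines(tmp)
str(df)

unlink(tmp)
```

Because stream_in() already returns a data frame with one column per key, the Reduce(rbind, ...) and unnest() steps from the question should not be needed for flat records like these.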

NaN · Answer 2 · 2020-12-09T14:01:22.383

You can also check out ndjson. It is a wrapper around Niels Lohmann's very convenient C++ JSON library. The interface is similar to jsonlite:

df <- ndjson::stream_in('huge_file.jsonl')

Alternatively, you can parallelize the work. How much this helps depends on your specific setup (e.g., CPU, disk, file), but it is worth a try. I often work with BigQuery dumps, and for bigger tables the output is split across several files. That makes it possible to parallelize at the file level: read and parse multiple files in parallel, then merge the outputs:

library(furrr)

# my machine has more than 30 cores and a fairly fast SSD,
# so I run 20 workers in parallel
plan(multisession, workers = 20)

df <- future_map_dfr(
   # list all my JSON Lines files
   # (note: `pattern` is a regular expression, not a shell glob)
   list.files(path = "../data/panel", pattern = "00*", full.names = TRUE),
   # each file is parsed separately; the results are row-bound into one data frame
   function(f) jsonlite::stream_in(file(f))
)
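
To try the file-level approach without a real dump, here is a small self-contained sketch. The temporary directory, the file names part-000.jsonl and part-001.jsonl, the records, and the choice of 2 workers are all made up for illustration; it assumes furrr, dplyr (used by future_map_dfr) and jsonlite are installed.

```r
library(furrr)  # also attaches future, which provides plan()

# Hypothetical demo data: two small .jsonl files in a temporary directory.
demo_dir <- file.path(tempdir(), "jsonl_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c('{"UID": "a1", "length1": 3}', '{"UID": "a2", "length1": 5}'),
           file.path(demo_dir, "part-000.jsonl"))
writeLines(c('{"UID": "b1", "length1": 4}', '{"UID": "b2", "length1": 6}'),
           file.path(demo_dir, "part-001.jsonl"))

# Two background R sessions are enough for two files.
plan(multisession, workers = 2)

# `pattern` is a regular expression, so match the extension explicitly.
files <- list.files(demo_dir, pattern = "\\.jsonl$", full.names = TRUE)

# Parse each file in its own worker and row-bind the results.
df <- future_map_dfr(files, function(f) {
  jsonlite::stream_in(file(f), verbose = FALSE)
})

plan(sequential)  # shut down the background workers
unlink(demo_dir, recursive = TRUE)
```

For files this small the overhead of starting workers will likely outweigh any gain; the pattern should only pay off when each file takes noticeable time to parse.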
