How to read a big ndjson (20GB) file by chunk into R?

I have a big data file that I want to read 1M rows at a time.

Currently, I'm using the code below to load the data into R.

jsonlite::stream_in(
  file(fileName)
)

But I don't need to load all the data at once. How can I split this file into chunks so that it loads faster?

hrbrmstr
Mahdi Jadaliha
  • I'd highly suggest using Apache Drill along with the sergeant package https://rud.is/books/drill-sergeant-rstats/reading-a-streaming-json-ndjson-data-file-with-drill-r.html. In fact, I'd highly suggest converting the ndjson to parquet with Drill and use the parquet version in Drill via the sergeant package. – hrbrmstr Nov 12 '18 at 20:51

1 Answer

If you don't want to level-up and use Drill, this will work on any system where zcat (or gzcat) and sed are available:

stream_in_range <- function(infile, start, stop, cat_kind = c("gzcat", "zcat")) {

  infile <- path.expand(infile)
  stopifnot(file.exists(infile))

  gzip <- (tools::file_ext(infile) == "gz")
  if (gzip) cat_kind <- match.arg(cat_kind, c("gzcat", "zcat"))

  start <- as.numeric(start[1])
  stop <- as.numeric(stop[1])

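  # Print lines start..stop, then quit at stop + 1 so sed does not scan the rest of the file.
  # "%.0f" avoids scientific notation (e.g. 1e+06) for large line numbers.
  sed_arg <- sprintf("%.0f,%.0fp;%.0fq", start, stop, stop + 1)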

  sed_command <- sprintf("sed -n '%s'", sed_arg)

  # shQuote() protects file names that contain spaces or shell metacharacters
  if (gzip) {
    command <- sprintf("%s %s | %s ", cat_kind, shQuote(infile), sed_command)
  } else {
    command <- sprintf("%s %s", sed_command, shQuote(infile))
  }

  ndjson::flatten(system(command, intern=TRUE), "tbl")

}

stream_in_range("a-big-compressed-ndjson-file.json.gz", 100, 200)

stream_in_range("a-big-uncompressed-ndjson-file.json", 1, 10)

Choose or add whichever cat_kind works on your system.
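
If sed is not available, or you want to walk the whole file in consecutive 1M-row chunks without rescanning it from the top for every range, a base R connection can do the chunking. This is only a minimal sketch, not part of ndjson or jsonlite: `read_ndjson_chunks`, `chunk_size` and `handler` are hypothetical names introduced for illustration. It keeps one connection open and hands each batch of raw lines to a callback.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# read a newline-delimited file in fixed-size batches over one open connection.
read_ndjson_chunks <- function(path, chunk_size, handler) {
  # gzfile() reads gzip-compressed files and also plain uncompressed files
  con <- gzfile(path, open = "r")
  on.exit(close(con), add = TRUE)

  chunk_index <- 0
  repeat {
    # Each call continues where the previous one stopped
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break  # end of file reached
    chunk_index <- chunk_index + 1
    handler(lines, chunk_index)
  }
  invisible(chunk_index)  # number of chunks processed
}

# Small demonstration: a temporary 5-line file read 2 lines at a time
tmp <- tempfile(fileext = ".json")
writeLines(sprintf('{"id":%d}', 1:5), tmp)

chunk_sizes <- integer(0)
n_chunks <- read_ndjson_chunks(tmp, 2, function(lines, i) {
  chunk_sizes[i] <<- length(lines)  # record how many lines each chunk held
})
```

For real data, the handler could parse each batch the same way the function above does, for example with ndjson::flatten(lines, "tbl"), and then aggregate or write out the result before the next chunk is read, so only one chunk is in memory at a time.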

hrbrmstr
  • Thanks, @hrbrmstr, your answer helped a lot. A note for Windows users: you can install Cygwin to make `sed` available to `system(command, intern=TRUE)`. One more question: I see that you are the author of **ndjson**. Would it be possible to add a skip parameter to the `ndjson::stream_in` function? – Mahdi Jadaliha Nov 13 '18 at 18:20
  • sure. can you post that as an issue? i thought Rtools.exe install came w/`sed` but I don't run Windows so I'm very likely wrong abt that. – hrbrmstr Nov 13 '18 at 18:53