
My use case is similar to this entry, in wanting to read an inner, huge array (multiple gigabytes as text) from within a JSON object such as:

{ "a": "...",   // root level fields to be read, separately
  ...
  "bs": [       // the huge array, most of the payload (can be multiple GB's)
    {...},
    ...
  ]
}

The input is available as a Source[ByteString,_] (Akka Streams), and I already use Circe for JSON decoding elsewhere.

I can see two challenges:

  1. Reading the bs array in a streamed fashion (getting a Source[B,_] for consuming it).

  2. Splitting the original stream into two, so I can read and analyse the root level fields before the array begins.

Do you have pointers to solving such a use case? I have checked akka-stream-json and circe-iteratee, so far.

akka-stream-json looks like the right fit, but it is not actively maintained. circe-iteratee does not seem to integrate with Akka Streams.

akauppi

2 Answers


Jawn has an async parser: https://github.com/non/jawn/blob/master/parser/src/main/scala/jawn/AsyncParser.scala

But it is hard to write an efficient async parser for JSON because of the format's sequential nature.

If you can switch to synchronous parsing, you can use jsoniter-scala-core and write a simple custom codec that skips every key/value pair except "bs" and then parses the required data blazingly fast, without holding the array content in memory.
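A minimal sketch of such a codec, assuming a recent jsoniter-scala 2.x (core plus macros) on the classpath. The element type B, the BsScanner object and the onB callback are hypothetical names for illustration, and the sketch assumes "bs" is always a non-null array:

```scala
import java.io.InputStream
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros.JsonCodecMaker

// Hypothetical element type; the real B depends on the data.
final case class B(id: Int)

object BsScanner {
  private val bCodec: JsonValueCodec[B] = JsonCodecMaker.make[B]

  // A decode-only codec: calls `onB` for each element of "bs"
  // and skips every other root-level field without materializing it.
  private def scanner(onB: B => Unit): JsonValueCodec[Unit] =
    new JsonValueCodec[Unit] {
      def nullValue: Unit = ()

      def encodeValue(x: Unit, out: JsonWriter): Unit =
        throw new UnsupportedOperationException("decode only")

      def decodeValue(in: JsonReader, default: Unit): Unit =
        if (in.isNextToken('{')) {
          if (!in.isNextToken('}')) {
            in.rollbackToken()
            while ({
              if (in.readKeyAsString() == "bs") readArray(in) else in.skip()
              in.isNextToken(',')
            }) ()
            if (!in.isCurrentToken('}')) in.objectEndOrCommaError()
          }
        } else in.decodeError("expected '{'")

      // Decodes the array one element at a time, so only one B is held at once.
      private def readArray(in: JsonReader): Unit =
        if (in.isNextToken('[')) {
          if (!in.isNextToken(']')) {
            in.rollbackToken()
            while ({
              onB(bCodec.decodeValue(in, bCodec.nullValue))
              in.isNextToken(',')
            }) ()
            if (!in.isCurrentToken(']')) in.arrayEndOrCommaError()
          }
        } else in.decodeError("expected '['")
    }

  // Reads the whole document from a blocking InputStream.
  def scan(in: InputStream)(onB: B => Unit): Unit =
    readFromStream[Unit](in)(scanner(onB))
}
```

Calling BsScanner.scan(inputStream)(b => ...) would then visit each element in order. The root-level fields are simply skipped here; reading them would need extra branches on the key.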

Jason Aller
Andriy Plokhotnyuk
    akka-stream-json builds on top of the Jawn. The problem I am facing is reading in a >5GB JSON file, where the bulk is a single array. – akauppi Jul 23 '18 at 07:50

I can see this needing a whole new library, for streamed JSON decoding.

Something like:

case class A(a: Int, bs: Source[B,_])

val src: Source[ByteString,_] = ???
src.as[A]

My interim solution is to "massage" the JSON with jq and sed, so that each B is on its own line. This way, I can consume the source line-wise and decode each B separately.

Here's the Bash script (with no guarantees):

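#!/bin/bash

# Arguments: <array key> <input file>
arrKey=$1
input=$2

# The first line holds the root level fields; drop its last character.
head -n 1 "$input" | sed 's/.$//'
# Print each array element on its own line, followed by a comma.
jq -M -c ".$arrKey|.[]" "$input" | sed 's/$/,/'
echo "]}"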

It does rely on certain things, e.g. the non-array content always being on the first line (which it is, in my data).
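On the consuming side, the line-wise decoding could look roughly like this. It is a sketch only, assuming Akka Streams 2.6+ and circe with circe-generic; the element type B, the LineWise object and the 1 MB frame limit are hypothetical, and the first line (the root level fields) is simply dropped rather than parsed:

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Framing}
import akka.util.ByteString
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import io.circe.parser.decode

// Hypothetical element type; the real B depends on the data.
final case class B(id: Int)
object B { implicit val decoder: Decoder[B] = deriveDecoder[B] }

object LineWise {
  // Splits the bytes into lines, strips the trailing comma added by sed,
  // and decodes every remaining line as one B.
  val decodeBs: Flow[ByteString, B, NotUsed] =
    Framing
      .delimiter(ByteString("\n"), maximumFrameLength = 1024 * 1024, allowTruncation = true)
      .map(_.utf8String.trim.stripSuffix(","))
      .drop(1) // first line: root level fields, not handled in this sketch
      .filter(line => line.nonEmpty && line != "]}") // skip the closing line
      .map(line => decode[B](line).fold(e => throw e, identity)) // fail the stream on bad input
}
```

With src being the Source[ByteString,_] of the massaged file, src.via(LineWise.decodeBs) gives a Source[B,_]. A single B longer than maximumFrameLength would fail the stream, so the limit needs to fit the data.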

akauppi