
I need to process large JSON files, instantiating objects from deserializable sub-strings as I iterate over / stream in the file.

For example:

Let's say I can only deserialize into instances of the following:

case class Data(a: Int, b: Int, c: Int)

and the expected JSON format is:

{   "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ], 
    "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ], 
     .... MANY ITEMS .... , 
    "qux": [ {"a": 0, "b": 0, "c": 0 }  }

What I would like to do is:

import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.

As a final note: I would prefer a solution that involves Jerkson or another library that ships with the Play framework, but if a different Scala library handles this scenario with greater ease and decent performance, I'm not opposed to trying it. If there is a clean way of manually seeking through the file and then using a JSON library to continue parsing from there, I'm fine with that.

What I do not want to do is ingest the entire file without streaming or iterating, as keeping the whole file in memory at once would be prohibitively expensive.

Ryan Delucchi
  • Will you be pulling in this file multiple times, or is it a one-time job? In other words, would a solution with up-front processing time but quicker repeated querying make sense? – Chris Pitman Jan 17 '13 at 21:16
  • I would only need to read it in once, so to answer your question: yes. – Ryan Delucchi Jan 17 '13 at 21:37
  • This is a bit of an unusual data format, but I guess it's due to processing style (map/reduce?) -- more commonly you'd get a long sequence or array of items, not a huge list of JSON Object properties. This is the main reason why many existing solutions won't work as-is. Jackson, for example, supports data-binding iterators via `ObjectMapper.reader().readValues(...)`, where one can iterate over individual values of an array (or root-level sequence). – StaxMan Feb 09 '13 at 05:42
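
For reference, the Jackson data-binding iterator StaxMan mentions would look roughly like this. This is a minimal, untested sketch: it assumes Jackson 2.x and a mapper able to bind Scala case classes (e.g. with jackson-module-scala registered), and it only applies when the input is a root-level array or sequence, not the object-of-arrays shape above.

import java.io.FileInputStream
import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper()  // assumption: a Scala module is registered so Data binds
val values = mapper.reader(classOf[Data]).readValues[Data](new FileInputStream("/home/me/data.json"))
while (values.hasNext) {
  val d = values.next()          // one Data at a time; the file is never fully in memory
  // ... process d ...
}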

2 Answers


I have not done this with JSON (and I hope someone will come up with a turnkey solution for you), but I have done it with XML, and here is a way of handling it.

It is basically a simple Map -> Reduce process with the help of a streaming parser.

Map (your advanceTo)

Use a streaming parser, such as JSON Simple's (not tested). When the callback matches your "path", collect everything below it by writing it to a stream (file-backed or in-memory, depending on your data). That will be the foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the Map step.
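
To make the Map step concrete, here is a rough, untested sketch of "advance to a path", written against Jackson's streaming JsonParser (the parser underneath Jerkson) rather than JSON Simple; the loop shape would be similar with any streaming parser. The method name advanceToField is made up for illustration.

import java.io.File
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Position the parser on the value of `field` at the top level, skipping
// the values of every other field without materializing them.
def advanceToField(file: File, field: String) = {
  val parser = new JsonFactory().createParser(file)
  parser.nextToken()                                  // START_OBJECT
  var token = parser.nextToken()                      // first FIELD_NAME
  while (token == JsonToken.FIELD_NAME && parser.getCurrentName != field) {
    parser.nextToken()                                // move onto this field's value
    parser.skipChildren()                             // skip a nested array/object wholesale
    token = parser.nextToken()                        // next FIELD_NAME, or END_OBJECT
  }
  parser.nextToken()                                  // now on START_ARRAY of the target field
  parser
}

From here, each element of the array can be copied out to a small buffer (or bound directly), which is the "collect anything below" part.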

Reduce (your stream[Data])

Since the streams you collected above look pretty small, you probably do not need to map/split them again; you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc.).
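
For the Reduce step with Jerkson, once a small sub-document has been buffered, an ordinary in-memory parse is cheap. A sketch, with the literal string standing in for whatever the Map step collected:

import com.codahale.jerkson.Json

// The buffered "foo" array is small, so parsing it whole is fine.
val fooJson = """[ {"a": 0, "b": 0, "c": 0}, {"a": 0, "b": 0, "c": 1} ]"""
val dataSeq: List[Data] = Json.parse[List[Data]](fooJson)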

Bruno Grieder
  • Interesting thought, and not far from what I'm actually doing right now: which involves using Jerkson in conjunction with util.parsing.input.PagedSeqReader. And you are absolutely correct that each leaf-node of JSON data is pretty small, so I only need to seek to the beginning and then the end of each piece. Once I have my solution worked out, I'll post it. In the meantime, anyone who has a more elegant approach -- I'd like to hear from you. – Ryan Delucchi Jan 17 '13 at 18:37

Here is the current way I am solving the problem:

import collection.immutable.PagedSeq
import util.parsing.input.PagedSeqReader
import com.codahale.jerkson.Json
import collection.mutable

private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
private val clearAndStop = ']'

// Consume characters until `text` has just been read; give up with None if
// the array terminator (clearAndStop) shows up first.
private def takeUntil(readerInitial: PagedSeqReader, text: String) : Taken = {
  val str = new StringBuilder()
  var readerFinal = readerInitial

  while(!readerFinal.atEnd && !str.endsWith(text)) {
    str += readerFinal.first
    readerFinal = readerFinal.rest
  }

  if (!str.endsWith(text) || str.contains(clearAndStop))
    Taken(readerFinal, None)
  else
    Taken(readerFinal, Some(str.toString))
}

// Apply takeUntil for each delimiter in sequence, threading the reader through.
private def takeUntil(readerInitial: PagedSeqReader, chars: Char*) : Taken = {
  var taken = Taken(readerInitial, None)
  chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))

  taken
}

// Seek to the "foo" array, then pull out one {...} object at a time.
def getJsonData() : Seq[Data] = {
  val data = mutable.ListBuffer[Data]()
  var taken = takeUntil(fileContent, "\"foo\"")
  taken = takeUntil(taken.reader, ':', '[')

  var doneFirst = false
  while (taken.text.isDefined) {
    if (!doneFirst)
      doneFirst = true
    else
      taken = takeUntil(taken.reader, ',')

    taken = takeUntil(taken.reader, '}')
    if (taken.text.isDefined) {
      print(taken.text.get)
      data += Json.parse[Data](taken.text.get)
    }
  }

  data
}

case class Taken(reader: PagedSeqReader, text: Option[String])
case class Data(a: Int, b: Int, c: Int)

Granted, this code doesn't handle malformed JSON very cleanly, and using it for multiple top-level keys ("foo", "bar" and "qux") would require looking ahead (or matching against a list of possible top-level keys), but in general I believe it does the job. It's not quite as functional as I'd like and isn't super robust, but PagedSeqReader definitely keeps this from getting too messy.

Ryan Delucchi
    If it works, fine... but I have 3 problems with your code: 1) too many vars and while, try using something like `Stream.continually(input.read(buffer)).takeWhile(_ != -1).foreach(...` 2) it does not handle encodings properly: both JSON escapes and character encoding 3) It is entirely specific to your data and hence harder to maintain. You should really try using an existing JSON Stream parser which would mostly solve these 3 problems for you. – Bruno Grieder Jan 18 '13 at 06:01
  • Agreed, and being relatively new to Scala, I actually am not sure exactly *how* to use these JSON stream parsers in a way that won't result in the entire file being slurped in and a massive monolithic JSON representation being created. The Stream.continually() construct you introduce is pretty cool for sure - I will have to try it. But for now, the JSON parsing is peripheral to the application, so I will likely want to table this and revisit it later. I'll keep an eye out for other posts on this -- but nevertheless, thanks for your insight, BGR. – Ryan Delucchi Jan 18 '13 at 06:33
  • I'm accepting this answer only because it is the most complete answer I have thus far. Of course, I am fully aware that this solution is not without faults, with my most significant take-away being that I will need to look into the proper application of the Stream.continually(input.read(buffer)) idiom. Moreover, once I am ready to dig deeper in streamed JSON parsing, there might be some additional capabilities of such that I am missing. – Ryan Delucchi Feb 04 '13 at 21:10
  • @BrunoGrieder: I'd like to try the existing JSON Stream parser you mention. Where is it? Which one? – gknauth Jan 14 '16 at 23:18
  • @BrunoGrieder Thanks, trying out json4s, will let you know how it goes. – gknauth Jan 15 '16 at 15:45
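
For later readers, the Stream.continually idiom from Bruno's first comment looks like this when spelled out (a self-contained sketch; the path and the chunk handling are placeholders):

import java.io.FileInputStream

val input = new FileInputStream("/home/me/data.json")
val buffer = new Array[Byte](8192)
try {
  Stream.continually(input.read(buffer))      // keep reading fixed-size chunks...
    .takeWhile(_ != -1)                       // ...until read() signals end-of-file
    .foreach(n => println(s"read $n bytes"))  // replace with real chunk handling
} finally input.close()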