
I have a large JSON file (about 90MB) that contains a homogeneous array of objects. I am trying to write a Haskell program that reduces over the values in the array. This seems like a good candidate for lazy evaluation: the program shouldn't have to read each object from the file until the previous one has been processed.

I've profiled the memory usage of the program when using the Data.Aeson and Text.JSON packages, and it seems that the entire file is parsed and a full abstract syntax tree is constructed in one pass before the rest of the program can process the array. Probably this is because the parse takes place in the Maybe (or Either, or Result) monad, and it isn't known whether the parse will return Just or Nothing (or their equivalents) until the complete AST has been built. This gives worryingly high memory usage and causes the program to run out of memory in most cases.
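
For reference, the shape of my code is roughly this (a simplified sketch, not my real program; the real one updates an IOArray rather than printing):

```haskell
import           Data.Aeson (Value, decode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.readFile "input.json"
  -- decode cannot return Just or Nothing until it has built the AST
  -- for the entire 90MB array, so memory peaks here.
  case decode contents :: Maybe [Value] of
    Nothing   -> putStrLn "parse error"
    Just objs -> mapM_ print objs
```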

Do any libraries support two-pass parsing: a first pass that determines whether the file *can* be parsed into the expected types, followed by a second, lazy pass that reads more of the file as it is needed?

Or is there a better way of solving this problem?

immutablestate
  • Use either the `conduit` or `pipes` library to solve this problem (a rough sketch follows these comments). – Sibi May 12 '14 at 11:33
  • "Probably because the parse takes place in the Maybe (or Either or Result) monad, and it isn't known whether the parse will return Just or Nothing (or equivalents) until the complete AST has been built." That should not be the case if you are folding in a single pass. When such things happen often there is something else in the code holding onto the input data. It will be easier to tell if you add the relevant bits of code to the question, though. – duplode May 12 '14 at 12:04
  • That makes sense, actually. It's not an explicit fold; I'm updating an IOArray with data from each object, so at the moment it's a mapM_. But I just realised that my function to calculate the size of the IOArray depends on finding a maximum value across all the objects in the input array. Maybe if I use an IOVector instead, this step can be skipped. Thanks for your help, sorry to waste your time on something (probably) trivial. – immutablestate May 12 '14 at 12:27
  • Perhaps you don't actually need arrays and `IO`. Have you tried to use the maps from `containers` or `unordered-containers` instead? It might make things much easier, especially if the output data is relatively small compared to the huge input (see the second sketch after these comments). – duplode May 12 '14 at 12:40
  • If your JSON file is of the form `{"foo": [item, item, item, ...]}` then you are basically stuck with generating the entire AST, because otherwise the parser might get to the end, encounter an error, and thus have to return the error instead of the objects. Can you strip the head and tail off the file to turn it into a sequence of JSON objects that can be parsed individually? (The second sketch below combines this with the map idea.) – Paul Johnson May 12 '14 at 18:07
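
A rough sketch of the `conduit` approach suggested above (assuming the `conduit`, `conduit-extra`, `aeson`, and `attoparsec` packages; the `element` parser and file name are illustrative, and it assumes a non-empty top-level array):

```haskell
import           Conduit
import           Control.Applicative ((<|>))
import           Data.Aeson (Value)
import           Data.Aeson.Parser (json)
import qualified Data.Attoparsec.ByteString.Char8 as A
import           Data.Conduit.Attoparsec (conduitParser, sinkParser)

-- One element of the top-level array: a JSON value followed by the ','
-- separator or the closing ']'.
element :: A.Parser Value
element = json <* A.skipSpace <* (A.char ',' <|> A.char ']') <* A.skipSpace

main :: IO ()
main = runConduitRes $
  sourceFile "input.json" .| do
    -- Consume the opening '[', then parse and process one element at
    -- a time; only the current element's AST is resident in memory.
    _ <- sinkParser (A.skipSpace *> A.char '[' *> A.skipSpace)
    conduitParser element .| mapM_C (liftIO . print . snd)
```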
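
And a sketch combining the last two comments: pre-process the file into one object per line (stripping the enclosing brackets and separating commas), then decode each line and fold into a strict map from `containers`. The summary computed here, a histogram of object sizes, is made up for illustration:

```haskell
import           Data.Aeson (Object, decode)
import qualified Data.ByteString.Lazy.Char8 as BL8
import           Data.List (foldl')
import qualified Data.Map.Strict as Map
import           Data.Maybe (mapMaybe)

main :: IO ()
main = do
  contents <- BL8.readFile "objects.jsonl"  -- one JSON object per line
  -- Lines are decoded lazily, and foldl' consumes them one at a time,
  -- so only the current object and the small summary map are live.
  let objs    = mapMaybe decode (BL8.lines contents) :: [Object]
      summary = foldl' (\m o -> Map.insertWith (+) (length o) (1 :: Int) m)
                       Map.empty objs
  print summary
```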

1 Answer


To my knowledge, the only streaming JSON parser on Hackage right now is yajl-enumerator. I've discussed creating a streaming JSON parsing/rendering library in the past, but I've yet to have a strong enough need (or enough demand) to do so. I would definitely be in favor of the existence of such a library, and would be happy to assist in getting it written.

Michael Snoyman
  • And the streaming library was written: https://github.com/ondrap/json-stream The latest version will make it to Hackage within a few days (usage sketch below). – ondra Apr 20 '15 at 13:34
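
A minimal sketch of consuming the array with json-stream (assuming its `Data.JsonStream.Parser` API; the file name is illustrative). `arrayOf value` matches each element of the outer array, and the result list is produced lazily as the input is parsed:

```haskell
import           Data.Aeson (Value)
import qualified Data.ByteString.Lazy as BL
import           Data.JsonStream.Parser (arrayOf, parseLazyByteString, value)

main :: IO ()
main = do
  contents <- BL.readFile "input.json"
  -- Elements are delivered as they are parsed; the whole AST is never
  -- in memory at once.
  let objs = parseLazyByteString (arrayOf value) contents :: [Value]
  print (length objs)
```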