
I have a huge (~7GB) JSON array of relatively small objects.

Is there a relatively simple way to filter these objects without loading the whole file into memory?

The `--stream` option looks suitable, but I can't figure out how to fold the stream of `[path, value]` pairs back into the original objects.
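
For illustration, here is what `--stream` gives me on a tiny two-object array: a stream of `[path, value]` leaf events plus closing `[path]` events (exact output may vary slightly by jq version):

$ echo '[{"foo":"bar"},{"foo":"baz"}]' | jq -c --stream '.'
[[0,"foo"],"bar"]
[[0,"foo"]]
[[1,"foo"],"baz"]
[[1,"foo"]]
[[1]]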

jq170727
Dmitry Ermolov
  • Small world. I've just come up against a similar problem. Out of interest, is the whitespace in your JSON file predictable? For example, large JSON arrays often use one line per top-level array item. – Tom Aug 24 '15 at 13:14
  • If the file is already regularly formatted, then you might want to consider using text-wrangling tools to convert it into a stream of small objects, which could then be processed using jq. If it is not already suitably formatted, and if it is acceptable to run jq over the whole file just once, then you might consider using `jq .` to format the JSON so that it is easy to convert into such a stream (see the sketch below). – peak Aug 30 '16 at 09:48
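
A minimal sketch of that one-full-pass approach, assuming you can afford to load the whole array once; the file name `huge.json` and the filter field `interesting` are placeholders:

$ jq -c '.[]' huge.json > items.ndjson                # one full pass: array -> one object per line
$ jq -c 'select(.interesting == true)' items.ndjson   # reads one object at a time, low memory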

1 Answer


jq 1.5 has a streaming parser. The jq FAQ gives an example of how to convert a top-level array of JSON objects into a stream of its elements:

$ echo '[{"foo":"bar"},{"foo":"baz"}]' | jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
{"foo":"bar"}
{"foo":"baz"}

That may be enough for your purposes, but it is worth noting that setpath/2 can also be helpful. Here's how to produce a stream of leaflets (one single-path object per leaf):

jq -c --stream '. as $in | select(length == 2) | {} | setpath($in[0]; $in[1])'
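
As a quick sanity check, here is that filter applied to a small object input. (Note: for a top-level array the paths begin with numeric indices, in which case seeding with `null` rather than `{}` may be necessary; worth verifying against your jq version.)

$ echo '{"a":{"b":1},"c":2}' | jq -c --stream '. as $in | select(length == 2) | {} | setpath($in[0]; $in[1])'
{"a":{"b":1}}
{"c":2}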

Further information and documentation are available in the jq manual: https://stedolan.github.io/jq/manual/#Streaming

peak
  • The `--stream` option looks promising; however, I'm still unsure how to use it. For example, how will the streaming parser detect a missing closing bracket at the end? – hek2mgl Aug 24 '15 at 20:35
  • Thanks, it looks like this is the intended way of solving my problem. Unfortunately, jq seems to have a [bug](https://github.com/stedolan/jq/issues/927) that prevents me from using it. I ended up writing my own little JSON parser that did the same job. I also found that my simple Python script that greps the interesting entries works faster than jq (maybe I'm using jq improperly). So I'm quite disappointed in this tool :( – Dmitry Ermolov Aug 25 '15 at 13:34
  • @dim-an - My bad about the -n being necessary when inputs is used. The --stream option is new, so I'm not surprised it is fairly slow. As I mentioned elsewhere on this page, if I had been in your position (as indeed I have been), I would probably have simply clipped the first and last brackets (see the sketch at the end of this thread), or focused on figuring out how to use jq once to wrangle your gigantic one-object JSON into a more manageable form. – peak Aug 26 '15 at 03:41
  • ... or doing it the right way, like I've suggested! ;) – hek2mgl Aug 26 '15 at 08:38
  • I need to do a `group_by(col)` on a JSON file of about 7 GB, and the column contains around 3 million distinct values. What would be the best way to do this? – nishant Aug 14 '17 at 06:39
  • @NishantKumar - I would suggest asking a new SO question, giving more specific details, as per https://stackoverflow.com/help/mcve – peak Aug 14 '17 at 07:45
  • @dim-an Any chance you could share your Python parser? I have a similar use case to solve, and your example would help me a lot. Thank you! – Jarek Sep 29 '20 at 09:43
  • @Jarek Oh, I'm afraid that script was lost a long time ago; the parser was not generic in any way and was tuned for my special case. I suppose I searched for the `}` character and treated it as an object boundary (my JSON objects were really simple); then, knowing the object boundaries, I could do some processing (maybe using the built-in JSON parser). Once again, it was a throw-away parser that I used for some investigation. – Dmitry Ermolov Sep 29 '20 at 16:43
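
For completeness, a rough sketch of the bracket-clipping idea peak mentions above. It assumes the file has first been normalized with `jq .` (one full pass), so that the opening `[` and closing `]` sit on their own lines and top-level elements are separated by lines reading `  },`; the predicate `.foo == "bar"` and file names are placeholders:

$ jq . huge.json > pretty.json
$ sed -e '1d;$d' -e 's/^  },$/  }/' pretty.json | jq -c 'select(.foo == "bar")'

The sed step deletes the first and last lines and strips the commas between elements, leaving a whitespace-separated stream of JSON objects that jq can then process one at a time.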