
I have a 19 GB JSON file. It's a huge array of rather small objects.

[{
  "name":"Joe Blow",
  "address":"Gotham, CA"
  "log": [{},{},{}]
},
...
]

I want to iterate through the root array of this JSON. Every object, including its log, takes no more than 2 MB of memory, so it is possible to load one object into memory, work with it, and throw it away.

Yet the file itself is 19 GB and holds millions of those objects. I found it is possible to iterate through such an array using C# and the Newtonsoft.Json library: you read the file as a stream and, as soon as you see a complete object, deserialize it and spit it out.

But I want to see if PowerShell can do the same: not read the whole thing as one chunk, but iterate over whatever is in the hopper right now.
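
Conceptually I'm after something like the sketch below, which just drives the Newtonsoft reader from PowerShell. The DLL path, file path and property name here are placeholders, and I'd prefer something more PowerShell-native if it exists:

# sketch only: assumes Newtonsoft.Json.dll is available on disk (placeholder path)
Add-Type -Path 'C:\libs\Newtonsoft.Json.dll'

$textReader = [System.IO.StreamReader]::new('C:\data\big.json')   # placeholder path
$jsonReader = [Newtonsoft.Json.JsonTextReader]::new($textReader)
try {
  while ($jsonReader.Read()) {
    # each StartObject at this point is one element of the root array
    if ($jsonReader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject) {
      $obj = [Newtonsoft.Json.Linq.JObject]::Load($jsonReader)    # ~2 MB at most
      $obj['name'].ToString()                                     # work with it, then let it go
    }
  }
}
finally {
  $jsonReader.Close()
  $textReader.Dispose()
}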

Any ideas?

  • You should still be able to use `Newtonsoft.Json`; almost anything you can do in .NET you can do in PowerShell. However, if you're working with a ***19 gig JSON file*** you may want to re-think your strategy. A 19 GB object will take up 19 GB of memory, and unless you have an ungodly amount of memory, you'll end up swapping and the performance of your script or program will be terrible. This would also impact the performance of other processes on that host. For a dataset of this size I would recommend a database rather than a flat file like this. – codewario Dec 19 '19 at 04:56
  • Thanks for the tip. Believe me, I'm rethinking my strategy now; that's part of this rethink. This JSON is being split, flattened and shipped to an Elasticsearch cluster, where it should have been in the first place. You know how it is, inheriting someone else's database? – NewRK Dec 19 '19 at 05:00
  • ES is also a good choice for this, but if the data is long-lived don't use it as a permanent datastore. – codewario Dec 19 '19 at 05:20
  • Is streaming not possible in PowerShell/C#? I think it's not state of the art anymore to just load a whole file into memory. – moritz.vieli Dec 26 '19 at 02:11

1 Answer


As far as I know, ConvertFrom-Json doesn't have a streaming mode, but jq does: Processing huge json-array files with jq. This code turns a giant array into just the contents of the array, which can then be output piece by piece. Otherwise, a 6 MB, 400,000-line JSON file can use 1 GB of memory after conversion (400 MB in PowerShell 7).

get-content file.json | 
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | 
  % { $_ | convertfrom-json }

So for example this:

[
  {"name":"joe"},
  {"name":"john"}
]

becomes this:

{"name":"joe"}
{"name":"john"}

The streaming format of jq looks very different from regular JSON. For example, the array above looks like this, with a path to each value plus object and array end markers:

'[{"name":"joe"},{"name":"john"}]' | jq --stream -c

[[0,"name"],"joe"]
[[0,"name"]]        # end object
[[1,"name"],"john"]
[[1,"name"]]        # end object
[[1]]               # end array

And then after truncating one level of "parent folder" from the path of each of the two values:

'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream '1|truncate_stream(inputs)'     

[["name"],"joe"]
[["name"]]           # end object
[["name"],"john"]
[["name"]]           # end object
                     # no more end array

"fromstream()" turns it back into json...

'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'  

{"name":"joe"}
{"name":"john"}
js2010