2

I have very large json I want to extract sum of data from it using jq.

I am trying every possible way but I think I am missing something here..

My Json subset:

{"main":
  {"0": {"x": {"a":1}, "y": {"number of un-used":{"count":2} , "z":2}},
  "1": {"x": {"a":1}, "y": {"number of un-used":{"count":3} , "z":2}},
  "2": {"x": {"a":1}, "y": {"number of un-used":{"count":4} , "z":2}},
  "3": {"x": {"a":1}, "y": {"no un-used":{"z":3} , "z":2}},
  "4": {"x": {"a":1}, "y": {"no un-used":{"z":3} , "z":2}}},
 "no-main":
  {"0": {"x": {"a":1}, "y": {"number of un-used":{"count":2} , "z":2}},
  "1": {"x": {"a":1}, "y": {"number of un-used":{"count":3} , "z":2}},
  "2": {"x": {"a":1}, "y": {"number of un-used":{"count":4} , "z":2}},
  "3": {"x": {"a":1}, "y": {"no un-used":{"z":3} , "z":2}},
  "4": {"x": {"a":1}, "y": {"no un-used":{"z":3} , "z":2}}}}

I want the sum of "count" - "number of un-used" that is under "y".

My most successful one is:

cat json | jq '.[] | .[].y | .["number of un-used"] | .count'

But the results contains a lot of "null" because "number of un-used" is not in all of "y" dicts..

Is it solvable?

peak
  • 105,803
  • 17
  • 152
  • 177
gabi
  • 1,003
  • 5
  • 12
  • 30

3 Answers3

3

You can add zero to a null to get ... zero.

jq '.[] | .[].y | ."number of un-used" | .count + 0' < input.json

This produces the list of count values you're looking for, with the null values showing 0. Adding them to your sum should not change it.

jq can do basic arithmetic in its expressions, but having it calculate your sum internally gets a little more complex. While you can define a function that uses a reduce, it would probably be easier to sum these values by appending something to the command line like this:

| awk '{n+=$1}END{print n}'

Of course, there's a zillion ways you could add up numbers, some more hackish than others.

$ { printf '%s+' $(jq '.[] | .[].y | ."number of un-used" | .count + 0' input.json); echo 0; } | bc
18
ghoti
  • 45,319
  • 8
  • 65
  • 104
  • Also note that `.[] | .[].y` could be written as `.main[].y.` – Inian Jan 11 '18 at 12:05
  • 2
    @Inian, thanks, yes that works for the first half of the input in the question, but the other other half, the object associated with the `no-main` string, then gets missed. If you want to handle *all* pairs in the enclosing object, I think the pipe notation is the right way to go. – ghoti Jan 11 '18 at 15:01
0

Since the JSON is very large, you might want a solution that is overall quite efficient, so it would be worthwhile considering the following:

def sigma(s): reduce s as $x (0; .+$x);
sigam(.[].[].y | .["number of un-used"] | .count)

Notice that this also addresses the problem with nulls. To make the solution even more robust, you could sprinkle in some occurrences of post-fix ? as well.

For humomgously large JSON, you might need a streaming solution, but that would be considerably more complicated, though the above def would still be applicable.

peak
  • 105,803
  • 17
  • 152
  • 177
0

If the JSON is too large to fit comfortably in memory, then jq's streaming parser could be used, as illustrated here.

Invocation

$ jq -n --stream -f program.jq input.json

Note especially the "-n" option.

program.jq

def sigma(s): reduce s as $x (0; .+$x);

sigma(inputs 
      | select(length==2) 
      | select(.[0][-3:] == ["y","number of un-used","count"])
      | .[1] )
peak
  • 105,803
  • 17
  • 152
  • 177