2

I have a large JSON file around 500MB which is the response of a URL call.I need to get the max value of "date" field in the JSON file in the "results" array using shell script(bash).Currently using jq as below.Below works good for smaller files but for larger files it is returning null.

maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max') 

Please help.Thanks! I am new to shell scripting,json,jq.

sample/input json file contents:

{
    "results": [
        {
            "Id": "123",
            "date": 1588910400000,
            "col": "test"
        },
        {
            "Id": "1234",
            "date": 1588910412345,
            "col": "test2"
        }
    ],
    "col2": 123
}
peak
  • 105,803
  • 17
  • 152
  • 177
SimpleBI
  • 23
  • 4

3 Answers3

2

Given --stream option on the command line, JQ won't load the whole input into the memory, instead it'll read the input token by token, producing arrays in this fashion:

[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...

Thanks to this feature, we can pick only dates from the input and find out the maximum one without exhausting the memory (at the expense of speed). For example:

jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
oguz ismail
  • 1
  • 16
  • 47
  • 69
1

500MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:

# max_by(empty;1) yields null
def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else  ($s|f) as $m
    | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s ;

With this in our toolkit, we can simply write:

max_by(.results[].date; .)

This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)


(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.

(**) The absence of a "date" subfield should not matter as null is less than every number.

peak
  • 105,803
  • 17
  • 152
  • 177
-1
 jq '.results | max_by(.date) | .date' "$jsonfilePath"

is a more efficient way to get the maximum date value out of that JSON that might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.

Shawn
  • 47,241
  • 3
  • 26
  • 60
  • Unfortunately, the current max_by/1 is defined using map - see https://github.com/stedolan/jq/blob/master/src/builtin.jq – peak May 09 '20 at 16:24