14

Similar questions asked here before:

Count items for a single key: jq count the number of items in json by a specific key

Calculate the sum of object values: How do I sum the values in an array of maps in jq?

Question

How to emulate the COUNT aggregate function which should behave similarly to its SQL original? Let's extend this question even more to include other regular SQL functions:

  • COUNT
  • SUM / MAX/ MIN / AVG
  • ARRAY_AGG

The last one is not a standard SQL function - it's from PostgreSQL but is quite useful.

At input comes a stream of valid JSON objects. For demonstration let's pick a simple story of owners and their pets.

Model and data

Base relation: Owner

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

Source

From above we get a derivative relation Owner_Pet (a result of SQL JOIN of the above relations) presented in JSON format for our jq queries (the source data):

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Requests

Here are sample requests and their expected output:

  • COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
  • SUM up the number of whelps per owner and get their MAX (MIN/AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
  • ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }
peak
  • 105,803
  • 17
  • 152
  • 177
Onkeltem
  • 1,889
  • 1
  • 18
  • 27

3 Answers3

9

Here's an alternative, not using any custom functions with basic JQ. (I took the liberty to get rid of redundant parts of the question)

Count

In> jq -s 'group_by(.owner_id) |  map({ owner_id: .[0].owner_id, count: map(.pet) | length})'
Out>[{"owner_id": "1","pets_count": 2}, ...]

Sum

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": "1","sum": 6}, ...]

Max

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": "1","max": 4}, ...]

Aggregate

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet) })'
Out> [{"owner_id": "1","agg": ["Bella","Lucy"]}, ...]

Sure, these might not be the most efficient implementations, but they show nicely how to implement custom functions oneself. All that changes between the different functions is inside the last map and the function after the pipe | (length, add, max)

The first map iterates over the different groups, taking the name from the first item, and using map again to iterate over the same-group items. Not as pretty as SQL, but not terribly more complicated.

I learned JQ today, and managed to do this already, so this should be encouraging for anyone getting started. JQ is neither like sed nor like SQL, but not terribly hard either.

Cornelius Roemer
  • 3,772
  • 1
  • 24
  • 55
  • Fantastic!!! Here's my aggregate function to all bitcoin transactions by a given wallet address `[.transactions | map(select(.category == "receive")) | map(select(.confirmations >= 6)) | map(select(.amount > 0)) | sort_by(.time) | .[] | .] | group_by(.address) | map({ address: .[0].address, count: map(.txid) | length, total: map(.amount) | add, max: map(.amount) | max, min: map(.amount) | min})` – Marcello DeSales Apr 10 '22 at 16:31
3

Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
3

This is a nice exercise, but SO is not a programming service, so I will focus here on some key concepts for generic solutions in jq that are efficient, even for very large collections.

GROUPS_BY

The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups, while avoiding calling tojson on strings:

# emit a stream of the groups defined by f
def GROUPS_BY(stream; f): 
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;

distinct and count_distinct

# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream): 
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      | if (.[$t] | has($y)) then . else .[$t][$y] += [$x] end )
   | [.[][]] | add ;


# Emit the number of distinct items in the given stream
def count_distinct(stream):
   def sum(s): reduce s as $x (0;.+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;

Convenience function

def owner: {owner_id,owner,age};

Example: "COUNT the number of pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}

Invocation: jq -nc -f program1.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Example: "SUM up the number of whelps per owner and get their MAX"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
  + {litter_total: (map(.litter) | add)}
  + {litter_max:  (map(.litter) | max)}

Invocation: jq -nc -f program2.jq input.json

Output: as given.

Example: "ARRAY_AGG pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}

Invocation: jq -nc -f program3.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
peak
  • 105,803
  • 17
  • 152
  • 177
  • Thank you very much Peak. I'm new to this language and it's hard for me to find a bug, but currently I don't get the expected results. For example, your count returns 1 for Adam, not 2. Totals and max returns different results either. – Onkeltem Jan 19 '18 at 06:20
  • 1. GROUPS_BY was not properly terminated by a semicolon. You must have added it to get any results at all. 2. The discrepancy probably arises because you omitted the -n option that is required when using `inputs` as here. I'll make that explicit. – peak Jan 19 '18 at 07:28