
Per Split/Slice large JSON using jq, we are able to slice a huge input file into smaller chunks of data based on array size.

We would now like to add a new JSON element to each entry of the array, with a sequence number incrementing over the length of the original array, along with a filter/unique step over a few columns.

Input:

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},
     {"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77830"},
     {"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},
     {"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77832"}
 ]
}

Expected output after adding the additional key:

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"rownum":1,"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},
     {"rownum":2,"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77830"},
     {"rownum":3,"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},
     {"rownum":4,"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77832"}
 ]
}

After filtering (unique by state, city, postal) and slicing into chunks of array size 2:

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"rownum":1,"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},
     {"rownum":3,"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"}]}

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"rownum":4,"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77832"}
 ]
}

The sample below was used to filter/unique by a few columns, but it does not attain optimal performance:

jq -r --argjson size 2 '
  .add |= unique_by({city, state, postal})
  | del(.add) as $object
  | (.add | _nwise($size) | ("\t", $object + {add: .}))
' input.json |
awk '/^\t/ {fn++; next} {print >> ("part-" fn ".json")}'
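Note that _nwise is an undocumented internal jq builtin, so it may not exist in every build. If it is missing, a roughly equivalent definition (mirroring the internal one; the names nwise and chunk here are illustrative) can be added at the top of the filter:

  # split the input array into successive slices of at most $n elements
  def nwise($n):
    def chunk: if length <= $n then . else .[0:$n], (.[$n:] | chunk) end;
    chunk;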
Ilan
  • Not a free coding service. What have you done, and what specifically are the issues you are having? Try yourself first. – dawg Jan 26 '22 at 14:37
    Not looking for any free coding service. If you look at the earlier post referenced here, the code that was tried is very much shared: jq -r --argjson size 2 ' .add |= unique_by({rngNum,state,postal}) | del(.add) as $object | (.add|_nwise($size) | ("\t", $object + {add:.} )) ' input.json | awk ' /^\t/ {fn++; next} { print >> "part-" fn ".json"}'. This is not optimal for performance; need to see how it could be changed/tweaked to attain better performance. – Ilan Jan 26 '22 at 14:59
    It's not too late to fold the additional information into the text of the Q here. – peak Jan 26 '22 at 15:02
  • You could consider using a tokenizer that will logically split JSON objects. [Here](https://github.com/rickardp/splitstream) is an example. – dawg Jan 26 '22 at 16:30
  • dwag, thanks for the reference. Looking more towards sh solution.. thanks – Ilan Jan 26 '22 at 17:46

2 Answers


One could use

.add |= [ range(length) as $i | .[$i] | .rownum = $i+1 ]

Demo on jqplay

or

.add |= ( to_entries | map( .value.rownum = .key+1 | .value ) )

Demo on jqplay
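As a side note, to reproduce the expected output above (where rownum values 1, 3, 4 survive), the numbering has to happen before the deduplication, since rownum reflects positions in the original array. A minimal sketch combining the first variant with unique_by, bearing in mind that unique_by also sorts by the given key:

  .add |= ([ range(length) as $i | .[$i] | .rownum = $i+1 ]  # number the original array first
           | unique_by({state, city, postal}))               # then dedupe; original rownums survive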

ikegami
  • Thanks for your response; this fits the first half of the situation. The major concern with unique_by was performance, hence looking to see whether any tweaking can be done to this. Thanks again – Ilan Jan 27 '22 at 00:49

Here's a solution that uses two general-purpose filters: one for enumerating, and a sort-free, stream-oriented variant of unique_by:

  # Attach a 1-based counter under $key to each object in the stream s.
  def enumerate(s; $key): foreach s as $x (0; .+1; {($key): .} + $x);

  # Emit only the first item $x in the stream for each distinct value of ($x|f),
  # without sorting. The state object is keyed first by the type of ($x|f),
  # then by its string (or JSON) form.
  def uniques_by(stream; f):
    reduce stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      # "// {}" guards the lookup before the first item of each type is stored
      | if (.[$t] // {}) | has($y) then . else .[$t][$y] = $x end )
    | .[][] ;

  .add |= [enumerate(uniques_by(.[]; {city,state,postal}); "rownum")]
  | del(.add) as $object
  | (.add|_nwise($size) | ("\t", $object + {add:.} ))
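For reference, a possible invocation, assuming the filter above is saved as solution.jq (the filename is illustrative) and reusing the awk splitter from the question:

  jq -r --argjson size 2 -f solution.jq input.json |
  awk '/^\t/ {fn++; next} {print >> ("part-" fn ".json")}'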
peak
  • Thanks for the two options. I guess the second option is the initial/original trial that experienced performance issues. Can you please elaborate on the first option with the stream of items: is there anywhere we could provide the list of columns we are interested in, or would it always take the first matching row for that stream of inputs? Please clarify – Ilan Jan 27 '22 at 00:40
  • Sorry, I don't understand most of the above comment. Rest assured, though, that where there's a will, there's a jq way. – peak Jan 27 '22 at 02:48