I would like to slice a huge JSON file (~20 GB) into smaller chunks of data based on the size of an array (10000/50000 elements, etc.).

Input:

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},
     {"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"},
     {"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},
     {"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}
 ]
}

I am currently running a loop that increments the x/y values to get the desired output, but performance is very slow: each iteration takes 8-20 seconds, depending on the size of the file, to complete the split. I am using jq 1.6. Are there any alternatives for producing the result below?

Expected output, for a slice of 2 objects per array:

{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]}
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}

Tried with

cat $inFile | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile

cat $inFile | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile  
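
Roughly, the loop driving these commands looks like this (simplified; variable names are illustrative). Each iteration re-reads and re-parses the whole ~20 GB file with jq, which is why every slice takes several seconds:

inFile=input.json
outFile=output.json
size=2                                   # objects per slice
total=$(jq '.add | length' "$inFile")    # number of objects in the .add array
x=0
while [ "$x" -lt "$total" ]; do
  y=$((x + size))
  jq --arg x $x --arg y $y -c \
    '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' \
    "$inFile" >> "$outFile"
  x=$y
done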

Please share if there are any alternatives available.


2 Answers


In this response, which calls jq just once, I'm going to assume your computer has enough memory to read the entire JSON. I'll also assume you want to create separate files for each slice, and that you want the JSON to be pretty-printed in each file.

Assuming a chunk size of 2, and that the output files are to be named using the template part-N.json, you could write:

< input.json jq -r --argjson size 2 '
  del(.add) as $object
  | (.add|_nwise($size) | ("\t", $object + {add:.} ))
' | awk '
      /^\t/ {fn++; next}
      { print >> "part-" fn ".json"}'

The trick being used here is that jq never emits a literal tab character in its JSON output (tabs inside strings are escaped as \t, and the pretty-printer indents with spaces), so a line consisting of a single tab can safely serve as a record separator for awk.
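
Note that _nwise is an undocumented helper defined internally by jq. If your jq build does not provide it, a functionally similar filter can be defined in the program itself; the following is a sketch (the name nwise is my own), and with it prepended to the filter, .add|_nwise($size) becomes .add|nwise($size):

# Splits an input array into consecutive sub-arrays of at most $n elements;
# the last chunk may be shorter.
def nwise($n):
  def chunk: if length <= $n then . else .[0:$n], (.[$n:] | chunk) end;
  chunk;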

  • Thank you so much, it's much faster. I have additionally added -c before -r. – Ilan Jan 08 '22 at 01:04
  • Is there any better way to improve performance when we add unique_by to this? After adding unique_by, it takes more time to execute: < input.json jq -r --argjson size 2 ' .add |= unique_by({rngNum,state,postal}) | del(.add) as $object | (.add|_nwise($size) | ("\t", $object + {add:.} )) ' | awk ' /^\t/ {fn++; next} { print >> "part-" fn ".json"}' – Ilan Jan 26 '22 at 00:36
  • The built-in `unique_by` is computationally inefficient (it involves a sort). For a sort-free version of `unique` (and thence a sort-free version of unique_by) see the jq Cookbook: https://github.com/stedolan/jq/wiki/Cookbook#using-bag-to-implement-a-sort-free-version-of-unique – peak Jan 26 '22 at 03:28
  • Again, thank you very much for pointing in that direction. In our current implementation we don't have the data as streams. I would like to lean towards injecting an additional element with an autogenerated number into each object of the JSON array; this would help in filtering them out at the backend. With the above example, could you assist/direct me in adding an incremental integer value inside the add array, like (row 1, 2, ...): {"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"row":1,"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"row":2,"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]} – Ilan Jan 26 '22 at 03:55
  • It's always easy to switch to a stream-oriented mode. I'd suggest asking a new SO question if you need further details about sort-free unique_by or autogenerated numbers. – peak Jan 26 '22 at 05:39
  • https://stackoverflow.com/questions/70864798/split-slice-large-json-sort-free-unique-by-few-columns-add-additional-element Posted a new question on SO – Ilan Jan 26 '22 at 14:12

The following assumes the input JSON is too large to read into memory and therefore uses jq's --stream command-line option.

To keep things simple, I'll focus on the "slicing" of the .add array and won't worry about the other keys, pretty-printing, or other details, as you can easily adapt the following to your needs:

< input.json jq -nc --stream --argjson size 2 '
  def regroup(stream; $n):
    foreach (stream, null) as $x ({a:[]};
      if $x == null then .emit = .a
      elif .a|length == $n then .emit = .a | .a = [$x]
      else .emit=null | .a += [$x] end;
      select(.emit).emit);

    regroup(fromstream( 2 | truncate_stream(inputs | select(.[0][0] == "add")) );
            $size)' |
  awk '{fn++; print > fn ".json"}'

This writes the arrays to files with names of the form N.json.
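
As a quick sanity check, regroup can be run on its own against a trivial numeric stream; with $n set to 2 it emits the full groups followed by the final partial group:

jq -nc '
  def regroup(stream; $n):
    foreach (stream, null) as $x ({a:[]};
      if $x == null then .emit = .a
      elif .a|length == $n then .emit = .a | .a = [$x]
      else .emit=null | .a += [$x] end;
      select(.emit).emit);
  regroup(range(5); 2)'
# [0,1]
# [2,3]
# [4]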

  • Thanks for your response; I will try this out by adding the other parameters as well. For now, I haven't experienced any memory issues reading the file, but I will certainly compare the performance of both approaches. Thank you. – Ilan Jan 08 '22 at 01:06