
I want to insert new JSON objects in between existing JSON objects, using a bash-generated UUID for each new object.

Input JSON file test.json:

{"name":"a","type":1}
{"name":"b","type":2}
{"name":"c","type":3}

Input bash command: uuidgen -r

Target output JSON:

{"id": "7e3ca7b0-48f1-41fe-9a19-092a62cba0dc"}
{"name":"a","type":1}
{"id": "3f793fdd-ec3b-4306-8153-12f3f9faf2c1"}
{"name":"b","type":2}
{"id": "cbcd759a-37e7-4da7-b7fe-7572f474ec31"}
{"name":"c","type":3}

Basic jq program to insert new objects:

jq -c '{"id"}, .' test.json

Output JSON:

{"id":null}
{"name":"a","type":1}
{"id":null}
{"name":"b","type":2}
{"id":null}
{"name":"c","type":3}

jq program to insert a UUID generated from bash:

jq -c '{"id" | input}, .' test.json < <(uuidgen)

I'm unsure how to handle the two inputs: the bash command used to create a value in the new object, and the input file to be transformed (with a new object inserted before each existing object).

I want to process both small and large JSON files, up to a few gigabytes each.

I would greatly appreciate some help with a well-designed solution that scales to large files and performs the operation quickly and efficiently.

Thanks in advance.

  • Calling `uuidgen` once per UUID that needs to be generated is probably going to be the slowest part of this when you need to process a large number of records. I'd probably use Python instead, just to be able to do that specific part in-process. – Charles Duffy Nov 21 '20 at 21:36
  • ...if you _really_ want to use jq+uuidgen, I can build an answer that does that, but it's going to be a lot slower than would be optimal, even if the jq parts themselves are fast. – Charles Duffy Nov 21 '20 at 21:39
  • I have a wide range of JSON files to process, each one needing a header object containing a UUID to conform to a downstream application. Struggling to get the basic function working. – Gabe Nov 21 '20 at 21:39
  • It would really help me in the short term, before we build a more elegant solution – Gabe Nov 21 '20 at 21:40
  • Any chance you would consider helping me with both? – Gabe Nov 21 '20 at 21:41
  • jq would get me going today, while I upskill my Python – Gabe Nov 21 '20 at 21:42
  • Charles, any chance you could post your Python solution? – Gabe Nov 21 '20 at 21:53
  • I already did, 5 minutes ago now. If you don't see it, refresh the page. – Charles Duffy Nov 21 '20 at 21:53
  • (BTW, as a bit of context for the interplay here -- peak is one of Stack Overflow's best jq experts, and is someone I learn a lot from; insofar as we're competing, the competition is a very friendly one). – Charles Duffy Nov 21 '20 at 22:14
  • I am privileged to have such great help, thank you gentlemen... (tips hat) – Gabe Nov 21 '20 at 22:30

4 Answers


If the input file is already well-formed JSONL, then a simple bash solution would be:

while IFS= read -r line; do
  printf "{\"id\": \"%s\"}\n" $(uuidgen)
  printf '%s\n' "$line"
done < test.json

This might well be the best trivial solution if test.json is very large and known to be valid JSONL.

If the input file is not already JSONL, then you could still use the above approach by piping in the output of jq -c . test.json. And if `read` is too slow, you could still use the above text-processing approach with awk.
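
For instance, a minimal sketch of that awk variant (assuming POSIX awk and the same test.json; uuidgen is still invoked once per record, so its per-call overhead remains):

awk '{
  cmd = "uuidgen"
  cmd | getline uuid    # read one UUID from the command
  close(cmd)            # close so the next record re-runs uuidgen
  printf "{\"id\": \"%s\"}\n", uuid
  print                 # emit the original JSON line unchanged
}' test.json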

For the record, a single-call-to-jq solution along the lines you have in mind could be constructed as follows:

jq -n -c -R --slurpfile objects test.json '
  $objects[] | {"id": input}, .' <(while true ; do uuidgen ; done)

Obviously you cannot "slurp" the unbounded stream of uuidgen values; less obviously, perhaps, if you were simply to pipe in the stream, the process would hang.

peak
  • This is quite close to what I was thinking of doing, but I was going to try to avoid slurpfile since I expected (perhaps wrongly?) that it would be reading everything into RAM, and thus unsuitable for multi-GB content. What _are_ the characteristics in that context? – Charles Duffy Nov 21 '20 at 21:42
  • That is probably fine for small-GB content; it might be more challenging for content in the tens of GB
  • That worked well enough, thank you; I see what you mean about the uuidgen time overhead

Since @peak has already covered the jq side of the problem, I'm going to take a shot at doing this more efficiently using Python, still wrapped so it can be called in a shell script.

This assumes that your input is JSONL, with one document per line. If it isn't, consider piping through jq -c . before piping into the below (see the usage sketch after the script).

#!/usr/bin/env bash

py_prog=$(cat <<'EOF'
import json, sys, uuid

for line in sys.stdin:
    print(json.dumps({"id": str(uuid.uuid4())}))
    sys.stdout.write(line)
EOF
)

python -c "$py_prog" <in.json >out.json
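
If the input isn't already JSONL, a sketch of that jq -c . normalization, reusing the same $py_prog (pretty.json here is just a placeholder name for a non-JSONL input file):

jq -c . pretty.json | python -c "$py_prog" >out.json
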
Charles Duffy

If the input is not known in advance to be valid JSONL, one of the following bash+jq solutions might make sense, since the overhead of counting the number of objects is relatively small.

If the input is small enough to fit in memory, you could go with a simple solution:

n=$(jq -n 'reduce inputs as $in (0; .+1)' test.json)

for ((i=0; i < $n; i++)); do uuidgen ; done |
jq -n -c -R --slurpfile objects test.json '
  $objects[] | {"id": input}, .'

Otherwise, that is, if the input is very large, then one could avoid slurping it as follows:

n=$(jq -n 'reduce inputs as $in (0; .+1)' test.json)
jq -nc --rawfile ids <(for ((i=0; i < $n; i++)); do uuidgen ; done) '
  $ids | split("\n") as $ids
  | foreach inputs as $in (-1; .+1; {id: $ids[.]}, $in)
' test.json 
peak
  • Is there much time saved? I would have expected the pipe buffer to be small enough that the number of unnecessary `uuidgen` calls with the infinite loop in your original answer would be fairly minimal. – Charles Duffy Nov 21 '20 at 22:03
  • Thank you again, delving in and testing; I guess once the count is known then bulk uuidgen might be done independently, still processing... – Gabe Nov 21 '20 at 22:04
  • @CharlesDuffy - Only time will tell :-). jq only allows you to use inputs on one stream, so in general, unless the JSON stream is already JSONL or some such, there's really no avoiding "slurping" one of the two streams with just one call to jq. If the objects were large, then it would obviously be better to slurp the ids. – peak Nov 21 '20 at 22:23

Here's another approach, where jq handles its input as raw strings that have already been muxed by a separate copy of bash.

while IFS= read -r line; do
  uuidgen
  printf '%s\n' "$line"
done <test.json | jq -Rrc '({ "id": . }, input)'

It still has all the performance overhead of calling uuidgen once per input line (plus some extra overhead because bash's read operates one byte at a time) -- but it operates in a fixed amount of memory without needing Python.

Charles Duffy
  • @peak, re: "really no avoiding slurping one of the two streams", consider this a counterexample :) – Charles Duffy Nov 21 '20 at 22:11
  • jq isn't doing much here. You might as well just write: `while IFS= read -r line; do printf "{\"id\": \"%s\"}\n" "$(uuidgen)"; printf '%s\n' "$line"; done` – peak Nov 21 '20 at 23:21