
I'm building out a JSON record and need to generate some text in jq, pipe that text through an MD5 hash function, and use the result as the value for a key.

echo '{"first": "John", "last": "Big"}' | jq '. | { id: (.first + .last) | md5 }'

From looking at the manual and the GitHub issues I can't figure out how to do this, since a jq function can't call out to a shell and there is no builtin that provides hash-like functionality.

Edit

A better example of what I'm looking for is this:

echo '{"first": "John", "last": "Big"}' | jq '. | {first, last, id: (.first + .last | md5) }'

to output:

{
  "first": "John",
  "last": "Big",
  "id": "cda5c2dd89a0ab28a598a6b22e5b88ce"
}

Edit2

and a little more context: I'm creating NDJSON files for use with esbulk, and I need to generate a unique key for each record. Initially I thought piping out to the shell would be the simplest solution, since I could then easily use sha1sum or some other hash function, but that is proving more challenging than I expected.
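
For reference, extracting a bare digest in the shell is straightforward, e.g. with sha1sum (which prints the digest followed by a filename marker, so cut keeps only the digest); the hard part is wiring that into the jq pipeline:

printf '%s' 'JohnBig' | sha1sum | cut -d' ' -f1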

An updated example of what I'm looking for is this:

echo '[{"first": "John", "last": "Big"}, {"first": "Justin", "last": "Frozen"}]' | jq -c '.[] | {first, last, id: (.first + .last | md5) }'

to output:

{"first":"John","last":"Big","id":"cda5c2dd89a0ab28a598a6b22e5b88ce"}
{"first":"Justin","last":"Frozen","id":"af97f1bd8468e013c432208c32272668"}

5 Answers


Using tee allows a pipeline to be used, e.g.:

echo '{"first": "John", "last": "Big"}' |
    tee >( jq -r '.first + .last' | md5 | jq -R '{id: .}') |
    jq -s add

Output:

{
  "first": "John",
  "last": "Big",
  "id": "cda5c2dd89a0ab28a598a6b22e5b88ce"
}
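
In outline: tee passes the original object both to the process substitution, which computes {id: ...} from the md5 of the concatenated name fields, and to its stdout; both objects travel down the pipe, and jq -s add slurps and merges them into one.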

Edit2:

The following uses a while loop to iterate through the elements of the array, but it calls jq twice at each iteration. For a solution that does not call jq at all within the loop, see elsewhere on this page.

echo '[{"first": "John", "last": "Big"}, {"first": "Justin", "last": "Frozen"}]' |
jq -c .[] |
while read -r line ; do
    jq -r '[.[]]|add'  <<< "$line" | md5 |
        jq  --argjson line "$line" -R '$line + {id: .}'
done
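
At each iteration, [.[]] | add concatenates the object's values (e.g. "JohnBig"), md5 hashes that string, and the second jq invocation merges {id: .} into the original object, which is passed back in via --argjson.
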
peak
  • Updated my question. This gets super close, but I can't get it to work with multiple objects in an array. – Silas Paul Jan 31 '18 at 19:47
  • Interesting.... I like your solution better than mine, much easier to read. The method that I created is horribly slow, 200ish records every second. I'm going to try your method and see if it's faster than my method. – Silas Paul Jan 31 '18 at 21:14
  • See separate answer for an efficient solution when the input is an array. – peak Feb 01 '18 at 03:12
  • The Edit2 solution (the second version, without tee) was the fastest for 1000 records, completing in 35s. It was also the easiest to digest. – Silas Paul Feb 01 '18 at 15:51

Looking around a little further, I ended up finding jq json parser hash the field value, which was helpful in getting to my answer:

echo '[{"first": "John", "last": "Big"}, {"first": "Justin", "last": "Frozen"}]' > /tmp/testfile

jsonfile="/tmp/testfile"
jq -c .[] "$jsonfile" | while read -r jsonline ;
do
  # parse the JSON line, concatenate the fields to be hashed, and store the md5sum in a variable
  id="$(jq -s -j -r '.[] | .first + .last' <<<"$jsonline" | md5sum | cut -d ' ' -f1)"
  # use the stored md5sum as an argument for adding an id field to the existing JSON line
  jq --arg id "$id" -s -c '.[] | .id = "\($id)"' <<<"$jsonline"
done

Output:

{"first":"John","last":"Big","id":"467ffeee8fea6aef01a6ffdcaf747782"}
{"first":"Justin","last":"Frozen","id":"fda76523d5259c0b586441dae7c2db85"}
Silas Paul

jq + md5sum trick:

json_data='{"first": "John", "last": "Big"}'
jq -r '.first + .last| @sh' <<<"$json_data" | md5sum | cut -d' ' -f1 \
| jq -R --argjson data "$json_data" '$data + {id: .}'

Sample output:

{
  "first": "John",
  "last": "Big",
  "id": "f9e1e448a766870605b863e23d3fdbd8"
}
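
Note that @sh shell-quotes the value, so what gets hashed here is 'JohnBig' with literal single quotes; that is presumably why this digest differs from the others on this page.
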
RomanPerekhrest

Here is an efficient solution to the restated problem. There are altogether just two calls to jq, no matter the length of the array:

json='[{"first": "John", "last": "Big"}, {"first": "Justin", "last": "Frozen"}]'

echo "$json" |
jq -c '.[] | [.[]] | add' |
while read -r line ; do echo "$line" | md5 ; done |
jq -s -R --argjson json "$json" 'split("\n")
  | map(select(length>0))
  | . as $in
  | reduce range(0;length) as $i ($json; .[$i].id = $in[$i])'
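
In outline: the first jq call prints one line per array element containing its concatenated values, the shell loop replaces each line with its md5 digest, and the second jq call slurps the digests as raw text, splits on newlines, and the reduce assigns the $i-th digest as the .id of the $i-th element of the original array.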

This produces an array. Just tack on |.[] at the end to produce a stream of the elements.

Or a bit more tersely, with the goal of emitting one object per line without calling jq within the loop:

jq -c --slurpfile md5 <(jq -c '.[] | [.[]] | add' <<< "$json" |
    while read -r line ; do printf '"%s"' $(md5 <<< "$line" ) ; done) \
 '[., $md5] | transpose[] | .[0] + {id: .[1]}' <<< "$json"
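
Here --slurpfile reads the stream of quoted digests into the array $md5; [., $md5] | transpose pairs each input object with its digest, and .[0] + {id: .[1]} merges the two.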

Distinct Digest for each Record

I need to generate a unique key for each record.

It would therefore make sense to compute the digest based on each entire JSON object (or more generally, the entire JSON value), i.e. to use jq -c '.[]'.
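
A minimal sketch along those lines, hashing each complete object rather than selected fields (this assumes GNU coreutils md5sum; on macOS, md5 -q prints just the digest), reusing the slow one-jq-call-per-record pattern for brevity:

json='[{"first": "John", "last": "Big"}, {"first": "Justin", "last": "Frozen"}]'

jq -c '.[]' <<< "$json" |
while read -r obj ; do
    # hash the compact form of the entire object, not just selected fields
    id=$(printf '%s' "$obj" | md5sum | cut -d' ' -f1)
    jq -c --arg id "$id" '. + {id: $id}' <<< "$obj"
done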

peak
  • The first solution doesn't work because the JSON file is very large, over 300 MB, and results in `Argument list too long`. The second solution isn't faster than the https://stackoverflow.com/revisions/48549720/4 Edit 2 solution you proposed. I'm going to roll it back to that and mark that as my answer. – Silas Paul Feb 01 '18 at 15:46
  • Obviously if the JSON is in a file, you wouldn’t want to stuff it into a variable. If you can make the Edit2 solution work, it should be trivial to make the “fast” solution work. – peak Feb 01 '18 at 16:13
  • @peak First of all, great solution splitting this into just 2 discrete runs. Secondly - this may be immaterial, but the `jq -c '.[] | [.[]] | add' |` I believe should be `jq -cr .....` - since it does not include the literal double quotes `"` in the digest. – hmedia1 Feb 02 '18 at 07:21
  • @hmedia1 - You make a valid point, but since the goal is to have a suitable "digest" for each entity in the array, I think the difference is probably immaterial. The important point is that `jq -c '.[]'` should be used to achieve the stated goal. – peak Feb 02 '18 at 07:36
  • @peak I see you've even noted something similar yourself in a comment to a different answer. I tend to lean toward an easily describable schema, for the sake of future proofing if the entries need to be externally verified. I.e. *"Digest format is: firstlast - for example "first": "John", "last": "Big" would calculate the digest of **`JohnBig`**"* - This is also fairly robust against typical variances in character encodings across platforms and ttys where line termination style, or things like smart quotes and whitespaces end up finding their way into the mix. – hmedia1 Feb 02 '18 at 07:59
  • i.e. The md5 of the entity JohnBig is `467ffeee8fea6aef01a6ffdcaf747782` - Nothing more to it, the engineer programming a third party system doesn't need to know anything about `jq`, just that the value of `id` in an array works out to the `md5` of the `first` + `last` of that same object – hmedia1 Feb 02 '18 at 08:05
  • @hmedia1 - It might be desirable to include the key names in the digest... – peak Feb 05 '18 at 09:16

I adapted the accepted answer's script to my case and am posting it here; it could be useful to someone.

input.json:

{"date":100,"text":"some text","name":"april"}
{"date":200,"text":"a b c","name":"may"}
{"date":300,"text":"some text","name":"april"}

output.json:

{"date":100,"text":"some text","name":"april","id":"4d93d51945b88325c213640ef59fc50b"}
{"date":200,"text":"a b c","name":"may","id":"3da904d79fb03e6e3936ff2127039b1a"}
{"date":300,"text":"some text","name":"april","id":"4d93d51945b88325c213640ef59fc50b"}

The bash script to generate output.json:

cat input.json |
while read -r line ; do
    jq -r '.text' <<< "$line" | md5 |
        jq -c --argjson line "$line" -R '$line + {id: .}' \
        >> output.json
done
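
Note that because only .text is hashed, records with identical text get identical ids; the first and third lines above share the same digest.
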
Ikrom