
Say I have a JSON document in which the byte 0xb7 is written as a Unicode escape (code point U+00B7):

{"key":"_\u00b7_"}

If I extract the value of "key" with jq, the output contains the UTF-8 encoding of that code point, "c2 b7", rather than the single byte:

$ echo '{"key":"_\u00b7_"}' | ./jq '.key' -r | xxd
0000000: 5fc2 b75f 0a                             _.._.
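
To make the difference between the code point and its UTF-8 bytes concrete, here is a minimal sketch in Python (used only for illustration; it is not part of the jq pipeline, and the variable names are made up for this example). It parses the same JSON and prints the code points next to the bytes that end up on stdout:

```python
import json

# Same document as above; the backslash is doubled so Python passes \u00b7 to the JSON parser.
doc = '{"key":"_\\u00b7_"}'
value = json.loads(doc)["key"]            # a 3-character string: "_", U+00B7, "_"

code_points = [ord(ch) for ch in value]   # one integer per character
utf8_bytes = value.encode("utf-8")        # the bytes a UTF-8 writer emits for that string

print([hex(cp) for cp in code_points])    # ['0x5f', '0xb7', '0x5f']
print(utf8_bytes.hex(" "))                # 5f c2 b7 5f
```

U+00B7 lies above U+007F, so UTF-8 needs two bytes for it, which is where the extra c2 comes from.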

Is there any jq command that extracts the decoded "5f b7 5f" byte sequence out of this JSON?

I can solve this with extra tools like iconv, but it's a bit ugly:

$ echo '{"key":"_\u00b7_"}' | ./jq '.key' -r \
      | iconv -f utf8 -t utf32le \
      | xxd -ps | sed -e 's/000000//g' | xxd -ps -r \
      | xxd
0000000: 5fb7 5f0a                                _._.
peak
salmin

2 Answers

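# Format a code point in the range 0..255 as two uppercase hex digits.
def hx:
  # Map a single hex digit (0..15) to its ASCII character: 48 is "0", 65 is "A".
  def hex: [if . < 10 then 48 + . else 55 + . end] | implode;
  tonumber | "\(. / 16 | floor | hex)\(. % 16 | hex)";

{"key":"_\u00b7_"} | .key | explode | map(hx)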

Run with jq -nc, this produces:

["5F","B7","5F"]

"Raw Bytes" (caveat emptor)

Since jq only supports UTF-8 strings, you would have to use some external tool to obtain the "raw bytes". Maybe this is closer to what you want:

jq -nrj '{"key":"_\u00b7_"} | .key' | iconv -f utf-8 -t ISO8859-1

This outputs the three bytes 5f b7 5f.

And here's an iconv-free solution (note that utf8_decode is deprecated as of PHP 8.2):

jq -nrj '{"key":"_\u00b7_"} | .key' | php -r 'print utf8_decode(readline());'
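
Both one-liners above map each code point to exactly one byte, which can only work for code points up to U+00FF. The following Python sketch makes that limit explicit instead of relying on how each tool treats characters it cannot convert. The function name `codepoints_to_bytes` is hypothetical and chosen for this example only:

```python
def codepoints_to_bytes(text: str) -> bytes:
    """Map each code point to a single byte; fail if any is above U+00FF."""
    too_big = [ch for ch in text if ord(ch) > 0xFF]
    if too_big:
        raise ValueError(f"code points above U+00FF do not fit in one byte: {too_big!r}")
    return bytes(ord(ch) for ch in text)


print(codepoints_to_bytes("_\u00b7_").hex(" "))   # 5f b7 5f

try:
    codepoints_to_bytes("\u20ac")                 # the euro sign is U+20AC, above the limit
except ValueError as err:
    print("rejected:", err)
```

In other words, the approach is only suitable when the JSON string is known to carry byte values escaped as \u0000 through \u00ff.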
peak
  • Thank you! The "explode" operator performs the actual "utf8 => codepoint number" decoding; after that there are a number of ways to proceed with this data. However, from the jq sources it seems that there is currently no way to obtain the actual "5f b7 5f" byte sequence (I mean just 3 bytes, not a hex-encoded string) as output, because jq always ensures that it is dealing with a correct utf8 byte sequence. Therefore binary output always needs some kind of encoding (utf8, hex, base64, uri etc). – salmin Jan 14 '18 at 13:10

Alternate

Addressing the character encoding scenario outside of jq:

Though you didn't want extra tools, iconv and hexdump are readily available. I often lean on iconv when I need the encoding at a given stage of a pipeline to be known exactly, and on hexdump when I want control over how the bytes are displayed.

So an alternative is:

jq -njr '{"key":"_\u00b7_"} | .key' | iconv -f utf8 -t UTF-32LE | hexdump -ve '1/1 "%.X"'

Result:

5FB75F
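
As far as I can tell, the pipeline works because UTF-32LE stores every code point in four bytes, low byte first, and the "%.X" format prints nothing for a zero byte. The Python sketch below imitates that behaviour under this assumption; `hexdump_like` is a hypothetical helper written for this example, not a real tool:

```python
def hexdump_like(text: str) -> str:
    """Imitate: iconv -t UTF-32LE | hexdump -ve '1/1 "%.X"' (assumed behaviour)."""
    utf32 = text.encode("utf-32-le")   # 4 bytes per code point, no byte order mark
    # Zero bytes are skipped; other bytes are printed as unpadded uppercase hex.
    return "".join(format(b, "X") for b in utf32 if b != 0)


print(hexdump_like("_\u00b7_"))   # 5FB75F
print(hexdump_like("\n"))         # A, not 0A, because the digits are not zero-padded
```

The second call shows the main caveat: bytes below 0x10 come out as a single digit, and code points above U+00FF contribute more than one non-zero byte, so the output is only unambiguous for values from 0x10 to 0xff.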

hmedia1