10

I'm looking to transform JSON using jq to a delimiter-separated and flattened structure.

There have been attempts at this. For example, Flatten nested JSON using jq.

However the solutions on that page fail if the JSON contains arrays. For example, if the JSON is:

{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}

The solution above will fail to transform the above to:

{"a.b.0":1,"x.0.y":2,"x.1.z":3}

In addition, I'm looking for a solution that will also allow for an arbitrary delimiter. For example, suppose the space character is the delimiter. In this case, the result would be:

{"a b 0":1,"x 0 y":2,"x 1 z":3}

I'm looking to have this functionality accessed via a Bash (4.2+) function as is found in CentOS 7, something like this:

flatten_json()
{
    local JSONData="$1"
    # jq command to flatten $JSONData, putting the result to stdout
    jq ... <<<"$JSONData"
}

The solution should work with all JSON data types, including null and boolean. For example, consider the following input:

{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}

It should produce:

{"a b 0":"p q r","w 0 x":null,"w 1 y":false,"w 2 z":3}
Community
  • 1
  • 1
Steve Amerige
  • 1,309
  • 1
  • 12
  • 28

3 Answers3

18

If you stream the data in, you'll get pairings of paths and values of all leaf values. If not a pair, then a path marking the end of a definition of an object/array at that path. Using leaf_paths as you found would only give you paths to truthy leaf values so you'll miss out on null or even false values. As a stream, you won't get this problem.

There are many ways this could be combined to an object, I'm partial to using reduce and assignment in these situations.

$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}

$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
    .[[$i[0][]|tostring]|join($delim)] = $i[1]
)' input.json
{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}

Here's the same solution broken up a bit to allow room for explanation of what's going on.

$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
    [$i[0][]|tostring] as $path_as_strings
        | ($path_as_strings|join($delim)) as $key
        | $i[1] as $value
        | .[$key] = $value
)' input.json

Converting the input to a stream with tostream, we'll receive multiple values of pairs/paths as input to our filter. With this, we can pass those multiple values into reduce which is designed to accept multiple values and do something with them. But before we do, we want to filter those pairs/paths by only the pairs (select(length==2)).

Then in the reduce call, we're starting with a clean object and assigning new values using a key derived from the path and the corresponding value. Remember that every value produced in the reduce call is used for the next value in the iteration. Binding values to variables doesn't change the current context and assignments effectively "modify" the current value (the initial object) and passes it along.

$path_as_strings is just the path which is an array of strings and numbers to just strings. [$i[0][]|tostring] is a shorthand I use as an alternative to using map when the array I want to map is not the current array. This is more compact since the mapping is done as a single expression. That instead of having to do this to get the same result: ($i[0]|map(tostring)). The outer parentheses might not be necessary in general but, it's still two separate filter expressions vs one (and more text).

Then from there we convert that array of strings to the desired key using the provided delimiter. Then assign the appropriate values to the current object.

Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
  • Superb result, and I've tried it with a much-more complicated JSON file! Would you do me a favor and update your answer to explain in a bit more detail the mechanics of the jq statement that you've written? This would help not just me, but a lot of other people understand the sometimes confusing syntax. Again, well done! – Steve Amerige Feb 17 '17 at 17:47
  • If you have a moment, can your solution be modified so that if the key would otherwise require quoting when used as a selector (for example: `{"top":{"com.acme":37}}` would require the selector `.top."com.acme"`), then the output would have those path components quoted as shown in the example selector? Thanks again for a great solution! The reason I originally wanted to be able to provide a delimiter was a workaround for this issue. I could take the result and use `sed` to make it have the desired quoting. It does work, but it is clumsy. I'm hoping there is an elegant `jq` approach! – Steve Amerige Mar 01 '17 at 11:32
  • I really enjoy the solution, but when I use a JSON file with multiple entries it only operates on one of them. Perhaps something simple escapes me? – citizenrich Oct 17 '18 at 20:34
  • 1
    @citizenrich: what do you mean by "multiple entries"? what does your input look like? If it's just a json object followed by more json objects, it should just work as is. All inputs should get processed into the same result object. Unless there is some nuance in your input, then we would need to know how that looks like. – Jeff Mercado Oct 17 '18 at 20:38
  • My mistake. My use case is to input separate objects and output separate objects. I used ndjson as input the result is that only one object is processed. If I switch my input into nested json objects inside an array, e.g. [{},{}], then I do get all results processed into the same object. But I would still prefer separate json objects in output. – citizenrich Oct 18 '18 at 18:03
  • 1
    @citizenrich: Yeah, there will need to be some changes for it to be able to take in separate inputs. The way this is currently written, it's assuming a single input. But it should be a simple tweak. When I get the time, I'll put up a more generalized version. – Jeff Mercado Oct 18 '18 at 18:09
  • 1
    @citizenrich: I've updated the answer to support multiple inputs. But you won't have to put them in an array, just "stream" the inputs in. And by that, I mean just put the object inputs one after the other. i.e., `{} {}` – Jeff Mercado Oct 18 '18 at 19:27
2

The following has been tested with jq 1.4, jq 1.5 and the current "master" version. The requirement about including paths to null and false is the reason for "allpaths" and "all_leaf_paths".

# all paths, including paths to null
def allpaths:
  def conditional_recurse(f):  def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);

def all_leaf_paths:
  def isscalar: type | (. != "object" and . != "array");
  allpaths as $p
  | select(getpath($p)|isscalar)
  | $p ;


. as $in 
| reduce all_leaf_paths as $path ({};
     . + { ($path | map(tostring) | join($delim)): $in | getpath($path) })

With this jq program in flatten.jq:

$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}

$ jq --arg delim . -f flatten.jq input.json

{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}

Collisions

Here is a helper function that illustrates an alternative path-flattening algorithm. It converts keys that contain the delimiter to quoted strings, and array elements are presented in square brackets (see the example below):

def flattenPath(delim):
  reduce .[] as $s ("";
    if $s|type == "number" 
    then ((if . == "" then "." else . end) + "[\($s)]")
    else . + ($s | tostring | if index(delim) then "\"\(.)\"" else . end)
    end );

Example: Using flattenPath instead of map(tostring) | join($delim), the object:

 {"a.b": [1]}

would become:

{
  "\"a.b\"[0]": 1
}
peak
  • 105,803
  • 17
  • 152
  • 177
  • How does this solution compare to that presented by @jeff-mercado in terms of correctness, performance, and safety? I very much appreciate this post if nothing more than to see how you think about solving problems. – Steve Amerige Feb 18 '17 at 14:25
0

To add a new option to the solutions already given, jqg is a script I wrote to flatten any JSON file and then search it using a regex. For your purposes your regex would simply be '.' which would match everything.

$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg .
{
  "a.b.0": 1,
  "x.0.y": 2,
  "x.1.z": 3
}

and can produce compact output:

$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg -q -c .
{"a.b.0":1,"x.0.y":2,"x.1.z":3}

It also handles the more complicated example that @peak used:

$ echo '{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}' | jqg .
{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}

as well as empty arrays and objects (and a few other edge-case values):

$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.two-a.non-integer-number": 101.75,
  "two.two-a.number-zero": 0,
  "two.true-boolean": true,
  "two.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}

(reporting empty arrays & objects can be turned off with the -E option).

jqg was tested with jq 1.6

Note : I am the author of the jqg script.

Joe Casadonte
  • 15,888
  • 11
  • 45
  • 57