19

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.

I have 11 TB of json files which are both nested and contains dots in field names, meaning not elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest it as is.

The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).

I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).

Kindly help.

EDIT:

I think the pair of examples 3&4 solves this for me: https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/

I'll try soon.

assafmo
  • 1,047
  • 3
  • 15
  • 32

6 Answers6

29

You can also use the following jq command to flatten nested JSON objects in this manner:

[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries

The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.

Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
  • 3
    This solution does not work when the JSON contains arrays. For example: `{"a":{"b":[1]}}` for which an error is raised: **jq: error (at :1): string (".") and number (0) cannot be added** – Steve Amerige Feb 17 '17 at 13:17
  • 4
    Great answer, though this filters out any values which evaluate to `false`, i.e. `false`, `null`, etc. This is because `leaf_paths` is a short-hand for `paths(scalars)`, and while `scalars` does select those, `paths` only returns entries for which they aren't false. Long story short, substitute `leaf_paths` with `paths(type != "object" and type != "array")` to include everything. – hraban Apr 30 '20 at 09:21
  • 1
    To fix the error **jq: error (at :1): string (".") and number (0) cannot be added** `[leaf_paths as $path | {"key": [$path[] | tostring] | join("."), "value": getpath($path)}] | from_entries` – Abhijit Jul 22 '21 at 19:05
19

This is just a variant of Santiago's jq:

. as $in 
| reduce leaf_paths as $path ({};
     . + { ($path | map(tostring) | join(".")): $in | getpath($path) })

It avoids the overhead of the key/value construction and destruction.

(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)

Two important points about both these jq solutions:

  1. Arrays are also flattened. E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:

    {
      "a.b.0": 0,
      "a.b.1": 1,
      "a.b.2": 2
    }
    
  2. If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:

    {"a.b":0, "a": {"b": 1}}
    
peak
  • 105,803
  • 17
  • 152
  • 177
  • 1
    @SteveAmerige - The answer has been updated so that it will work with jq 1.4 and later. – peak Feb 18 '17 at 04:42
4

Here is a solution that uses tostream, select, join, reduce and setpath

  reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
     {}
     ; setpath($p; $v)
  )
jq170727
  • 13,159
  • 3
  • 46
  • 56
3

I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.

$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}

$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}

jqg was tested using jq 1.6

Note: I am the author of the jqg script.

Joe Casadonte
  • 15,888
  • 11
  • 45
  • 57
1

As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:

{
    "a.b":[1],
    "id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
    "_version_":1535841499921514496
}

EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).

EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.

EDIT 3: Is is also possible to parallel with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P it created duplicates in the collection)

assafmo
  • 1,047
  • 3
  • 15
  • 32
0

As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars), it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:

. as $in
     | reduce paths(type != "object" and type != "array") as $path ({};
          . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
yyny
  • 1,623
  • 18
  • 20