Using JSONPath to filter properties in JSON documents

Question

I have an arbitrarily defined JSON document, and I want to apply a JSONPath expression as a whitelist filter for properties: all selected nodes and their ancestors back to the root node remain, and all other nodes are removed. If no nodes match, I should end up with an empty document.

There doesn't seem to be anything like this built into JSON.Net, and I couldn't find similar examples anywhere, so I built my own. I opted to copy the selected nodes into a newly built document rather than try to remove every node that didn't match. Since there could be multiple matches and documents could be large, it needs to merge the multiple selection results efficiently into a single tree/JSON document.

My attempt sort of works, but I'm getting strange results. The process involves a MergedAncestry method which iterates over the SelectTokens results, calls GetFullAncestry (which recursively builds the tree to that node), then merges the results. It seems the merging of JArrays is happening at the wrong level though, as you can see under "Actual results" below.

My questions:

Is there a better/faster/built-in way to achieve this?
If not, what am I doing wrong?

Code:

public static void Main()
{
    string json = @"..."; // snipped for brevity - see DotNetFiddle: https://dotnetfiddle.net/wKN1Hj
    var root = (JContainer)JToken.Parse(json);
    var t3 = root.SelectTokens("$.Array3B.[*].Array3B1.[*].*");

    // See DotNetFiddle for simpler examples that work
    Console.WriteLine($"{MergedAncestry(t3).ToString()}");  // Wrong output!

    Console.ReadKey();
}

// Returns a single document merged using the full ancestry of each of the input tokens
static JToken MergedAncestry(IEnumerable<JToken> tokens)
{
    JObject merged = null;
    foreach(var token in tokens)
    {
        if (merged == null)
        {
            // First object
            merged = (JObject)GetFullAncestry(token);
        }
        else
        {
            // Subsequent objects merged
            merged.Merge((JObject)GetFullAncestry(token), new JsonMergeSettings
            {
                // union array values together to avoid duplicates
                MergeArrayHandling = MergeArrayHandling.Union
            });
        }
    }
    return merged ?? new JObject();
}

// Recursively builds a new tree to the node matching the ancestry of the original node
static JToken GetFullAncestry(JToken node, JToken tree = null)
{
    if (tree == null)
    {
        // First level: start by cloning the current node
        tree = node?.DeepClone();
    }

    if (node?.Parent == null)
    {
        // No parents left, return the tree we've built
        return tree;
    }

    // Rebuild the parent node in our tree based on the type of node
    switch (node.Parent)
    {
        case JArray _:
            return GetFullAncestry(node.Parent, new JArray(tree));
        case JProperty _:
            return GetFullAncestry(node.Parent, new JProperty(((JProperty)node.Parent).Name, tree));
        case JObject _:
            return GetFullAncestry(node.Parent, new JObject(tree));
        default:
            return tree;
    }
}

Example JSON:

{
  "Array3A": [
    { "Item_3A1": "Desc_3A1" }
  ],
  "Array3B": [
    { "Item_3B1": "Desc_3B1" },
    {
      "Array3B1": [
        { "Item_1": "Desc_3B11" },
        { "Item_2": "Desc_3B12" },
        { "Item_3": "Desc_3B13" }
      ]
    },
    {
      "Array3B2": [
        { "Item_1": "Desc_3B21" },
        { "Item_2": "Desc_3B22" },
        { "Item_3": "Desc_3B23" }
      ]
    }
  ]
}

See DotNetFiddle for full code and tests

"Filter" JSONPath:

$.Array3B.[*].Array3B1.[*].*

Expected results:

{
    "Array3B": [
    {
        "Array3B1": [
        { "Item_1": "Desc_3B11" },
        { "Item_2": "Desc_3B12" },
        { "Item_3": "Desc_3B13" }
        ]
    }
    ]
}

Actual results:

{
    "Array3B": [
    {
        "Array3B1": [ { "Item_1": "Desc_3B11" } ]
    },
    {
        "Array3B1": [ { "Item_2": "Desc_3B12" } ]
    },
    {
        "Array3B1": [ { "Item_3": "Desc_3B13" } ]
    }
    ]
}

Thinking about this a bit more, I can see why the merge result occurs: technically, it's a correct top-down merge. However, I need a bottom-up merge in the context of the original document, i.e. one that preserves common ancestors. I'm working on a solution that involves recursively building the tree from the leaf nodes up while recognising common ancestors, but would appreciate it if anyone has any better suggestions! — pcdev, Aug 14 '19 at 20:23
Are you looking for `JsonExtensions.RemoveAllExcept<TJToken>(this TJToken obj, IEnumerable<string> paths)` from [this answer](https://stackoverflow.com/a/30333562/3744182) to [How to perform partial object serialization providing “paths” using Newtonsoft JSON.NET](https://stackoverflow.com/q/30304128/3744182)? — dbc, Aug 17 '19 at 05:24
Oh wow. Thanks @dbc! That's exactly what I was looking for! I haven't tried it yet to test for speed, but I think that's going to perform far better in most cases than rebuilding a copy of the tree as I've done below. Will look at it in the next day or two and advise, but I'm pretty confident it will do what I need. Thanks again! — pcdev, Aug 17 '19 at 21:48
@dbc: Your method works really well, thanks. I've done some testing and to my great surprise, after fixing a small bug in my code it seems that my method performs about the same as your `RemoveAllExcept` method when the input JSONPath strings match all nodes, about 50-100% faster when it matches half the nodes, and an order of magnitude faster for a small number of matches (8MB doc, 100 iterations). Not destroying the original is going to be beneficial for my purposes, so I'm going to stick with my answer but I'm happy to upvote if you add your answer. Thanks again for your contribution! — pcdev, Aug 19 '19 at 01:16
@dbc DotNetFiddle comparison: https://dotnetfiddle.net/i5Qlam — pcdev, Aug 19 '19 at 01:18

pcdev · Accepted Answer · 2019-08-19T23:22:20.297

Ok, I have found a way to do it. Thanks to @dbc for suggestions, improvements and pointing out issues.

Recursion didn't work well in the end: I needed all matched nodes that share a common parent to end up under a single copy of that parent, but the input nodes can sit at any level of the tree.

I've added a method to do filtering on multiple JSONPaths to output a single result document, as that was the original goal.

// Requires: using System.Collections.Generic; using System.Linq; using Newtonsoft.Json.Linq;
static JToken FilterByJSONPath(JToken document, IEnumerable<string> jPaths)
{
    var matches = jPaths.SelectMany(path => document.SelectTokens(path, false));
    return MergeAncestry(matches);
}

static JToken MergeAncestry(IEnumerable<JToken> tokens)
{
    if (tokens == null || !tokens.Any())
    {
        return new JObject();
    }

    // Get a dictionary of tokens indexed by their depth
    var tokensByDepth = tokens
        .Distinct(ObjectReferenceEqualityComparer<JToken>.Default)
        .GroupBy(t => t.Ancestors().Count())
        .ToDictionary(
            g => g.Key, 
            g => g.Select(node => new CarbonCopyToken { Original = node, CarbonCopy = node.DeepClone() })
                    .ToList());

    // start at the deepest level working up
    int depth = tokensByDepth.Keys.Max();
    for (int i = depth; i > 0; i--)
    {
        // If there's nothing at the next level up, create a list to hold parents of children at this level
        if (!tokensByDepth.ContainsKey(i - 1))
        {
            tokensByDepth.Add(i - 1, new List<CarbonCopyToken>());
        }

        // Merge all tokens at this level into families by common parent
        foreach (var parent in MergeCommonParents(tokensByDepth[i]))
        {
            tokensByDepth[i - 1].Add(parent);
        }
    }

    // We should be left with a list containing a single CarbonCopyToken, holding the root of our copied document and the root of the source
    var cc = tokensByDepth[0].FirstOrDefault();
    return cc?.CarbonCopy ?? new JObject();
}

static IEnumerable<CarbonCopyToken> MergeCommonParents(IEnumerable<CarbonCopyToken> tokens)
{
    var newParents = tokens.GroupBy(t => t.Original.Parent).Select(g => new CarbonCopyToken {
        Original = g.First().Original.Parent,
        CarbonCopy = CopyCommonParent(g.First().Original.Parent, g.AsEnumerable())
        });
    return newParents;
}

static JToken CopyCommonParent(JToken parent, IEnumerable<CarbonCopyToken> children)
{
    switch (parent)
    {
        case JProperty _:
            return new JProperty(((JProperty)parent).Name, children.First().CarbonCopy);
        case JArray _:
            var newParentArray = new JArray();
            foreach (var child in children)
            {
                newParentArray.Add(child.CarbonCopy);
            }
            return newParentArray;
        default: // JObject, or any other type we don't recognise
            var newParentObject = new JObject();
            foreach (var child in children)
            {
                newParentObject.Add(child.CarbonCopy);
            }
            return newParentObject;
    }
}

Notice it uses a couple of new classes. CarbonCopyToken lets us keep track of nodes and their copies as we work up the tree level by level. ObjectReferenceEqualityComparer<T> makes the Distinct method compare tokens by reference, so that different JValue nodes holding equal values are not treated as duplicates (thanks again @dbc for pointing this out):

public class CarbonCopyToken
{
    public JToken Original { get; set; }
    public JToken CarbonCopy { get; set; }
}

/// <summary>
/// A generic object comparer that only uses the object's reference,
/// ignoring any <see cref="IEquatable{T}"/> or <see cref="object.Equals(object)"/>  overrides.
/// </summary>
public class ObjectReferenceEqualityComparer<T> : IEqualityComparer<T> where T : class
{
    // Adapted from this answer https://stackoverflow.com/a/1890230
    // to https://stackoverflow.com/questions/1890058/iequalitycomparert-that-uses-referenceequals
    // By https://stackoverflow.com/users/177275/yurik
    private static readonly IEqualityComparer<T> _defaultComparer;

    static ObjectReferenceEqualityComparer() { _defaultComparer = new ObjectReferenceEqualityComparer<T>(); }

    public static IEqualityComparer<T> Default { get { return _defaultComparer; } }

    #region IEqualityComparer<T> Members

    public bool Equals(T x, T y)
    {
        return ReferenceEquals(x, y);
    }

    public int GetHashCode(T obj)
    {
        return System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(obj);
    }

    #endregion
}
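
To illustrate why the reference comparer matters, here is a minimal sketch that is separate from the tested code above. The class name JValueEqualityDemo, the Compare method and the sample JSON are made up for illustration. It relies on the behaviour dbc describes in the comments: JValue overrides Equals(), so two different nodes with the same value look identical to a plain Distinct() call, while ReferenceEquals still tells them apart.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

public static class JValueEqualityDemo
{
    // Hypothetical helper: compares two different nodes that hold the same value
    public static (bool ValueEqual, bool SameReference, int DistinctCount) Compare()
    {
        // Illustrative sample document, not taken from the question
        JToken root = JToken.Parse(@"{ ""A"": ""same"", ""B"": ""same"" }");
        JToken a = root["A"];
        JToken b = root["B"];

        // Value-based comparison via JValue's Equals override
        bool valueEqual = a.Equals(b);

        // Identity comparison: are these the same node in the tree?
        bool sameReference = ReferenceEquals(a, b);

        // Distinct() with the default comparer falls back to Equals/GetHashCode
        int distinctCount = new[] { a, b }.Distinct().Count();

        return (valueEqual, sameReference, distinctCount);
    }

    public static void Main()
    {
        var result = Compare();
        Console.WriteLine($"Equals: {result.ValueEqual}, ReferenceEquals: {result.SameReference}, Distinct count: {result.DistinctCount}");
    }
}
```

If the two nodes compare equal by value but not by reference, a default Distinct() would keep only one of them, and the filtered document would lose a property. Passing ObjectReferenceEqualityComparer<JToken>.Default to Distinct avoids that.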

Example usage:

// inputDocument is the JToken (e.g. from JToken.Parse) to be filtered
var filters = new List<string>
{
    "$..Test1",
    "$.Path.To.[*].Some.Nodes",
    "$.Other.*.Nodes"
};
var result = FilterByJSONPath(inputDocument, filters);

DotNetFiddle showing the previous tests plus one extra one: https://dotnetfiddle.net/ekABRI

I like the fact that your method creates a new `JToken` tree rather than pruning the incoming tree. But it looks like there may be a bug. If I use the JSON and filters from [Creating reduced json from a bigger json in c#](https://stackoverflow.com/q/56764017/3744182), this algorithm produces a different result than `RemoveAllExcept()`. See: https://dotnetfiddle.net/JwwXjP — dbc, Aug 19 '19 at 20:04
The problem may be because `JValue` overrides `Equals()`, so if two different `JValue` objects have identical values, they can get merged when added to a hash table (either directly or via `.Distinct()`). — dbc, Aug 19 '19 at 20:14
@dbc thanks for that, very interesting, after reading your comment I reproduced that behaviour but in my case this will not be a problem. I'll add a comment to this answer to clarify this for others. — pcdev, Aug 19 '19 at 20:52
@dbc on second thoughts, this may be an issue after all. I also see now why your solution used `ObjectReferenceEqualityComparer` - great solution to the `Equals` problem. I've added the same IEqualityComparer to `Distinct` and it's now producing the same results as your solution. Thanks once again! — pcdev, Aug 19 '19 at 21:26
