Newtonsoft.json: cut JSON according to json path whitelist

Question

Suppose I have some complex JSON:

{
    "path1": {
        "path1Inner1": {
            "id": "id1"
        },
        "path1Inner2": {
            "id": "id2"
        }
    },
    "path2": {
        "path2Inner1": {
            "id": "id3"
        },
        "path2Inner2": {
            "id": "id4",
            "key": "key4"
        }
    }
}

There is also a whitelist of JSONPath expressions, for example:

$.path1.path1Inner1
$.path2.path2Inner2.key

I want to keep in the JSON tree only the nodes and properties that match the whitelist, so the result would be:

{
    "path1": {
        "path1Inner1": {
            "id": "id1"
        }
    },
    "path2": {
        "path2Inner2": {
            "key": "key4"
        }
    }
}

In other words, this is not just a selection by JSONPath (which is a trivial task): the selected nodes and properties have to keep their original place in the source JSON tree.

I think `JsonExtensions.RemoveAllExcept<TJToken>(this TJToken obj, IEnumerable<string> paths)` from [this answer](https://stackoverflow.com/a/30333562/3744182) to [How to perform partial object serialization providing "paths" using Newtonsoft JSON.NET](https://stackoverflow.com/q/30304128/3744182) does what you want. Can you confirm? — dbc, Nov 23 '21 at 19:30
@Peter Csala I'm studying both of the solutions and will make a decision shortly. — ademchenko, Nov 25 '21 at 14:59
@dbc thanks for your link. I've managed to implement the easier solution, using the idea of your approach. — ademchenko, Nov 26 '21 at 11:13

ademchenko · Accepted Answer · 2021-11-30T13:12:35.613

First of all, many thanks for this answer and this one. They were the starting point of my analysis of the problem.

Those answers present two different approaches to "whitelisting" by paths. The first one rebuilds the whitelisted structure from scratch, i.e. it starts from an empty object and creates the needed routes. The implementation parses the path strings and tries to rebuild the tree from the parsed paths. This approach requires handling every possible type of path by hand and is therefore error-prone. Some of the mistakes I found are listed in my comments on that answer.

The second approach is based on the Json.NET object tree API (Parent, Ancestors, Descendants, etc.). The algorithm traverses the tree and removes the nodes that are not whitelisted. I find that approach much simpler and much less error-prone, and it supports a wide range of path types in one go.

The algorithm I have implemented is similar in many respects to the second answer but is, I think, much easier to implement and understand. I also don't think its performance is any worse.

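    using System.Collections.Generic;
    using System.Linq;
    using Newtonsoft.Json.Linq;

    public static class JsonExtensions
    {
        // Removes every node that is neither a whitelisted token, nor one of its ancestors,
        // nor one of its descendants. The token is modified in place and returned.
        public static TJToken RemoveAllExcept<TJToken>(this TJToken token, IEnumerable<string> paths) where TJToken : JContainer
        {
            // Reference equality is required because JValue compares by value.
            HashSet<JToken> nodesToRemove = new(ReferenceEqualityComparer.Instance);
            HashSet<JToken> nodesToKeep = new(ReferenceEqualityComparer.Instance);

            foreach (var whitelistedToken in paths.SelectMany(token.SelectTokens))
                TraverseTokenPath(whitelistedToken, nodesToRemove, nodesToKeep);

            // None of the paths matched any token, so nothing is whitelisted.
            if (nodesToKeep.Count == 0)
            {
                token.RemoveAll();
                return token;
            }

            // A sibling on one path may be whitelisted through another path, so keep wins.
            nodesToRemove.ExceptWith(nodesToKeep);

            foreach (var notWhitelistedNode in nodesToRemove)
                notWhitelistedNode.Remove();

            return token;
        }

        // Walks from a whitelisted token up to the root. Every node on the way is marked
        // as kept, and the siblings of each such node become candidates for removal.
        private static void TraverseTokenPath(JToken value, ISet<JToken> nodesToRemove, ISet<JToken> nodesToKeep)
        {
            JToken? immediateValue = value;

            do
            {
                nodesToKeep.Add(immediateValue);

                // Only JObject and JArray children can be removed individually;
                // a JProperty parent is handled one level up, as a child of its JObject.
                if (immediateValue.Parent is JObject or JArray)
                {
                    foreach (var child in immediateValue.Parent.Children())
                        if (!ReferenceEqualityComparer.Instance.Equals(child, immediateValue))
                            nodesToRemove.Add(child);
                }

                immediateValue = immediateValue.Parent;
            } while (immediateValue != null);
        }
    }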

JToken instances must be compared with a reference equality comparer, because some JToken types, such as JValue, compare by value. Otherwise you can get buggy behaviour in some cases.

For example, given the source JSON

{
   "path2":{
      "path2Inner2":[
         "id",
         "id"
      ]
   }
}

and the path $..path2Inner2[0], a by-value comparison treats both "id" items as the same node, so you get the result JSON

{
   "path2":{
      "path2Inner2":[
         "id",
         "id"
      ]
   }
}

instead of

{
   "path2":{
      "path2Inner2":[
         "id"
      ]
   }
}

As of .NET 5.0, the standard ReferenceEqualityComparer can be used. If you use an earlier version of .NET, you need to implement such a comparer yourself.
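For runtimes without the built-in comparer, a minimal sketch of one could look like the following. The class name `JTokenReferenceEqualityComparer` and the `Demo` class are hypothetical names chosen for illustration; they are not part of Json.NET or the base class library. The small `Main` method also shows the by-value behaviour of JValue that makes the comparer necessary.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using Newtonsoft.Json.Linq;

// Hypothetical helper: compares JToken instances by reference only.
public sealed class JTokenReferenceEqualityComparer : IEqualityComparer<JToken>
{
    public static readonly JTokenReferenceEqualityComparer Instance =
        new JTokenReferenceEqualityComparer();

    private JTokenReferenceEqualityComparer() { }

    public bool Equals(JToken x, JToken y)
    {
        return object.ReferenceEquals(x, y);
    }

    // RuntimeHelpers.GetHashCode ignores any GetHashCode override on the token,
    // so the hash is consistent with reference equality.
    public int GetHashCode(JToken obj)
    {
        return RuntimeHelpers.GetHashCode(obj);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Two distinct instances holding the same value.
        JToken first = new JValue("id");
        JToken second = new JValue("id");

        var byValue = new HashSet<JToken> { first };
        var byReference = new HashSet<JToken>(JTokenReferenceEqualityComparer.Instance) { first };

        Console.WriteLine(byValue.Contains(second));     // True: JValue compares by value
        Console.WriteLine(byReference.Contains(second)); // False: different instances
    }
}
```

In `RemoveAllExcept`, an instance like this would be passed to both HashSet constructors in place of `ReferenceEqualityComparer.Instance`, and the sibling check in `TraverseTokenPath` could use `object.ReferenceEquals` directly.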

score 0 · Answer 2 · answered Nov 23 '21 at 11:47

Let's suppose that you have valid JSON inside a sample.json file:

{
  "path1": {
    "path1Inner1": {
      "id": "id1"
    },
    "path1Inner2": {
      "id": "id2"
    }
  },
  "path2": {
    "path2Inner1": {
      "id": "id3"
    },
    "path2Inner2": {
      "id": "id4",
      "key": "key4"
    }
  }
}

Then you can achieve the desired output with the following program:

static void Main()
{
    var whitelist = new[] { "$.path1.path1Inner1", "$.path2.path2Inner2.key" };
    var rawJson = File.ReadAllText("sample.json");

    var semiParsed = JObject.Parse(rawJson);
    var root = new JObject();

    foreach (var path in whitelist)
    {
        var value = semiParsed.SelectToken(path);
        if (value == null) continue; //no node exists under the path 
        var toplevelNode = CreateNode(path, value);
        root.Merge(toplevelNode);
    }

    Console.WriteLine(root);
}

- We read the JSON file and semi-parse it into a JObject.
- We define a root into which we will merge the processing results.
- We iterate through the whitelisted JSON paths to process them.
- We retrieve the actual value of the node (specified by the path) via the SelectToken call.
- If the path points to a non-existing node, SelectToken returns null and we skip the path.
- Then we create a new JObject which contains the full hierarchy and the retrieved value.
- Finally, we merge that object into the root.

Now let's see the two helper methods:

static JObject CreateNode(string path, JToken value)
{
    var entryLevels = path.Split('.').Skip(1).Reverse().ToArray();
    return CreateHierarchy(new Queue<string>(entryLevels), value);
}

- We split the path by dots and remove the first element ($).
- We reverse the order so that the Queue yields the innermost level first.
- We want to build up the hierarchy from the inside out.
- Finally, we call a recursive function with the queue and the retrieved value.

static JObject CreateHierarchy(Queue<string> pathLevels, JToken currentNode)
{
    if (pathLevels.Count == 0) return currentNode as JObject;

    var newNode = new JObject(new JProperty(pathLevels.Dequeue(), currentNode));
    return CreateHierarchy(pathLevels, newNode);
}

- We first define the exit condition to make sure that the recursion terminates.
- We create a new JObject with a single property: the dequeued level as its name and the current node as its value.

The output of the program will be the following:

{
  "path1": {
    "path1Inner1": {
      "id": "id1"
    }
  },
  "path2": {
    "path2Inner2": {
      "key": "key4"
    }
  }
}
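To make the inside-out construction concrete, here is a small sketch that runs the two helpers on a single path. The helpers are repeated from the answer so the snippet compiles on its own; the wrapper class `HierarchyDemo` and the hand-built `JValue` are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class HierarchyDemo
{
    // Same helper as in the answer above.
    static JObject CreateNode(string path, JToken value)
    {
        var entryLevels = path.Split('.').Skip(1).Reverse().ToArray();
        return CreateHierarchy(new Queue<string>(entryLevels), value);
    }

    // Same helper as in the answer above.
    static JObject CreateHierarchy(Queue<string> pathLevels, JToken currentNode)
    {
        if (pathLevels.Count == 0) return currentNode as JObject;

        var newNode = new JObject(new JProperty(pathLevels.Dequeue(), currentNode));
        return CreateHierarchy(pathLevels, newNode);
    }

    public static void Main()
    {
        // The queue holds "key", "path2Inner2", "path2" in that order,
        // so each recursion step wraps the previous node one level further out.
        var node = CreateNode("$.path2.path2Inner2.key", new JValue("key4"));

        Console.WriteLine(node.ToString(Formatting.None));
        // {"path2":{"path2Inner2":{"key":"key4"}}}
    }
}
```

Because the path is split on dots only, this works for plain property paths like the ones in the question; other JSONPath forms are not handled by these helpers.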

Peter Csala, many thanks for the attempt to solve my issue and for the time you spent implementing the solution. But there are actually some issues with it. First of all, the JSONPath syntax allows a huge number of different path types, like "$..id", "$..[?(@.id)]", etc. It's very hard to parse them all correctly, so it is much easier to pass value.Path to CreateNode. Even then, there are other cases that haven't been taken into account, like "$.path[0]". — ademchenko, Nov 26 '21 at 10:47
Also, if you don't build on the existing tree but try to recreate it, you may run into other issues that are hard to fix. For example, with the two whitelist paths "$.path[0]" and "$.path[5]" it's pretty hard to preserve the relative positions of item 0 and item 5 in the whitelisted "path" array. All these issues lead me to think that it is much easier to rely on the source tree and just remove the nodes that are not whitelisted. That is what I've actually done in my answer. — ademchenko, Nov 26 '21 at 10:48
@ademchenko As you said, my implementation is not bulletproof. I've shown you one option for getting started on your own solution by demonstrating some fundamental concepts. Because your question did not specify how generic the solution is supposed to be, I implemented a solution that works fine for the provided examples. — Peter Csala, Nov 26 '21 at 11:20
@ademchenko Regarding your second comment: again, your example was very simple and came without detailed requirements, so I obviously could not provide a thorough solution that works for each and every case. I'm glad that you have built your own solution which satisfies all your needs. — Peter Csala, Nov 26 '21 at 11:26
