
Suppose I have some complex JSON:

{
    "path1": {
        "path1Inner1": {
            "id": "id1"
        },
        "path1Inner2": {
            "id": "id2"
        }
    },
    "path2": {
        "path2Inner1": {
            "id": "id3"
        },
        "path2Inner2": {
            "id": "id4",
            "key": "key4"
        }
    }
}

There is also a whitelist of JSON path expressions, for example:

  • $.path1.path1Inner1
  • $.path2.path2Inner2.key

I want to keep in the JSON tree only the nodes and properties that match the "whitelist", so the result would be:

{
    "path1": {
        "path1Inner1": {
            "id": "id1"
        }
    },
    "path2": {
        "path2Inner2": {
            "key": "key4"
        }
    }
}

In other words, this is not just a selection by JSON path (which is a trivial task): the nodes and properties have to keep their original place in the source JSON tree.

Peter Csala
ademchenko
  • Could you please fix your json samples to be valid ones? – Peter Csala Nov 23 '21 at 11:10
  • I think `JsonExtensions.RemoveAllExcept<TJToken>(this TJToken obj, IEnumerable<string> paths)` from [this answer](https://stackoverflow.com/a/30333562/3744182) to [How to perform partial object serialization providing "paths" using Newtonsoft JSON.NET](https://stackoverflow.com/q/30304128/3744182) does what you want. Can you confirm? – dbc Nov 23 '21 at 19:30
  • @ademchenko Did any of the proposed solutions work for you? – Peter Csala Nov 25 '21 at 14:13
  • @Peter Csala I'm studying both of the solutions and will make a decision shortly. – ademchenko Nov 25 '21 at 14:59
  • @dbc thanks for your link. I've managed to implement the easier solution, using the idea of your approach. – ademchenko Nov 26 '21 at 11:13

2 Answers


First of all, many thanks for this and this answer. They became the starting point of my analysis of the problem.

Those answers present two different approaches to "whitelisting" by paths. The first one rebuilds the whitelisted structure from scratch, i.e. it starts from an empty object and creates the needed routes. The implementation parses the path strings and tries to rebuild the tree from the parsed paths. This approach requires a lot of manual work to handle every possible type of path and is therefore error-prone. You can find some of the mistakes I have found in my comment to the answer.

The second approach is based on the json.net object tree API (Parent, Ancestors, Descendants, etc.). The algorithm traverses the tree and removes the paths that are not "whitelisted". I find that approach much easier and much less error-prone, and it supports a wide range of cases "in one go".
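
To illustrate the parent chain that such a traversal relies on, here is a minimal sketch. The class and method names (`AncestorWalkDemo`, `TypesUpToRoot`) are made up for illustration; only `SelectToken`, `Parent` and `Type` come from json.net. It shows that between a selected value and each enclosing object there is always a `JProperty`, which is why the sibling scan only needs to look at `JObject` and `JArray` parents.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

public static class AncestorWalkDemo
{
    // Collects the token types from the given token up to the root.
    public static List<JTokenType> TypesUpToRoot(JToken start)
    {
        var types = new List<JTokenType>();
        for (var node = start; node != null; node = node.Parent)
            types.Add(node.Type);
        return types;
    }

    public static void Main()
    {
        var root = JObject.Parse(
            @"{ ""path2"": { ""path2Inner2"": { ""id"": ""id4"", ""key"": ""key4"" } } }");

        var selected = root.SelectToken("$.path2.path2Inner2.key");

        // Prints: String, Property, Object, Property, Object, Property, Object
        Console.WriteLine(string.Join(", ", TypesUpToRoot(selected)));
    }
}
```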

The algorithm I have implemented is similar in many respects to the second answer but is, I think, much easier to implement and understand. I also don't expect its performance to be worse.

    using System.Collections.Generic;
    using System.Linq;
    using Newtonsoft.Json.Linq;

    public static class JsonExtensions
    {
        public static TJToken RemoveAllExcept<TJToken>(this TJToken token, IEnumerable<string> paths) where TJToken : JContainer
        {
            HashSet<JToken> nodesToRemove = new(ReferenceEqualityComparer.Instance);
            HashSet<JToken> nodesToKeep = new(ReferenceEqualityComparer.Instance);

            foreach (var whitelistedToken in paths.SelectMany(token.SelectTokens))
                TraverseTokenPath(whitelistedToken, nodesToRemove, nodesToKeep);

            // None of the paths matched any token, so nothing is whitelisted
            if (nodesToKeep.Count == 0)
            {
                token.RemoveAll();
                return token;
            }

            nodesToRemove.ExceptWith(nodesToKeep);

            foreach (var notWhitelistedNode in nodesToRemove)
                notWhitelistedNode.Remove();

            return token;
        }

        private static void TraverseTokenPath(JToken value, ISet<JToken> nodesToRemove, ISet<JToken> nodesToKeep)
        {
            JToken? immediateValue = value;

            do
            {
                nodesToKeep.Add(immediateValue);

                if (immediateValue.Parent is JObject or JArray)
                {
                    foreach (var child in immediateValue.Parent.Children())
                        if (!ReferenceEqualityComparer.Instance.Equals(child, value))
                            nodesToRemove.Add(child);
                }

                immediateValue = immediateValue.Parent;
            } while (immediateValue != null);
        }
    }
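
As a usage sketch, the extension can be applied to the JSON and whitelist from the question. This assumes the `JsonExtensions` class above is compiled into the same project; `WhitelistDemo` and `Run` are hypothetical names used only for this example.

```csharp
using System;
using Newtonsoft.Json.Linq;

public static class WhitelistDemo
{
    // Applies the whitelist from the question to the question's sample JSON.
    public static JObject Run()
    {
        var root = JObject.Parse(@"{
            ""path1"": {
                ""path1Inner1"": { ""id"": ""id1"" },
                ""path1Inner2"": { ""id"": ""id2"" }
            },
            ""path2"": {
                ""path2Inner1"": { ""id"": ""id3"" },
                ""path2Inner2"": { ""id"": ""id4"", ""key"": ""key4"" }
            }
        }");

        // RemoveAllExcept (defined above) mutates the tree in place
        // and returns the same instance.
        return root.RemoveAllExcept(new[]
        {
            "$.path1.path1Inner1",
            "$.path2.path2Inner2.key"
        });
    }

    public static void Main() => Console.WriteLine(Run());
}
```

Note that the source tree itself is modified; if the original is still needed, calling `DeepClone()` on it first is one option.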

JToken instances must be compared with a reference equality comparer, because some JToken types, such as JValue, compare by value. Otherwise you can get buggy behaviour in some cases.

For example, given the source JSON

{
   "path2":{
      "path2Inner2":[
         "id",
         "id"
      ]
   }
}

and the path $..path2Inner2[0], a value-based comparison produces the result JSON

{
   "path2":{
      "path2Inner2":[
         "id",
         "id"
      ]
   }
}

instead of the expected

{
   "path2":{
      "path2Inner2":[
         "id"
      ]
   }
}
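
The underlying behaviour can be checked in isolation. The following is a minimal sketch (the class name `ValueEqualityDemo` is made up for illustration) showing that two distinct JValue instances holding the same string are equal by value, so a set using the default comparer cannot tell them apart.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

public static class ValueEqualityDemo
{
    public static void Main()
    {
        var array = JArray.Parse(@"[""id"", ""id""]");
        JToken first = array[0];
        JToken second = array[1];

        // JValue overrides Equals to compare the wrapped values.
        Console.WriteLine(first.Equals(second));                 // True
        // They are nevertheless two separate nodes in the tree.
        Console.WriteLine(object.ReferenceEquals(first, second)); // False

        // A set with the default comparer treats the second item as already present.
        var byValue = new HashSet<JToken> { first };
        Console.WriteLine(byValue.Contains(second));              // True
    }
}
```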

Starting with .NET 5.0, the standard ReferenceEqualityComparer can be used. If you use an earlier version of .NET, you might need to implement it yourself.
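
For earlier targets, a stand-in comparer might look like the sketch below. `ReferenceComparer<T>` is a hypothetical name, not a framework type; it relies only on `object.ReferenceEquals` and `RuntimeHelpers.GetHashCode`, both of which are available in older versions of .NET.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical replacement for ReferenceEqualityComparer on targets older than .NET 5.
public sealed class ReferenceComparer<T> : IEqualityComparer<T> where T : class
{
    public static readonly ReferenceComparer<T> Instance = new ReferenceComparer<T>();

    private ReferenceComparer() { }

    // Two references are equal only if they point to the same object.
    public bool Equals(T x, T y) => object.ReferenceEquals(x, y);

    // Identity-based hash code that ignores any GetHashCode override on T.
    public int GetHashCode(T obj) => RuntimeHelpers.GetHashCode(obj);
}
```

With this in place, the sets in the code above could be created as `new HashSet<JToken>(ReferenceComparer<JToken>.Instance)`, and the sibling check could use `ReferenceComparer<JToken>.Instance.Equals(child, value)` or plain `object.ReferenceEquals`.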

ademchenko

Suppose you have valid JSON in a sample.json file:

{
  "path1": {
    "path1Inner1": {
      "id": "id1"
    },
    "path1Inner2": {
      "id": "id2"
    }
  },
  "path2": {
    "path2Inner1": {
      "id": "id3"
    },
    "path2Inner2": {
      "id": "id4",
      "key": "key4"
    }
  }
}

Then you can achieve the desired output with the following program:

    using System;
    using System.IO;
    using Newtonsoft.Json.Linq;

    static void Main()
    {
        var whitelist = new[] { "$.path1.path1Inner1", "$.path2.path2Inner2.key" };
        var rawJson = File.ReadAllText("sample.json");

        var semiParsed = JObject.Parse(rawJson);
        var root = new JObject();

        foreach (var path in whitelist)
        {
            var value = semiParsed.SelectToken(path);
            if (value == null) continue; // no node exists under the path
            var toplevelNode = CreateNode(path, value);
            root.Merge(toplevelNode);
        }

        Console.WriteLine(root);
    }

  1. We read the JSON file and semi-parse it into a JObject
  2. We define a root into which we will merge the processing results
  3. We iterate through the whitelisted JSON paths to process them
  4. We retrieve the actual value of the node (specified by the path) via the SelectToken call
  5. If the path points to a non-existing node, SelectToken returns null and we skip that path
  6. Otherwise we create a new JObject which contains the full hierarchy and the retrieved value
  7. Finally we merge that object into the root

Now let's see the two helper methods:

    // Requires: using System.Collections.Generic; using System.Linq;
    static JObject CreateNode(string path, JToken value)
    {
        var entryLevels = path.Split('.').Skip(1).Reverse().ToArray();
        return CreateHierarchy(new Queue<string>(entryLevels), value);
    }

  1. We split the path by dots and remove the first element ($)
  2. We reverse the order before putting the levels into a Queue
  3. This way we build up the hierarchy from the inside out
  4. Finally we call a recursive function with the queue and the retrieved value

    static JObject CreateHierarchy(Queue<string> pathLevels, JToken currentNode)
    {
        // Exit condition: every level has been wrapped around the value
        if (pathLevels.Count == 0) return currentNode as JObject;

        var newNode = new JObject(new JProperty(pathLevels.Dequeue(), currentNode));
        return CreateHierarchy(pathLevels, newNode);
    }

  1. We first define the exit condition to make sure that the recursion terminates
  2. We create a new JObject with a single property, whose name is the next path level and whose value is the node built so far

The output of the program will be the following:

{
  "path1": {
    "path1Inner1": {
      "id": "id1"
    }
  },
  "path2": {
    "path2Inner2": {
      "key": "key4"
    }
  }
}
Peter Csala
  • Peter Csala, many thanks for the attempt to solve my issue and the time you've spent implementing the solution. But there are actually some issues with it. First of all, JSONPath syntax allows a huge number of different kinds of paths, like "$..id", "$..[?(@.id)]", etc. It's very hard to parse them all correctly. So it is much easier to pass value.Path to CreateNode. But, anyway, there are other cases that haven't been taken into account, like "$.path[0]". – ademchenko Nov 26 '21 at 10:47
  • Also, if you don't rely on the existing tree but try to recreate it, you may run into other issues that are hard to fix. For example, with two whitelist paths "$.path[0]" and "$.path[5]", it's pretty hard to preserve the relative positions of item 0 and item 5 in the whitelisted "path" array. All those issues make me think that it is much easier to rely on the source tree and just remove the nodes that are not whitelisted. That is what I've actually done in my answer. – ademchenko Nov 26 '21 at 10:48
  • @ademchenko As you said, my implementation is not bulletproof. I've shown you one option for getting started on building your own, by demonstrating some fundamental concepts. Because your question did not specify how generic the solution is supposed to be, I implemented a solution that works fine for the provided examples. – Peter Csala Nov 26 '21 at 11:20
  • @ademchenko Regarding your second comment: yet again, your example was super simple, without detailed requirements. So obviously I could not provide a thorough solution that works for each and every case. I'm glad that you have built your own solution that satisfies all your needs. – Peter Csala Nov 26 '21 at 11:26