My goal is to index a dataset that includes free-form JSON data. For this project I'm constrained to Azure Cognitive Search (ACS). ACS handles all the data parsing from the external data source, so I'm left to work with its built-in tools (one of which is regex-based tokenization).
I'm using the Python SDK, but it is merely a wrapper around the API used to configure ACS, so I have no ability to parse the data on my end. It's not the ideal situation, but I'm working with the hand I've been dealt.
That said, here is what I'm looking for feedback on:
I have a JSON like the following:
{
    "CategoryName":"Desktop",
    "ProductDescription":"Computer",
    "ExternalManufacturerName":[
        "Computer Maker A",
        "Computer Maker B"
    ],
    "Brand":"Computer Maker A"
}
I am using an index to query this JSON (a Lucene query in Azure Cognitive Search). I want to break the JSON into tokens consisting of the individual key-value pairs. For example:
[
    "\"CategoryName\":\"Desktop\"",
    "\"ProductDescription\":\"Computer\"",
    "\"ExternalManufacturerName\":[
        \"Computer Maker A\",
        \"Computer Maker B\"
    ]",
    "\"Brand\":\"Computer Maker A\""
]
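I can't run code like this inside ACS, but to make the target behavior concrete, here is a hypothetical depth-tracking splitter in Python (names and structure are my own) that produces exactly the token list above, treating commas inside nested arrays/objects or strings as non-delimiters:

```python
def split_top_level(raw: str) -> list[str]:
    """Split a JSON object string into top-level "key":value tokens,
    ignoring commas inside nested arrays/objects and inside strings."""
    body = raw.strip()
    assert body.startswith('{') and body.endswith('}')
    body = body[1:-1]  # drop the outer braces

    tokens, buf = [], []
    depth, in_string, prev = 0, False, ''
    for ch in body:
        if in_string:
            buf.append(ch)
            if ch == '"' and prev != '\\':  # unescaped quote ends the string
                in_string = False
        elif ch == '"':
            in_string = True
            buf.append(ch)
        elif ch in '[{':
            depth += 1
            buf.append(ch)
        elif ch in ']}':
            depth -= 1
            buf.append(ch)
        elif ch == ',' and depth == 0:  # only split at the top level
            tokens.append(''.join(buf).strip())
            buf = []
        else:
            buf.append(ch)
        prev = ch
    if buf:
        tokens.append(''.join(buf).strip())
    return tokens
```

The point is that the splitter needs a depth counter, i.e. state, which is exactly what a plain regex delimiter pattern lacks.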
The indexer I'm using can split strings into tokens wherever a delimiter matches a regex pattern. This starts out simply enough: I can split on \{\s+|,\s+|\s+\}
to handle a depth-1 JSON. However, I have JSONs of depth 2 and more, and I would like the regex to recognize when it is inside a deeper nesting level and ignore the delimiters there.
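To illustrate the failure mode, here is the same delimiter pattern applied with Python's re.split (just a local sketch; the sample strings are my own). It works for a flat object but breaks a nested array apart, because the comma inside the array matches the delimiter too:

```python
import re

# The delimiter pattern from above: opening brace, comma, or closing brace.
DELIM = re.compile(r'\{\s+|,\s+|\s+\}')

flat = '{ "a":"1", "b":"2" }'
print([t for t in DELIM.split(flat) if t])
# works: ['"a":"1"', '"b":"2"']

nested = '{ "a":"1", "b":[ "x", "y" ] }'
print([t for t in DELIM.split(nested) if t])
# broken: ['"a":"1"', '"b":[ "x"', '"y" ]'] -- the array is split apart
```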
How can I do this?