My goal is to index a dataset that includes free-form JSON data. For this project I'm constrained to Azure Cognitive Search (ACS). ACS handles all the data parsing from the external data source, so I'm left to work with its built-in tools (one of which is regex-based tokenization).
I'm using the Python SDK, but it is merely a wrapper around the API used to configure ACS, so I have no ability to parse the data on my end. It's not the ideal situation, but I'm working with the hand I've been dealt.
That said, here is what I'm looking for feedback on:
I have a JSON like the following:
{
    "CategoryName":"Desktop",
    "ProductDescription":"Computer",
    "ExternalManufacturerName":[
        "Computer Maker A",
        "Computer Maker B"
    ],
    "Brand":"Computer Maker A"
}
I am using an index to query this JSON (a Lucene query in Azure Cognitive Search). I want to break the JSON into tokens consisting of the individual key-value pairs. For example:
[
    "\"CategoryName\":\"Desktop\"",
    "\"ProductDescription\":\"Computer\"",
    "\"ExternalManufacturerName\":[
        \"Computer Maker A\",
        \"Computer Maker B\"
    ]",
    "\"Brand\":\"Computer Maker A\""
]
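I can't run code like this inside ACS, but to make the target behavior concrete, here is a hypothetical depth-tracking splitter in Python (names and structure are my own) that produces exactly the token list above, treating commas inside nested arrays/objects or strings as non-delimiters:

```python
def split_top_level(raw: str) -> list[str]:
    """Split a JSON object string into top-level "key":value tokens,
    ignoring commas inside nested arrays/objects and inside strings."""
    body = raw.strip()
    assert body.startswith('{') and body.endswith('}')
    body = body[1:-1]  # drop the outer braces

    tokens, buf = [], []
    depth, in_string, prev = 0, False, ''
    for ch in body:
        if in_string:
            buf.append(ch)
            if ch == '"' and prev != '\\':  # unescaped quote ends the string
                in_string = False
        elif ch == '"':
            in_string = True
            buf.append(ch)
        elif ch in '[{':
            depth += 1
            buf.append(ch)
        elif ch in ']}':
            depth -= 1
            buf.append(ch)
        elif ch == ',' and depth == 0:  # only split at the top level
            tokens.append(''.join(buf).strip())
            buf = []
        else:
            buf.append(ch)
        prev = ch
    if buf:
        tokens.append(''.join(buf).strip())
    return tokens
```

The point is that the splitter needs a depth counter, i.e. state, which is exactly what a plain regex delimiter pattern lacks.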
The indexer I'm using can split strings into tokens wherever a delimiter matches a regex pattern. This starts out simply enough: I can split on \{\s+|,\s+|\s+\}
to handle a depth-1 JSON. However, I have JSONs of depth 2 and more, and I would like the regex to recognize when it is inside a deeper nesting level and ignore the delimiters there.
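To illustrate the failure mode, here is the same delimiter pattern applied with Python's re.split (just a local sketch; the sample strings are my own). It works for a flat object but breaks a nested array apart, because the comma inside the array matches the delimiter too:

```python
import re

# The delimiter pattern from above: opening brace, comma, or closing brace.
DELIM = re.compile(r'\{\s+|,\s+|\s+\}')

flat = '{ "a":"1", "b":"2" }'
print([t for t in DELIM.split(flat) if t])
# works: ['"a":"1"', '"b":"2"']

nested = '{ "a":"1", "b":[ "x", "y" ] }'
print([t for t in DELIM.split(nested) if t])
# broken: ['"a":"1"', '"b":[ "x"', '"y" ]'] -- the array is split apart
```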
How can I do this?