How to get an unknown substring between two known substrings, within a giant string/file

Question

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...

The unusual part: this ndjson file is 1,071,552 characters long and consists of a single line (so "for line in file:" is pointless, since there's only one).

The best I found was this: How to find a substring of text with a known starting point but unknown ending point in python

But if I use this, the result doesn't stop at Month and returns the whole remainder of the file, the same as partition()[2].

Note that Month is only an example: customLabel has about 300 variants and they are not listed anywhere (I'm actually doing this to list them...)

To give some details, here's my script so far:

with open("file.ndjson","rt", encoding='utf-8') as ndjson:
    filedata = ndjson.read()
    x="customLabel"
    count=filedata.count(x)
    for i in range (count):
        if filedata.find(x)>0:
            print("Found "+str(i+1))

So right now it correctly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, so I can locate them more easily and use replace() for translations later on.

I'd guess regex is the solution, but I'm pretty new to it, so I'm posting this question while I learn about it...

Momo · Accepted Answer · 2022-10-11T12:29:56.353

0

If you want to search for all (even nested) customLabel values like this:

{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}

you can use RegEx patterns with the re module

import re

label_values = []
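# Capture the token after the colon: digits, letters, underscores and double quotes
# (0-9 so that zeros match; A-Z because the range A-z also matches [ \ ] ^ and `)
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z_\"]+)"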
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)

print(label_values) # ['"Month"', '23525235']

# If you don't want the items to keep their quotation marks
label_values = [i.replace('"', "") for i in label_values]
print(label_values) # ['Month', '23525235']
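The pattern above stops at the first character outside its class, so a label containing a space would be cut short. Here is a minimal sketch of a variant that captures everything up to the closing quote and runs on the whole text at once, which also suits a single-line file. The helper name `extract_labels` and the sample string are illustrative assumptions, and it assumes label values are strings without escaped quotes:

```python
import re

# Matches "customLabel":"<value>" and captures <value> up to the next double quote.
# Assumption: label values are strings that contain no escaped quotes.
LABEL_PATTERN = re.compile(r'"customLabel"\s*:\s*"([^"]*)"')


def extract_labels(text):
    """Return every string value that follows a "customLabel" key in text."""
    return LABEL_PATTERN.findall(text)


# Made-up sample resembling the snippet from the question.
sample = '{"a":{"customLabel":"Month"},"schema":"metric","b":{"customLabel" : "Unique count"}}'
labels = extract_labels(sample)
print(labels)  # ['Month', 'Unique count']
```

With the real file, passing the result of `ndjson.read()` to `extract_labels` should work the same way, and `sorted(set(labels))` would give the distinct labels.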

Note: If you're only dealing with ndjson files and don't need nested searching, it's better to parse each line with the json module and read the value of your key (customLabel) directly.

import json

label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))


print(label_values) # ['Month']
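The json approach above only looks at top-level keys. If the labels may sit at any depth, one option is to walk each parsed line recursively. This is only a sketch: `collect_labels` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json


def collect_labels(node, key="customLabel"):
    """Recursively gather the values stored under `key` anywhere inside `node`."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: nested objects may hold further labels.
            found.extend(collect_labels(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_labels(item, key))
    return found


# Made-up ndjson content: one JSON document per line.
sample_lines = [
    '{"customLabel": "Month", "otherJson": {"customLabel": 23525235}}',
    '{"aggs": [{"params": {"customLabel": "Year"}}, {"params": {}}]}',
]

label_values = []
for line in sample_lines:
    label_values.extend(collect_labels(json.loads(line)))

print(label_values)  # ['Month', 23525235, 'Year']
```

If a value is itself a JSON-encoded string (the escaped quotes in the asker's follow-up comment suggest this may be the case), it would probably need its own `json.loads` call before being walked.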

edited Oct 11 '22 at 12:29

answered Oct 11 '22 at 11:05

Seems like it doesn't work fully: the label_values is full of "None". The right amount tho, so it did get all the customLabel, but wasn't able to get their value? – Ipatchy.M Oct 11 '22 at 11:29
@Ipatchy.M It stores all the values for key `customLabel` into the list. If the value of a key is `null` then it'll be `None` when it gets parsed into Python objects. Can you add more text from your file (including customLabel) to the post, so I can test it? – Momo Oct 11 '22 at 11:41
I actually made it, here's the code: `with open("file.ndjson", "rt", encoding="utf-8") as ndjson: for line in ndjson: line_json = json.loads(line) rightPart=line.partition('customLabel\\":\\"')[2] label_value=rightPart.partition('\\"},\\"schema')[0] label_values.append(label_value) print(label_values)` Now label_values isn't full of None: some are, since not every line has a customLabel, but most others contain values, so it's done. Thanks for the json module, that made the difference! – Ipatchy.M Oct 11 '22 at 11:49
@Ipatchy.M Good! I also updated my answer to provide a RegEx based solution for nested objects and removing `None`s from the result of the previous `json` solution. – Momo Oct 11 '22 at 12:32
