
Due to a faulty server design, I have to stream down JSON and correct a null byte whenever I find one. I'm using Python requests to do this. Each JSON event is delimited by a \n. What I'm trying to do is pull down a chunk (which will always be smaller than one log line) and search it for the end-of-event signifier ("\"status\":\d+\d+\d+}}\n").

If the signifier is there, I do something with the full JSON event; if not, I add the chunk to a buffer, b, then grab the next chunk and look for the signifier again. Once I get this working, I'll start searching for the null byte.

b = ""

for d in r.iter_content(chunk_size=25):

    s = re.search("\"status\":\d+\d+\d+}}\n", d)

    if s:
        d = d.split("\n", 1)
        fullLogLine = b + d[0]
        b = d[1]
    else:
        b = b + d

I'm completely losing the value of b in this case. It doesn't seem to carry over through the iter_content. Whenever I try to print the value of just b it's empty. I feel I'm missing something obvious here. Anything helps. Thanks.

HectorOfTroy407
For starters, what's the meaning of `\d+\d+\d+`? That's just `\d+` in disguise, written in a way that can trip up some regex engines. – zwer Jun 13 '17 at 23:30

1 Answer


First of all, that regex is messed up: \d+ means 'one or more digits', so why chain three of them together? Also, you should use a raw string for this sort of pattern, because \ is an escape character in ordinary string literals and the pattern may not be built the way you expect. You'd want to change it to re.search(r'"status":\d+}}', d).
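To make the regex point concrete, here is a minimal sketch comparing the chained pattern with raw-string alternatives. The sample strings are invented for illustration and are not taken from the real server output.

```python
import re

# Hypothetical event tails, invented for illustration only.
sample = '{"event":{"id":7,"status":200}}\n'
short = '"status":42}}\n'

# \d+\d+\d+ needs at least three digits, so it behaves like \d{3,}.
chained = re.compile(r'"status":\d+\d+\d+}}\n')
explicit = re.compile(r'"status":\d{3,}}}\n')
simple = re.compile(r'"status":\d+}}\n')

three_digit = chained.search(sample) is not None
same_as_explicit = (explicit.search(sample) is not None) == three_digit

# A two-digit status is rejected by the chained form but accepted by \d+.
chained_short = chained.search(short) is not None
simple_short = simple.search(short) is not None
```

If the status is always exactly three digits, `\d{3}` would state that intent more directly than either form; that is an assumption about the data, not something the question confirms.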

Secondly, your d.split() line can pick up a wrong \n if there are two newlines in your chunk.
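A small sketch of that failure mode, using a made-up chunk in which a stray newline appears before the real end-of-event marker:

```python
# Hypothetical chunk: a stray newline precedes the real end-of-event marker.
chunk = 'a\nb","status":200}}\nnext'

# Splitting on the first newline cuts the event in the wrong place.
head, tail = chunk.split("\n", 1)

# Cutting just after the full marker keeps the event intact.
marker = '}}\n'
end = chunk.find(marker) + len(marker)
event_part, remainder = chunk[:end], chunk[end:]
```

Here `head` ends up holding only the text before the stray newline, while `event_part` runs through the actual marker.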

You don't even need regex for this, good ol' Python string search/slicing is more than enough to ensure you get your delimiters right:

logs = []  # store for our individual entries
buffer = []  # buffer for our partial chunks
for chunk in r.iter_content(chunk_size=25):  # read chunk-by-chunk...
    eoe = chunk.find("}}\n")  # seek the guaranteed event delimiter
    while eoe != -1:  # a potential delimiter found, let's dig deeper...
        value_index = chunk.rfind(":", 0, eoe)  # find the last colon before it
        if eoe-1 >= value_index >= eoe-4:  # woo hoo, there are 1-3 characters between
            try:  # lets see if it's a digit...
                status_value = int(chunk[value_index+1:eoe])  # omg, we're getting there...
                if chunk[value_index-8:value_index] == '"status"':  # ding, ding, a match!
                    buffer.append(chunk[:eoe+2])  # buffer everything up to the delimiter
                    logs.append("".join(buffer))  # flatten the buffer and write it to logs
                    chunk = chunk[eoe + 3:]  # remove everything before the delimiter
                    eoe = 0  # reset search position
                    buffer = []  # reset our buffer
            except (ValueError, TypeError):  # close but no cigar, ignore
                pass  # let it slide...
        eoe = chunk.find("}}\n", eoe + 1)  # maybe there is another delimiter in the chunk...
    buffer.append(chunk)  # add the current chunk to buffer
if buffer and buffer[0] != "":  # there is still some data in the buffer
    logs.append("".join(buffer))  # add it, even if not complete...

# Do whatever you want with the `logs` list...

It looks complicated, but it's actually quite easy if you read it line by line. You'd have to handle some of the same complexities (overlapping matches and such) with a regex match too, to account for potentially multiple event delimiters in the same chunk.
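For comparison, here is a rough sketch of the regex route. It is a different technique from the find/slice code above: it searches the accumulated text rather than each chunk on its own, so a marker that straddles two chunks is still found. The helper name `split_events` and the sample stream are hypothetical, and the sketch assumes the chunks arrive as text (on Python 3, `iter_content` yields bytes unless decoding is requested, so the pattern and literals would need to be bytes in that case).

```python
import re

# End-of-event marker: "status":<digits>}} followed by a newline.
EVENT_END = re.compile(r'"status":\d+}}\n')

def split_events(chunks):
    """Yield complete events from an iterable of text chunks (hypothetical helper)."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        start = 0
        # Search everything accumulated so far, not just the newest chunk.
        for match in EVENT_END.finditer(pending):
            yield pending[start:match.end() - 1]  # drop the trailing newline
            start = match.end()
        pending = pending[start:]  # keep the unfinished tail for the next chunk
    if pending:
        yield pending  # trailing, possibly incomplete, data

# Made-up stream cut into 25-character chunks, standing in for r.iter_content(chunk_size=25).
stream = '{"a":{"msg":"x","status":200}}\n{"a":{"msg":"y\\n","status":404}}\n'
chunks = [stream[i:i + 25] for i in range(0, len(stream), 25)]
events = list(split_events(chunks))
```

Re-scanning the pending text on every chunk is wasteful for very long events; it keeps the sketch short, and the pending text is trimmed each time an event completes.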

zwer