
Log file:

INFO:werkzeug:127.0.0.1 - - [20/Sep/2018 19:40:00] "GET /socket.io/?polling HTTP/1.1" 200 -
INFO:engineio: Received packet MESSAGE, ["key",{"data":{"tag1":12,"tag2":13,"tag3": 14"...}}]

I want to extract only the text within the square brackets that contain the keyword "key", not every occurrence that matches the regex pattern below.

Here is what I have tried so far:

import re
with open('logfile.log', 'r') as text_file:
    matches = re.findall(r'\[([^\]]+)', text_file.read())
    with open('output.txt', 'w') as out:
        out.write('\n'.join(matches))

This outputs every occurrence that matches the regex, including the timestamp in the first log line. The desired content of output.txt would look like this:

"key",{"data":{"tag1":12,"tag2":13,"tag3": 14"...}}
  • Will all the messages you want to extract contain *"key"*, or is that just an example? How much structure can be assumed for the output? – JohanL Sep 21 '18 at 16:49
  • Yes, the desired extracted messages will contain the same keyword "key". As far as output structure, it should contain all of the text inside of the square brackets from the example log file snippet from above. – spinState010 Sep 21 '18 at 16:53
  • Try `print(re.findall(r'\[([^][]*"key"[^][]*)]', text_file.read()))` if `"key"` can appear anywhere inside square brackets. – Wiktor Stribiżew Sep 21 '18 at 16:57
  • Then you can make that part of the regex you are looking for: `re.findall(r'\["key"([^\]]+)', text_file.read())`. Is that what you are looking for? – JohanL Sep 21 '18 at 16:57
  • @JohanL I tried that and it didn't seem to work, although it was in the right direction. Thanks for the reply! – spinState010 Sep 21 '18 at 17:55
  • Ah, you probably have a `*` in front of your `key` phrase (which makes it bold when written as text here). If you want to catch that as well, it would be `re.findall(r'\[\*"key"\*([^\]]+)', text_file.read())`, or you can of course use the more general search for `key` as in the accepted answer. – JohanL Sep 21 '18 at 18:26
  • @JohanL Sorry about any confusion the bold text may have caused. I just wanted to emphasize the word 'key'. vash_the_stampede has already taken the liberty of editing my post and removing the bold face font. – spinState010 Sep 21 '18 at 22:48

1 Answer

Text within square brackets that must not contain [ or ] itself, but must contain some other text, can be matched with the help of a [^][] negated character class, which matches any character other than ] and [.

That is, you may match the whole text within square brackets with \[[^][]*], and if you need to match some text inside, you need to put that text after [^][]* and then append another occurrence of [^][]* before the closing ].

You may use

re.findall(r'\[([^][]*"key"[^][]*)]', text_file.read()) 

See the Python demo:

import re
s = '''INFO:werkzeug:127.0.0.1 - - [20/Sep/2018 19:40:00] "GET /socket.io/?polling HTTP/1.1" 200 - 
INFO:engineio: Received packet MESSAGE, ["key",{"data":{"tag1":12,"tag2":13,"tag3": 14"...}}]'''
print(re.findall(r'\[([^][]*"key"[^][]*)]', s)) 

Output:

['"key",{"data":{"tag1":12,"tag2":13,"tag3": 14"...}}']
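To connect this back to the file-based code in the question, here is a minimal sketch that reads a log file, keeps only the bracketed groups containing "key", and writes them one per line. The helper name `extract_key_messages` and the use of a temporary directory are assumptions made for illustration; the pattern itself is the one shown above.

```python
import re
import tempfile
from pathlib import Path

# Pattern from the answer: bracketed text without nested [ or ] that contains "key".
KEY_IN_BRACKETS = re.compile(r'\[([^][]*"key"[^][]*)]')


def extract_key_messages(log_path, out_path):
    """Write each bracketed group containing "key" to out_path, one per line."""
    text = Path(log_path).read_text(encoding="utf-8")
    matches = KEY_IN_BRACKETS.findall(text)
    Path(out_path).write_text("\n".join(matches), encoding="utf-8")
    return matches


# Demo on a temporary copy of the two log lines from the question.
sample = (
    'INFO:werkzeug:127.0.0.1 - - [20/Sep/2018 19:40:00] '
    '"GET /socket.io/?polling HTTP/1.1" 200 -\n'
    'INFO:engineio: Received packet MESSAGE, '
    '["key",{"data":{"tag1":12,"tag2":13,"tag3": 14"...}}]\n'
)
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "logfile.log"
    out_file = Path(tmp) / "output.txt"
    log_file.write_text(sample, encoding="utf-8")
    found = extract_key_messages(log_file, out_file)
    written = out_file.read_text(encoding="utf-8")

print(found)
```

The timestamp group `[20/Sep/2018 19:40:00]` is skipped because it does not contain `"key"`, so only the engineio payload ends up in the output file.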
Wiktor Stribiżew
  • Thanks! This worked perfectly! To generalize this to keys such as *key1* or *key2*, I used `matches = re.findall(r'\[([^][]*"key.*"[^][]*)]', text_file.read())`. – spinState010 Sep 21 '18 at 18:11
  • @spinState010 It may be `key[12]` or `key\d+` instead of `key`. – Wiktor Stribiżew Sep 21 '18 at 18:15
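A short sketch of why `key\d+` may be preferable to `key.*` here: `.` also matches `[` and `]`, so a greedy `.*` can run past the closing bracket when one line holds several bracketed groups. The sample line below is hypothetical and not taken from the original log.

```python
import re

# Hypothetical line with several bracketed groups (not from the original log).
line = 'MESSAGE, ["key1",{"a":1}] then ["other",{"b":2}] and ["key22",{"c":3}]'

# Variant suggested in the comment: "key" followed by one or more digits.
numbered_key = re.compile(r'\[([^][]*"key\d+"[^][]*)]')
strict = numbered_key.findall(line)
print(strict)

# The looser "key.*" variant: the greedy .* is free to cross ] and [,
# so the three groups collapse into one long match.
loose_key = re.compile(r'\[([^][]*"key.*"[^][]*)]')
loose = loose_key.findall(line)
print(loose)
```

With the digit-based pattern, each numbered key is returned as its own match and the `"other"` group is left out; the `.*` version returns a single match that swallows everything between the first and last bracket.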