
I want to extract information from different sentences, so I'm using NLTK to split each sentence into words. I'm using this code:

import nltk

words = []
for sentence in sentences:
    # one list of tokens per sentence
    words.append(nltk.word_tokenize(sentence))

It works pretty well, but I want something a little bit different. For example, I have this sentence: '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']' and I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)" to be one word, not split into several single words.
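
For reference, here is a minimal way to reproduce what I mean (a small sketch; it assumes NLTK and its punkt tokenizer data are installed, and the name log_line is only for this example):

import nltk

log_line = ('Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] '
            '"POST /test/itf/ HTTP/x.x" 404 146 "-" '
            '"Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"')

# word_tokenize treats the brackets, parentheses, commas and semicolons as
# separate tokens, so the quoted user-agent comes back as many small pieces
# instead of a single token.
print(nltk.word_tokenize(log_line))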

UPDATE: I want something like this:

[
 'Jan',
 '31',
 '19:28:14',
 'nginx',
 '10.0.0.0',
 '31/Jan/2019:19:28:14',
 '+0100',
 'POST',
 '/test/itf/',
 'HTTP/x.x',
 '404',
 '146',
 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']

Any idea how to make this possible? Thank you in advance

Hermoine
  • This is not a natural sentence; it is a log line. What about a regex? `m = re.search(r'.*"(.*)"', sentences[i])` and then `if m:` `words.append(m.group(1))`? (A runnable version of this snippet follows these comments.) If you need other "words" from this "sentence", please clarify. – Wiktor Stribiżew Jan 14 '22 at 12:39
  • @WiktorStribiżew Thank you so much, it works, but I want the other words to be tokenized as well as what you did. Is that possible? Or should I use my method to tokenize everything into words and then add your code, so that I end up with a list containing words (some interesting, some I will ignore) plus the user_agent as one word? What do you think? – Hermoine Jan 14 '22 at 12:56
  • You can combine it, but I am not sure whether this is going to work for you, since you will have a list of tokens and a string. See https://ideone.com/xwT8PT – Wiktor Stribiżew Jan 14 '22 at 12:57
  • @Chris I did an update, I think it's clear now :D Thank you for your time – Hermoine Jan 14 '22 at 13:02
  • Or, see https://ideone.com/WgP8qs, does it help? – Wiktor Stribiżew Jan 14 '22 at 13:11
  • @WiktorStribiżew Thank you so much, it worked. Could you add it as an answer so I can accept it? :D Have a good day – Hermoine Jan 14 '22 at 13:49
  • Not a big issue, `\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"` will work with any time format. – Wiktor Stribiżew Jan 14 '22 at 23:39
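
For reference, a runnable version of the regex snippet from the first comment above (a minimal sketch that only extracts the quoted user-agent; it assumes sentences is the list of log lines from the question):

import re

words = []
for sentence in sentences:
    # capture the contents of the last pair of double quotes (the user agent)
    m = re.search(r'.*"(.*)"', sentence)
    if m:
        words.append(m.group(1))

# words -> ['Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']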

3 Answers


You can import re and parse the log line (which is not a natural language sentence) with a regex:

import re
import nltk

sentences = ['[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']']

rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')

words = []
for sent in sentences:
    m = rx.search(sent)
    if m:
        # log line: use the regex capture groups as the "words"
        words.append(list(m.groups()))
    else:
        # anything else: fall back to normal NLTK tokenization
        words.append(nltk.word_tokenize(sent))

print(words)

See the Python demo.

The output will look like

[['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]
Wiktor Stribiżew
  • Another regex for `1:2:12` time: `\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"` (see [demo](https://regex101.com/r/EKRcjg/1)). – Wiktor Stribiżew Jan 14 '22 at 23:39
  • I have another question regarding that; the solution works pretty well. I also use a BERT model for this: I trained the model using your approach for the user_agent, but when I look at the prediction, `prediction, model_output = model.predict(sentence)`, I get the user agent split up again. Do you have any idea why? Thank you in advance – Hermoine Jan 17 '22 at 13:42
  • @Hermoine The text tokenization at training must match the text tokenization at inference (a minimal sketch of one way to do that follows these comments). – Wiktor Stribiżew Jan 17 '22 at 13:54
  • Yes, it's the same text. Could this line be the reason: `model = NERModel('bert', 'bert-base-cased', labels=label, use_cuda=False, args=args)`? Using plain NER without the regex would cause that, so I used your method for my dataframe, but the question is how I can use it to train the model too! – Hermoine Jan 17 '22 at 14:10
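
One way to keep training and prediction consistent (a minimal sketch, not part of the accepted answer; the tokenize helper is a suggested name, and rx is the same regex as in the answer above):

import re
import nltk

rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')

def tokenize(sentence):
    # Log lines become the regex capture groups; anything else falls back
    # to NLTK's tokenizer.
    m = rx.search(sentence)
    return list(m.groups()) if m else nltk.word_tokenize(sentence)

# Call tokenize() both when building the training dataframe and on the text
# passed to the model at prediction time, so training and inference see
# exactly the same tokens.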

First you need to choose whether to use " or ', because mixing both is unusual and can cause strange behavior. After that it is just string formatting:

s='"[\"Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\"]" i want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'

words = s.split(' ')  # split the sentence on spaces
# ['"["Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14', '+0100]', '"POST', '/test/itf/', 'HTTP/x.x"', '404', '146', '"-"', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)""]"', 'i', 'want', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)"']

# then access your data list
words[0] # '"["Jan'
words[1] # '31'
words[2] # '19:28:14'
Franz Kurt
  • It gives me the same output that my code gives me. I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)" as a single word, not split into multiple words :D – Hermoine Jan 14 '22 at 13:03

You could do that using partition() with a space delimiter, a regex, and recursion, as below. I have to say, though, that this solution is tied to the exact string format you provided.

import re

s_list = []

def str_partition(text):
    # split off the first space-separated chunk
    parts = text.partition(" ")
    # strip brackets, quotes and dashes from that chunk
    part = re.sub(r'[\[\]"\'\-]', '', parts[0])

    if part.startswith("nginx"):
        s_list.append(part.replace(":", ''))
    elif part != "":
        s_list.append(part)

    if not parts[2].startswith('"Moz'):
        # keep consuming the rest of the line, chunk by chunk
        str_partition(parts[2])
    else:
        # everything from the user-agent onward stays as one piece
        part = re.sub(r'["\']', '', parts[2])
        part = part[:-1]  # drop the trailing ]
        s_list.append(part)
        return

s = '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'     
str_partition(s)       
print(s_list)

Output:

['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100',
'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
Chris