7

I have the following text:

I don't like to eat Cici's food (it is true)

I need to tokenize it to

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

I have found that the regex (['()\w]+|\.) splits it like this:

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(it', 'is', 'true)']

How do I separate the parentheses from the adjacent words so that each parenthesis becomes its own token?

Thanks for any ideas.

Wiktor Stribiżew
Jürgen K.

2 Answers

6

When you need to tokenize a string with a regex and the rules depend on context, matching the tokens (rather than splitting on the separators) usually yields cleaner output, mainly because it does not leave empty elements in the resulting list.
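To illustrate the point about empty elements, here is a minimal sketch comparing the two approaches on the string from the question. The patterns are chosen only for illustration, and the variable names are hypothetical.

```python
import re

s = "I don't like to eat Cici's food (it is true)"

# Splitting: whitespace is discarded, parentheses are kept via a capturing group.
# re.split() yields None for a group that did not take part in a match and ''
# where two separators are adjacent or a separator ends the string.
split_tokens = re.split(r"\s+|([()])", s)

# Matching: only the tokens themselves are returned, so no cleanup is needed.
match_tokens = re.findall(r"[^()\s]+|[()]", s)

# The split result needs an extra filtering pass to get the same list.
cleaned = [t for t in split_tokens if t]

print(split_tokens)
print(match_tokens)
```

After filtering, both lists hold the same tokens; the matching version simply gets there without the extra step.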

\w matches any word character and \W matches any non-word character. To tokenize a string into runs of word and non-word characters, \w+|\W+ would be enough. In your case, however, you want two kinds of tokens: a run of word characters, optionally followed by an apostrophe and one or more further word characters, and any other single character that is not whitespace.
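As a quick check of why the plain \w+|\W+ pattern is not enough here, this small sketch runs it on a shortened sample string (the sample and the variable names are made up for illustration):

```python
import re

sample = "don't (it)"

# Alternating runs of word and non-word characters
naive_tokens = re.findall(r"\w+|\W+", sample)

print(naive_tokens)
# ['don', "'", 't', ' (', 'it', ')']
```

The contraction is broken at the apostrophe, and the space is glued to the opening parenthesis, which is why a more specific pattern is needed.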

Use

re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

Here, \w+(?:'\w+)? matches words such as people or people's, and [^\w\s] matches any single character that is neither a word character nor whitespace.


Python demo:

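import re

# A word with an optional 's / 't style suffix, or any single non-word, non-space character
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']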

Alternatively, if only ( and ) should become separate tokens, use:

[^()\s]+|[()]


Here, [^()\s]+ matches 1 or more symbols other than (, ) and whitespace, and [()] matches either ( or ).
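The practical difference between the two patterns shows up with other punctuation. A minimal sketch, using a made-up sample string with a comma and an exclamation mark added:

```python
import re

sample = "food, (it is true)!"

# Only parentheses are split off; other punctuation stays attached to the word
paren_only = re.findall(r"[^()\s]+|[()]", sample)

# Every non-word, non-space character becomes its own token
every_symbol = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)

print(paren_only)    # ['food,', '(', 'it', 'is', 'true', ')', '!']
print(every_symbol)  # ['food', ',', '(', 'it', 'is', 'true', ')', '!']
```

Which one fits depends on whether commas, periods and the like should be tokens of their own.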

Wiktor Stribiżew
0

You should match single-character tokens (the parentheses, in this case) separately from the characters that form a token as a sequence:

([().]|['\w]+)

Demo: https://regex101.com/r/RQfYhL/2
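For completeness, a short sketch of how this pattern could be used from Python with re.findall (the second sample string is made up to show two edge cases):

```python
import re

pattern = r"([().]|['\w]+)"

s = "I don't like to eat Cici's food (it is true)"
tokens = re.findall(pattern, s)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

# Edge cases: the comma is not covered by either alternative and is dropped,
# and quote marks around a word stay attached because ' is in the word class.
edge = re.findall(pattern, "yes, 'quoted'.")
print(edge)
# ['yes', "'quoted'", '.']
```

Because the pattern has a single capturing group around the whole expression, re.findall returns the full match for each token.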

Dmitry Egorov