Tokenize by using regular expressions (parenthesis)

Question

I have the following text:

I don't like to eat Cici's food (it is true)

I need to tokenize it to

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

I found that the regex (['()\w]+|\.) splits it like this:

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(it', 'is', 'true)']

How do I take each parenthesis out of the neighbouring token and make it a token of its own?

Thanks for any ideas.

Do you plan to split or match these tokens? It might be easier to match them with [`\w+(?:'\w+)?|[^\w\s]`](https://regex101.com/r/kYcsPD/1). — Wiktor Stribiżew, Mar 29 '17 at 12:04
What is the difference between split and match? To sum up the problem, what I need is (foo) -> ["(", "foo", ")"] — Jürgen K., Mar 29 '17 at 12:07
I mean what programming language are you using the pattern in? — Wiktor Stribiżew, Mar 29 '17 at 12:19
There are some quotation marks missing. Why findall? I need to split the sentence into tokens. — Jürgen K., Mar 29 '17 at 12:23
Sorry, the double-quoted string literal must be used; I edited the comment. It does *tokenize* the string. Just test it and you will see. `\w+(?:'\w+)?` matches every chunk of 1+ word chars, optionally followed by a `'` and 1+ more word chars, and `[^\w\s]` matches a single char other than a word or whitespace char. — Wiktor Stribiżew, Mar 29 '17 at 12:26
Well, it works fine, thanks. Could you tell me which expression I need only for (foo) -> ["(", "foo", ")"]? I'm trying to understand what you have done. — Jürgen K., Mar 29 '17 at 12:43
Only for `(foo)` - `re.findall(r'\w+|\W', s)` - match 1 or more word chars (`\w+`), or (`|`) 1 non-word char (`\W`). But if you plan to avoid matching whitespaces (that can be matched with `\W`) you need to exclude them from the pattern using `[^\w\s]`. It is a kind of a contrast principle with exceptions. I will post an answer. — Wiktor Stribiżew, Mar 29 '17 at 12:49
I added two solutions in my answer, if there is anything unclear, please let me know. — Wiktor Stribiżew, Mar 29 '17 at 13:01
Yes: it is not clear what re.findall(r'\w+|\W', s) looks like when it has to avoid whitespace. — Jürgen K., Mar 29 '17 at 13:04
`\W` matches whitespace. To subtract the `\s` from `\W`, you need to convert `\W` to the negated character class `[^\w]` (matching any char but a word char) and add `\s` to it - `[^\w\s]` that matches any char but a word *and* whitespace chars. — Wiktor Stribiżew, Mar 29 '17 at 13:05
No idea why you used just that, see https://ideone.com/RZTxmI. Read my answer below. — Wiktor Stribiżew, Mar 29 '17 at 13:15
What do you mean? It matches `(`, `foo` and `)`. [Look here](https://ideone.com/fS2QIq). — Wiktor Stribiżew, Mar 29 '17 at 13:21

score 6 · Accepted Answer · answered Mar 29 '17 at 12:57

When you need to tokenize a string with a regex and the tokens depend on context, a matching approach (re.findall) usually yields cleaner output than splitting, especially since splitting tends to leave empty elements in the resulting list.

Any word character is matched with \w and any non-word character with \W. To tokenize the string into runs of word and non-word characters, you could use the \w+|\W+ regex. In your case, however, you want to match chunks of word characters that are optionally followed by ' and 1+ more word characters, plus any other single character that is not whitespace.
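
To see why the plain word/non-word patterns are not enough here, a minimal sketch (assuming Python 3; the short sample string and the variable names are only illustrative):

```python
import re

s = "(it's true)"

# Runs of word chars or runs of non-word chars:
# the apostrophe splits "it's", and the space comes back as a token.
coarse = re.findall(r"\w+|\W+", s)
print(coarse)    # ['(', 'it', "'", 's', ' ', 'true', ')']

# Excluding whitespace on the non-word side drops the space token,
# but the apostrophe is still split off from the word.
no_space = re.findall(r"\w+|[^\w\s]", s)
print(no_space)  # ['(', 'it', "'", 's', 'true', ')']
```

Neither variant keeps it's together, which is what the optional '\w+ part is for.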

Use

re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

Here, \w+(?:'\w+)? matches words like people or people's, and [^\w\s] matches a single character that is neither a word character nor whitespace.

See the regex demo

Python demo:

import re
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
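
If the pattern is hard to read in one line, the same regex can be spelled out with re.VERBOSE. This is only a sketch of the idea; the name TOKEN_RE is my own choice, not part of any API:

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
TOKEN_RE = re.compile(
    r"""
    \w+          # a run of one or more word characters
    (?:'\w+)?    # optionally an apostrophe plus more word characters
    |            # or
    [^\w\s]      # one character that is neither word nor whitespace
    """,
    re.VERBOSE,
)

tokens = TOKEN_RE.findall("I don't like to eat Cici's food (it is true)")
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```

Note that the case of I is preserved; if you need lowercase tokens, lowercase the string before or after matching.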

Another option that treats only ( and ) as separate tokens:

[^()\s]+|[()]

See the regex demo

Here, [^()\s]+ matches 1 or more symbols other than (, ) and whitespace, and [()] matches either ( or ).
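
A short Python sketch of this second pattern (PAREN_RE and the second sample string are illustrative assumptions):

```python
import re

PAREN_RE = re.compile(r"[^()\s]+|[()]")

s = "I don't like to eat Cici's food (it is true)"
paren_tokens = PAREN_RE.findall(s)
print(paren_tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

# Only parentheses are split off; other punctuation stays glued to its word.
print(PAREN_RE.findall("(yes, really)"))  # ['(', 'yes,', 'really', ')']
```

So this variant is the better fit only when parentheses are the sole punctuation you want as separate tokens.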

score 0 · Answer 2 · answered Mar 29 '17 at 12:04

You should separate the single-character tokens (the parentheses in this particular case) from the characters that form a token as a run:

([().]|['\w]+)

Demo: https://regex101.com/r/RQfYhL/2
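
In Python this could be used as follows (a sketch, assuming re.findall; the name GROUP_RE and the sample strings are illustrative). With a single capturing group, findall returns the text of that group for each match:

```python
import re

GROUP_RE = re.compile(r"([().]|['\w]+)")

s = "I don't like to eat Cici's food (it is true)."
group_tokens = GROUP_RE.findall(s)
print(group_tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')', '.']

# Characters outside both classes (a comma, for instance) are skipped silently.
print(GROUP_RE.findall("a, b"))  # ['a', 'b']
```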

Dmitry Egorov
