Regular expression to match all non-alphanumerics except apostrophes in contractions

Question

I'm trying to tokenize a string of English text such that I can get a sequence of the words without any punctuation, but at the same time I want to leave contractions (like don't and won't) and possessive nouns (like Steve's and Drew's) intact. I'm trying to pull this off using regular expressions, but I'm still new to them.

Basically, I want a regular expression that will match all sequences of non-alphanumeric characters, except for apostrophes that are surrounded by alphanumeric characters, as in the examples above. Is it possible to do this with regular expressions?
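The separator pattern described here can be sketched in Python as follows. This is only one possible reading of the requirement: the name `SEPARATORS`, the helper `tokenize`, and the ASCII-only character classes are assumptions made for illustration, not something given in the question.

```python
import re

# Hypothetical separator pattern (ASCII-only assumption). An apostrophe is
# kept only when it has an alphanumeric character on both sides.
SEPARATORS = re.compile(
    r"(?:[^A-Za-z0-9']"       # anything that is neither alphanumeric nor an apostrophe
    r"|'(?![A-Za-z0-9])"      # an apostrophe not followed by an alphanumeric
    r"|(?<![A-Za-z0-9])')+"   # an apostrophe not preceded by an alphanumeric
)

def tokenize(text):
    """Split on separator runs and drop the empty strings left at the edges."""
    return [token for token in SEPARATORS.split(text) if token]

print(tokenize("Don't touch Steve's 'quoted' words, okay?"))
# ["Don't", 'touch', "Steve's", 'quoted', 'words', 'okay']
```

Under this reading a trailing apostrophe, as in "Ross'", counts as a separator, so that token comes back as "Ross".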

What about [\w']+ or [a-zA-Z']+ to match the words you want, apostrophes included? After you match, how the words are returned depends on your language. – AngelWarrior, Oct 30 '13 at 03:26

Bohemian · Answer 1 · score 0 · 2013-10-30T04:02:07.783

I don't understand what your regex is trying to match, but I think this will match what you want:

(?i)(?<=^|\s)([a-z]+('[a-z]*)?|'[a-z]+)(?=\s|$)

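This matches "words" made of letters that may optionally end with an apostrophe followed by zero or more letters, or that start with an apostrophe followed by one or more letters. Each match must have whitespace, or the start or end of the input, on both sides. It covers the following edge cases: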

Thing
Jack's
Ross'
'tis
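A rough Python check of this pattern is sketched below. Python's `re` module rejects the lookbehind `(?<=^|\s)` because its alternatives differ in width, so the sketch swaps in `(?<!\S)` and `(?!\S)`, which are assumed here to be equivalent for this purpose. The name `WORD` and the non-capturing groups are also illustrative choices, not part of the answer.

```python
import re

# Adapted form of the answer's pattern. The lookarounds are rewritten as
# "no non-whitespace character on either side" (an assumed equivalent), and
# the groups are non-capturing so findall returns whole matches.
WORD = re.compile(r"(?<!\S)(?:[a-z]+(?:'[a-z]*)?|'[a-z]+)(?!\S)", re.IGNORECASE)

print(WORD.findall("Thing Jack's Ross' 'tis"))
# ['Thing', "Jack's", "Ross'", "'tis"]

# A word directly followed by punctuation is skipped entirely.
print(WORD.findall("Steve's dog, Rex"))
# ["Steve's", 'Rex']
```

Because both sides must be whitespace or an end of the input, text with punctuation attached to words would need that punctuation stripped or replaced by spaces first.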

answered Oct 30 '13 at 03:38, edited Oct 30 '13 at 04:02 by Bohemian

score 0 · Answer 2 · answered Oct 30 '13 at 03:40

Your question is not very clear to me, but if I interpreted it correctly, the following regex should do the job:

\b[\w']+\b

regex101 demo
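To see how this pattern behaves on apostrophes at the edge of a word, here is a small Python sketch. The sample sentence and the name `BOUNDED` are made up for illustration.

```python
import re

# \b needs a word character on exactly one side, so an apostrophe at the very
# start or end of a token falls outside the match.
BOUNDED = re.compile(r"\b[\w']+\b")

print(BOUNDED.findall("don't stop, Ross' 'tis"))
# ["don't", 'stop', 'Ross', 'tis']
```

Internal apostrophes survive, while the leading and trailing ones are dropped. Note that `\w` also admits digits and underscores.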

answered Oct 30 '13 at 03:40 by jkshah

This won't match "Ross'", which is the correct spelling for the possessive of a noun ending in "s" (i.e. you don't write "Ross's"), so you can't use `\b` at the end, nor at the front because of "'twas" and "'tis". – Bohemian Oct 30 '13 at 03:59
