I'm trying to tokenize a string of English text such that I can get a sequence of the words without any punctuation, but at the same time I want to leave contractions (like don't and won't) and possessive nouns (like Steve's and Drew's) intact. I'm trying to pull this off using regular expressions, but I'm still new to them.

Basically, I want a regular expression that will match all sequences of non-alphanumeric characters except for apostrophes which are surrounded by alphanumeric characters such as in the examples mentioned previously. Is it possible to do this with regular expressions?

  • What about [\w']+ or [a-zA-Z']+ to match the words you want, apostrophes included? After you match, how those words are returned depends on the details of your language. – AngelWarrior Oct 30 '13 at 03:26

2 Answers

I don't understand what your regex is trying to match, but I think this will match what you want:

(?i)(?<=^|\s)([a-z]+('[a-z]*)?|'[a-z]+)(?=\s|$)

This matches whitespace-delimited "words" that are either letters optionally followed by an apostrophe and zero or more letters, or an apostrophe followed by one or more letters. That covers the following edge cases:

  • Thing
  • Jack's
  • Ross'
  • 'tis
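
As a rough check of the edge cases above, here is a minimal Python sketch. It is an adaptation, not the exact pattern: Python's `re` module requires fixed-width lookbehind, so `(?<!\S)` and `(?!\S)` stand in for `(?<=^|\s)` and `(?=\s|$)`. The names `WORD` and `tokenize` are made up for illustration.

```python
import re

# Adapted pattern: (?<!\S) means "not preceded by a non-space character",
# which covers both start of input and a preceding whitespace character.
# (?!\S) does the same job on the right-hand side.
WORD = re.compile(
    r"(?<!\S)(?:[a-z]+(?:'[a-z]*)?|'[a-z]+)(?!\S)",
    re.IGNORECASE,
)

def tokenize(text):
    """Return the whitespace-delimited words that fit the pattern."""
    return [m.group(0) for m in WORD.finditer(text)]

print(tokenize("Thing Jack's Ross' 'tis"))
# ['Thing', "Jack's", "Ross'", "'tis"]
```

One thing to keep in mind: because both sides must border whitespace or the ends of the input, a word that touches other punctuation (for example `don't,`) is skipped entirely rather than returned without the comma.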
Bohemian

Your question is not very clear to me, but if I interpreted it correctly, the following regex should do the job:

\b[\w']+\b

regex101 demo
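
A quick way to see what this pattern returns is to run it through Python's `re.findall`. This is only a sketch; the sample sentence and the variable names are invented for illustration.

```python
import re

# \b needs a word character on one side, so an apostrophe at the very
# start or end of a token cannot be part of the match.
pattern = re.compile(r"\b[\w']+\b")

tokens = pattern.findall("Steve's dog won't bark; 'tis Ross' cat.")
print(tokens)
# ["Steve's", 'dog', "won't", 'bark', 'tis', 'Ross', 'cat']
```

Apostrophes inside a word survive, while leading and trailing ones (`'tis`, `Ross'`) are dropped along with the other punctuation.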

jkshah
  • This won't match "Ross'", which is the correct spelling for the possessive of a noun ending in "s" (i.e. you don't write "Ross's"), so you can't use `\b` at the end, nor at the front because of "'twas" and "'tis". – Bohemian Oct 30 '13 at 03:59