I have the following regex in Java:

String regex = "[^\\s\\p{L}\\p{N}]";
Pattern p = Pattern.compile(regex);

String phrase = "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
String delimited = p.matcher(phrase).replaceAll("");

Right now this regex removes every character that is not whitespace, a letter, or a number.

Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when youre having fun Cant wait until next summer

The problem is that I want to keep single quotes inside words, such as you're and can't, but remove single quotes that are at the end of a sentence or that surround a word, such as 'hello'. This is what I want:

Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when you're having fun Can't wait until next summer

How can I update my current regex to be able to do this? I need to keep the \p{L} and \p{N} as it has to work for more than one language.

Thanks!

Jimmy Lee
  • `\\p{L}\\p{N}` is likely part of the issue; what is wrong with indicating `"` or `'` individually? – l'L'l Jul 20 '17 at 03:09
  • Because it would remove single quotes altogether. I want to keep them for possessives like "you're" or "johnson's" – Jimmy Lee Jul 20 '17 at 03:11
  • Not entirely... doing a pattern such as `^[\"]|[\\s][\"']|[\"'][\\s]|[\"]$` might work (it could need a bit of adjusting since I'm doing this without testing). Then replace with `\s`... – l'L'l Jul 20 '17 at 03:14
  • For what it is worth "can't" and "you're" are not possessives, they are contractions. The thing they all have in common is a single quote with characters on either side. Possessives may or may not have a character after the single quote. – Kevin Jul 20 '17 at 03:16
  • Just use word boundaries before and after the quote mark. – Dawood ibn Kareem Jul 20 '17 at 03:19
  • @DawoodibnKareem how would I go about that? I'm new to regex. – Jimmy Lee Jul 20 '17 at 03:23
  • @DawoodibnKareem No good. Not counting the ends of the string, a word boundary is a place between a word character and a non-word character (in either order). Both a space and a quote mark are non-word characters. – ajb Jul 20 '17 at 03:24
  • @Kevin Sorry, I was just using those as examples. Exactly, anything with a character, single quote, character I need to keep. – Jimmy Lee Jul 20 '17 at 03:24
  • @ajb Good point. In that case, I'd use lookaheads and lookbehinds. I'll try to find the time to put an answer together. – Dawood ibn Kareem Jul 20 '17 at 03:25
  • Do I need to point out that there are possessive forms that _end_ with an apostrophe, such as _James'_ ? – Dawood ibn Kareem Jul 20 '17 at 03:26
  • You need to consider when the quote is at the very beginning or end of the line as well, so that more or less throws a curve ball into the fray. – l'L'l Jul 20 '17 at 03:26
  • @DawoodibnKareem no don't worry about that scenario. Thank you so much for giving it a shot! – Jimmy Lee Jul 20 '17 at 03:30
  • Don't thank me - I haven't done it yet. This is a really good question, by the way. I can't believe I'm the only person (so far) who has upvoted it. – Dawood ibn Kareem Jul 20 '17 at 03:31
  • And now, I'm not going to bother, because @ajb's excellent answer is more-or-less what I would have written (although I'm sure I wouldn't have done nearly as good a job as ajb has done). Off to buy some fish 'n' chips. – Dawood ibn Kareem Jul 20 '17 at 03:35
  • @DawoodibnKareem If you can come up with a regex that handles "fish 'n' chips" correctly, you win. – ajb Jul 20 '17 at 03:35
  • Hahaha, you guys are great. Thanks ajb for the solutions and Dawood for almost beating him ;) – Jimmy Lee Jul 20 '17 at 03:36
  • @ajb I don't think it can be done. How would you know which quotes to remove from a sentence like `"The 'n' in 'fish 'n' chips' is short for 'and'."` ? – Dawood ibn Kareem Jul 20 '17 at 03:39
  • @DawoodibnKareem The one after "fish", naturally. :) Actually, I think it will be doable once we have a regex syntax that supports NLP. :) – ajb Jul 20 '17 at 04:22
  • This is not a problem that can be completely solved with regex. Regex is acceptable only if you code special cases or can live with occasional incorrect results. – Jim Garrison Jul 20 '17 at 04:43

1 Answer

This should do what you want, or come close:

String regex = "[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))";

The regex has three alternatives separated by |. It will match:

  1. Any character that is not whitespace, a letter, a number, or a single quote mark.
  2. A quote mark, if it is preceded by the beginning of the input or by whitespace (therefore, a quote mark at the beginning of a word). This uses positive lookbehind.
  3. A quote mark, if it is followed by the end of the input or by whitespace (therefore, a quote mark at the end of a word). This uses positive lookahead.
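
As a rough sketch, the three alternatives can be written out separately and joined with |, then run against the phrase from the question. The class name QuoteStripper, the constant names, and the clean helper are made up for illustration; only the regex itself comes from this answer.

```java
import java.util.regex.Pattern;

public class QuoteStripper {
    // Alternative 1: anything that is not whitespace, a letter, a number, or a single quote.
    static final String NOT_KEPT = "[^\\s\\p{L}\\p{N}']";
    // Alternative 2: a single quote preceded by the start of the input or by whitespace.
    static final String LEADING_QUOTE = "(?<=(^|\\s))'";
    // Alternative 3: a single quote followed by the end of the input or by whitespace.
    static final String TRAILING_QUOTE = "'(?=($|\\s))";

    static final Pattern STRIP =
            Pattern.compile(NOT_KEPT + "|" + LEADING_QUOTE + "|" + TRAILING_QUOTE);

    // Hypothetical helper: deletes every match of the combined pattern.
    static String clean(String phrase) {
        return STRIP.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        String phrase = "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
        // Brackets make leftover whitespace visible.
        System.out.println("[" + clean(phrase) + "]");
    }
}
```

Note that whitespace is never removed, so the space before the deleted ":)" survives as a trailing space; calling trim() on the result is one way to drop it if that matters.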

It works on the example you give. Where it might not work the way you want is if you have a word with a quote mark on one side, but not the other: "'Tis a shame that we couldn't visit James' house". Since the lookbehind and lookahead only inspect the character immediately before or after the quote, and do not check whether (say) a quote mark at the beginning of a word is matched by a quote mark at the end of the word, the regex will delete the quote marks on 'Tis and James'.
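
To see that limitation concretely, here is a small sketch that runs the same regex on the one-sided-quote sentence and on "fish 'n' chips" from the comments. The class name OneSidedQuoteDemo and the clean helper are hypothetical. The third input is an extra case, not from the original discussion: a closing quote followed directly by punctuation is not followed by whitespace, so it is left in place.

```java
import java.util.regex.Pattern;

public class OneSidedQuoteDemo {
    // Regex copied from the answer above.
    static final Pattern STRIP =
            Pattern.compile("[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))");

    // Hypothetical helper: deletes every match of the pattern.
    static String clean(String s) {
        return STRIP.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Apostrophes on 'Tis and James' are removed; the one in couldn't is kept.
        System.out.println(clean("'Tis a shame that we couldn't visit James' house"));
        // prints: Tis a shame that we couldn't visit James house

        // Both quotes around n sit next to spaces, so both are removed.
        System.out.println(clean("fish 'n' chips"));
        // prints: fish n chips

        // Extra case (an assumption about inputs you may see): the closing quote
        // is followed by a comma, not whitespace, so it survives.
        System.out.println(clean("wait 'until', then"));
        // prints: wait until' then
    }
}
```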

ajb