8

I'm trying to remove all symbols from a string except for #, @, :) and :(. Example:

this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)

should result in (after removing the matched results):

this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)

I tried:

(?! |#|@|:\)|:\()\W

It works, but in the case of :) and :(, the : is still being matched. I know that it matches because it checks every character and the previous ones, e.g. :) matches only : but :)) matches :).

Emma
mahmoudafer
    Can you provide an example string from which you want to remove/to keep certain characters? – PinkBanter May 11 '19 at 15:21
    You could just extract those sequences instead of selecting everything else. – ssc-hrep3 May 11 '19 at 15:24
    You do not actually need to use lookarounds in case you know exactly your exceptions. Use capturing mechanism, see [this answer](https://stackoverflow.com/a/56093282/3832970) showing how. – Wiktor Stribiżew May 11 '19 at 18:45

4 Answers

7

This is a tricky question, because you want to remove all symbols except for a certain whitelist. In addition, some of the symbols on the whitelist actually consist of two characters:

:)
:(

To handle this, we can first spare both colon : and parentheses, then selectively remove either one should it not be part of a smiley or frown face:

import re

text = "this is, a (placeholder text). I wanna remove symbols like: ! and ? but keep @ & # & :)"
# Remove non-whitelisted symbols, lone colons, and lone parentheses
output = re.sub(r'[^\w\s:()@&#]|:(?![()])|(?<!:)[()]', '', text)
print(output)

this is a placeholder text I wanna remove symbols like  and  but keep @ & # & :)

The regex character class I used was:

[^\w\s:()@&#]

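This matches any character that is not a word character, whitespace, or one of the spared symbols (:, (, ), @, & and #), so your whitelist survives the replacement. The other two parts of the alternation then override this logic for the colon and parentheses: :(?![()]) removes a colon that is not followed by a parenthesis, and (?<!:)[()] removes a parenthesis that is not preceded by a colon, so only the ones that form a smiley or frown face remain.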
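
To see what each branch does, here is a minimal sketch that builds the same pattern from its three parts. The constant names and the strip_symbols helper are made up for illustration; only the regex itself comes from the code above.

```python
import re

# The three branches of the alternation (names are illustrative only).
NOT_WHITELISTED = r'[^\w\s:()@&#]'  # any symbol outside the whitelist
LONE_COLON = r':(?![()])'           # a colon not followed by ( or )
LONE_PAREN = r'(?<!:)[()]'          # a parenthesis not preceded by a colon

pattern = re.compile('|'.join([NOT_WHITELISTED, LONE_COLON, LONE_PAREN]))

def strip_symbols(text):
    # Delete every match of any of the three branches
    return pattern.sub('', text)

print(strip_symbols("a: b (c) :) :( d!"))
# a b c :) :( d
```

Note that a run such as :)) is reduced to :), because the second parenthesis is not directly preceded by a colon.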

Tim Biegeleisen
5

As others have shown, it is possible to write a regex that will succeed the way you have framed the problem. But this is a case where it's much simpler to write a regex to match what you want to keep. Then just join those parts together.

import re

rgx = re.compile(r'\w|\s|@|&|#|:\)|:\(')
orig = 'Blah!! Blah.... ### .... #@:):):) @@ Blah! Blah??? :):)#'
new = ''.join(rgx.findall(orig))
print(new)
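
As a quick check, the same keep-and-join idea can be run against the example from the question. This is only a sketch: the keep_only name is hypothetical, and & is left out of the pattern here because the expected output in the question drops it.

```python
import re

# Everything to keep: word chars, whitespace, @, #, and the two faces.
KEEP = re.compile(r'\w|\s|@|#|:\)|:\(')

def keep_only(text):
    # findall returns the kept pieces in order; joining them rebuilds the string
    return ''.join(KEEP.findall(text))

question = "this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)"
print(keep_only(question))
# this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)
```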
FMc
2

You can try the following regex (for Python).

(\w|:\)|:\(|#|@| )

With this fake sentence:

"I want to remove certain characters but want to keep certain ones like #random, and :) and :( and something like @.

If it is found in another sentence, :), do search it :( "

It matches every character and sequence you mentioned in the question, along with word characters and spaces. You can use it to find the parts of the string you want to keep, and then write rules to carefully remove the remaining punctuation.
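
One possible way to use this pattern, assuming that joining the matches is what you want: because the regex has a single capturing group, re.findall returns the captured pieces. Note that the pattern keeps only literal spaces, so newlines in the input are dropped. The sample text below is a shortened, made-up variant of the sentence above.

```python
import re

pattern = re.compile(r'(\w|:\)|:\(|#|@| )')

text = "keep #random, and :) and :( and @.\nnext line :)"
# findall yields the content of the single group for every match
kept = ''.join(pattern.findall(text))
print(kept)
# keep #random and :) and :( and @next line :)
```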

mahmoudafer
PinkBanter
1

You may also use a simpler approach: match and capture what you want to keep, match without capturing what you want to remove, and then use a backreference to the capture group as the replacement:

re.sub(r'([#@\s]|:[)(])|\W', r'\1', s)
#        ^---Group 1--^->->->->^^         

See the regex demo. Here, ([#@\s]|:[)(]) matches and captures into Group 1 a #, a @, a whitespace char, or a :) or :( substring, and \W matches any non-word char without capturing it. Since the replacement is \1, captured text is put back and everything else is removed.

See Python demo:

import re
s="this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)"
print(re.sub(r'([#@\s]|:[)(])|\W', r'\1', s))
# => this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)

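In Python versions before 3.5, use a lambda expression as the replacement argument, because a backreference to a group that did not participate in the match raises an error there instead of being replaced with an empty string: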

re.sub(r'([#@\s]|:[)(])|\W', lambda x: x.group(1) if x.group(1) else '', s)
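
If the list of sequences to keep changes often, the pattern can be built from that list. This is only a sketch: remove_symbols and its keep parameter are hypothetical names, and it uses a replacement function so it does not depend on how unmatched groups are handled.

```python
import re

def remove_symbols(text, keep=(":)", ":(", "#", "@")):
    # Longer tokens first, so a multi-character token is tried before
    # any shorter token that happens to be its prefix.
    tokens = sorted(keep, key=len, reverse=True)
    alternatives = "|".join(re.escape(tok) for tok in tokens)
    # Group 1: things to keep (tokens or whitespace); \W: things to drop.
    pattern = re.compile("(" + alternatives + r"|\s)|\W")
    return pattern.sub(lambda m: m.group(1) or "", text)

s = "this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)"
print(remove_symbols(s))
# this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)
```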
Wiktor Stribiżew