1

What's the best way to filter non-alphanumeric "repeating" characters

I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?

Examples:

...........

*****************

!!!!!!!! 

########### 

------------------

~~~~~~~~~~~~~

Special case patterns:

=*=*=*=*=*=

->->->->
isuelt
  • Do they have to be the same character repeating? E.g., would ?!?! not be filtered? What do you want them to be replaced with? What should happen with the special case patterns you listed? – Jacob Mar 10 '11 at 23:59
  • They would be replaced with the same characters, just not so many of them: "======" would become "==". The same goes for the special patterns, which would just be reduced too, so "?!?!?!?" would become "?!". – isuelt Mar 11 '11 at 00:14

7 Answers

1

The pattern could be something like this: s/([\W_]|=\*|->)\1+//g
or, if you want to replace each run with just a single instance: s/([\W_]|=\*|->)\1+/$1/g

Edit: any special sequence should probably come first in the alternation. That way, if you need to treat something like == as special, it won't be grabbed by [\W_] first.

So something like s/(==>|=\*|->|[\W_])\1+/$1/g, where the special cases come first.
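Since the question asks for PHP and the patterns above use Perl's s/// notation, here is a minimal sketch of how the same idea might look with preg_replace. The function name collapse_repeats is hypothetical and only used for illustration; the pattern keeps the multi-character cases ahead of [\W_] as described above.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function collapse_repeats(string $text): string
{
    // Multi-character special cases come first so that [\W_] does not
    // consume their first character on its own.
    // \1+ requires at least one repetition; the whole run is replaced
    // by a single copy of the captured unit.
    return preg_replace('/(=\*|->|[\W_])\1+/', '$1', $text);
}

echo collapse_repeats('Title ======'), "\n";  // Title =
echo collapse_repeats('->->->->'), "\n";      // ->
echo collapse_repeats('=*=*=*=*=*='), "\n";   // =*=
```

Note that the trailing "=" in the last example survives because it is not followed by another "*", so it is not part of a repeated "=*" unit.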

1

Based on @sln's answer:

$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);
Alix Axel
0
preg_replace('~\W+~', '', $str);
zerkms
0

sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
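To illustrate the difference, here is a small sketch comparing the two classes on the same input. The helper name collapse_symbol_runs and the sample string are assumptions made up for this example.

```php
<?php
// Hypothetical helper, named here only for illustration.
function collapse_symbol_runs(string $text): string
{
    // Negated class: anything that is not an ASCII letter, digit, or whitespace.
    return preg_replace('/([^A-Za-z0-9\s])\1+/', '$1', $text);
}

// Four spaces after "Wait!!!!" and two tabs after "what".
$line = "Wait!!!!    what\t\t***now***";

// With [\W_], the runs of spaces and tabs are collapsed as well.
$withNonWord = preg_replace('/([\W_])\1+/', '$1', $line);
// $withNonWord is "Wait! what\t*now*"

// With the negated class, whitespace runs are left alone.
$withNegated = collapse_symbol_runs($line);
// $withNegated is "Wait!    what\t\t*now*"
```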

ridgerunner
0

This will filter out all symbols

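$q = preg_replace('/[^A-Za-z0-9 ]/', '', $q);

(Originally written with ereg_replace, which was deprecated in PHP 5.3 and removed in PHP 7; preg_replace with delimiters does the same job here.)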

hozza
0
replace(/([^A-Za-z0-9\s]+)\1+/g, "")

will remove repeated patterns of non-alphanumeric non-whitespace strings.

However, this is bad practice because the character class treats every non-ASCII character as a symbol, so you'll also be removing repeated accented European letters and other international characters from the Unicode range.

The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.

You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
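As an alternative to an explicit list, one way to address the internationalization concern might be PCRE's Unicode property classes. The sketch below assumes the input is valid UTF-8; the function name collapse_symbol_runs_utf8 is hypothetical.

```php
<?php
// Hypothetical helper; assumes $text is valid UTF-8.
function collapse_symbol_runs_utf8(string $text): string
{
    // \p{L} matches a letter and \p{N} a number in any script.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/([^\p{L}\p{N}\s])\1+/u', '$1', $text);

    // preg_replace() returns null on error (for example invalid UTF-8);
    // fall back to the untouched input in that case.
    return $result ?? $text;
}

echo collapse_symbol_runs_utf8('<<!!!!--- GREAT MUSIC STATION ---!!!!>>'), "\n";
// <!- GREAT MUSIC STATION -!>

echo collapse_symbol_runs_utf8('ああああ!!!!'), "\n";
// ああああ!
```

Repeated letters in any script are left alone, while runs of punctuation are reduced to a single character.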

Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar code to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.

Stephen Chung
0

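This works for me: preg_replace('/(.)\1{3,}/i', '', $sourceStr); It removes any character that appears 4 or more times in a row (the character itself plus at least 3 repeats). Note that (.) matches letters and digits too, not only symbols.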
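A variation on this idea, sketched under the assumption that only symbols should be affected and that the threshold should be adjustable. The function name strip_long_runs and the $minRun parameter are hypothetical.

```php
<?php
// Hypothetical helper: removes runs of the same symbol that are at least
// $minRun characters long. Letters, digits, and whitespace are untouched.
function strip_long_runs(string $text, int $minRun = 4): string
{
    // The capture group accounts for one character, so the backreference
    // must repeat $minRun - 1 more times.
    $repeats = max(1, $minRun - 1);
    $pattern = '/([^A-Za-z0-9\s])\1{' . $repeats . ',}/';

    return preg_replace($pattern, '', $text);
}

echo strip_long_runs('ok!!! wow!!!!'), "\n";     // ok!!! wow
echo strip_long_runs('ok!!! wow!!!!', 3), "\n";  // ok wow
echo strip_long_runs('aaaa bbbb'), "\n";         // aaaa bbbb

// For comparison, the original pattern also removes repeated letters:
echo preg_replace('/(.)\1{3,}/i', '', 'aaaa!!!!'), "\n";  // prints an empty line
```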

Andrey