filter non-alphanumeric "repeating" characters

Question

What's the best way to filter non-alphanumeric "repeating" characters?

I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?

Examples:

...........

*****************

!!!!!!!! 

########### 

------------------

~~~~~~~~~~~~~

Special case patterns:

=*=*=*=*=*=

->->->->

Do they have to be the same character repeating? E.g. would ?!?! not be filtered? What do you want them to be replaced with? What should happen with the special case patterns you listed? — Jacob, Mar 10 '11 at 23:59
- What do you want them to be replaced with? With the same character, just not so many of them: "======" would become "==". The same goes for the special patterns, which would just be reduced too, so "?!?!?!?" would become "?!". — isuelt, Mar 11 '11 at 00:14

score 1 · Answer 1 · 2011-03-11T00:51:58.310

The pattern could be something like this: s/([\W_]|=\*|->)\1+//g
or, if you want to replace each run with just a single instance: s/([\W_]|=\*|->)\1+/$1/g

Edit: any special sequence should probably come first in the alternation. That way, in case you need to make something like == special, it won't be grabbed by [\W_].

So something like s/(==>|=\*|->|[\W_])\1+/$1/g, where the special cases are first.
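Since the question asks for PHP, the last substitution above can be sketched with preg_replace, which replaces every match by default (so no g flag is needed). The function name collapse_repeats and the sample strings are assumptions made for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper name; the pattern is the one given in the answer above.
function collapse_repeats(string $text): string
{
    // Special sequences come first so [\W_] cannot grab their first character.
    // \1+ requires at least one further copy of whatever the group captured.
    return preg_replace('/(==>|=\*|->|[\W_])\1+/', '$1', $text);
}

echo collapse_repeats('Wait.......... what'), "\n"; // Wait. what
echo collapse_repeats('=*=*=*=*=*='), "\n";         // =*=
echo collapse_repeats('go ->->->-> here'), "\n";    // go -> here
```

Note that [\W_] also matches whitespace, so a run of spaces or tabs is collapsed to a single character as well.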

score 1 · Answer 2 · answered Mar 11 '11 at 00:45

Based on @sln's answer:

$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);

Alix Axel

score 0 · Answer 3 · answered Mar 10 '11 at 23:54

preg_replace('~\W+~', '', $str);

zerkms

This is basically what I came up with. If you want to replace with the same character, then `preg_replace('/(\W)\1+/', '$1', $str);` – Jonathan Kuhn Mar 11 '11 at 00:41
`\w` matches `_`, which is non-alphanumeric, so `\W` leaves it in place. – Alix Axel Mar 11 '11 at 00:41
So use `[\W_]` instead of just `\W`. – Jonathan Kuhn Mar 11 '11 at 00:46
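The underscore point raised in these comments is easy to check. A minimal sketch, assuming an arbitrary sample string and variable names chosen for illustration:

```php
<?php
$sample = 'snake_case -- name';

// \W+ skips the underscore, because \w covers letters, digits and "_".
$withoutUnderscore = preg_replace('/\W+/', '', $sample);

// [\W_]+ adds the underscore back into the set of removed characters.
$withUnderscore = preg_replace('/[\W_]+/', '', $sample);

echo $withoutUnderscore, "\n"; // snake_casename
echo $withUnderscore, "\n";    // snakecasename
```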

score 0 · Answer 4 · answered Mar 11 '11 at 02:29

sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
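A minimal sketch of that suggestion, assuming repeated symbols should be collapsed to a single instance, as the asker described in the comments. The function name collapse_symbol_runs is hypothetical.

```php
<?php
// Hypothetical helper: collapses runs of one repeated symbol, leaving whitespace alone.
function collapse_symbol_runs(string $text): string
{
    // [^A-Za-z0-9\s] excludes letters, digits and whitespace,
    // so indentation and runs of spaces are never matched.
    return preg_replace('/([^A-Za-z0-9\s])\1+/', '$1', $text);
}

echo collapse_symbol_runs('a    b!!!!'), "\n"; // a    b!
echo collapse_symbol_runs('x_____y'), "\n";    // x_y

// For comparison, the [\W_] class also collapses the run of spaces.
echo preg_replace('/([\W_])\1+/', '$1', 'a    b!!!!'), "\n"; // a b!
```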

ridgerunner

score 0 · Answer 5 · answered Mar 11 '11 at 02:36

This will filter out all symbols:

$q = preg_replace('/[^A-Za-z0-9 ]/', '', $q);

hozza

score 0 · Answer 6 · answered Mar 11 '11 at 03:59

preg_replace('/([^A-Za-z0-9\s]+)\1+/', '', $str);

will remove repeated patterns of non-alphanumeric, non-whitespace strings.

However, this is bad practice, because the class also matches every non-ASCII character, so repeated European and other international characters in Unicode text will be removed as well.

The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.

You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
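Both ideas can be sketched in PHP. The first helper below is not from the answer: it swaps the ASCII class for Unicode properties, assuming UTF-8 input. The second uses an explicit list, with the characters taken from the examples in the question. Both function names are hypothetical, and both collapse a run to one character rather than deleting it.

```php
<?php
// Hypothetical helper: a Unicode-aware catch-all.
// \p{L} is any letter and \p{N} any number; the "u" modifier makes PCRE
// treat the pattern and the subject as UTF-8.
function collapse_unicode_symbol_runs(string $text): string
{
    return preg_replace('/([^\p{L}\p{N}\s])\1+/u', '$1', $text);
}

// Hypothetical helper: only the listed characters are ever touched.
// The hyphen sits last in the class so it is read literally.
function collapse_listed_runs(string $text): string
{
    return preg_replace('/([.*!#~=-])\1+/', '$1', $text);
}

echo collapse_unicode_symbol_runs('Crééé!!!! Straße....'), "\n"; // Crééé! Straße.
echo collapse_listed_runs('a-----b____c'), "\n";                 // a-b____c
```

With the "u" modifier, preg_replace returns null if the subject is not valid UTF-8, so real input may need validating first.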

Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar code to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.

score 0 · Answer 7 · answered Mar 20 '11 at 10:03

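This works for me: preg_replace('/(.)\1{3,}/i', '', $sourceStr); It removes any character that repeats four or more times in a row (the captured character plus at least three copies). Note that (.) matches letters and digits too, not only symbols.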
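A short sketch of how the quantifier counts: \1{3,} asks for at least three further copies after the captured character, so the shortest run that matches has four characters. The second pattern is an assumed variant, not from the answer, that limits the match to symbols and triggers at three in a row.

```php
<?php
$anyChar = '/(.)\1{3,}/i';

$three   = preg_replace($anyChar, '', 'a!!!b');  // unchanged: only three "!"
$four    = preg_replace($anyChar, '', 'a!!!!b'); // the run of four is removed
$letters = preg_replace($anyChar, '', 'aaaa');   // letters are removed too

// Assumed variant: symbols only, three or more in a row.
$symbolsOnly = '/([^A-Za-z0-9\s])\1{2,}/';
$restricted  = preg_replace($symbolsOnly, '', 'aaaa!!!');

var_dump($three, $four, $letters, $restricted);
```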

Andrey
