
Okay, I was hoping someone could help me with a little regex-fu.

I am trying to clean up a string.

Basically, I am:

  1. Replacing all characters except A-Za-z0-9 with a replacement.

  2. Replacing consecutive duplicates of the replacement with a single instance of the replacement.

  3. Trimming the replacement from the beginning and end of the string.

Example Input:

(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)@#%&)&^)@#)

Required Output:

The+dog+jumped+over+the+log

I am currently using this very discombobulated code, and I just know there is a much more elegant way to accomplish this:

function clean($string, $replace){

    $ok = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    $ok .= $replace;
    $pattern = "/[^".preg_quote($ok, "/")."]/";

    return trim(preg_replace('/'.preg_quote($replace.$replace).'+/', $replace, preg_replace($pattern, $replace, $string)),$replace);
}

Could a Regex-Fu Master please grace me with a simpler/more efficient solution?


A much better solution suggested and explained by Botond Balázs and hakre:

function clean($string, $replace, $skip=""){
    // Escape $replace and $skip so they are literal inside the character class
    $escaped = preg_quote($replace.$skip, "/");

    // Regex pattern
    // Replace all consecutive occurrences of "Not OK" 
    // characters with the replacement
    $pattern = '/[^A-Za-z0-9'.$escaped.']+/';

    // Execute the regex
    $result = preg_replace($pattern, $replace, $string);

    // Trim and return the result
    return trim($result, $replace);
}
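
One edge case may be worth checking, offered here as a sketch rather than a definitive fix: because `$replace` is part of the allowed class above, a replacement character that already sits next to a stripped run in the input is kept instead of merged, so `a +b` would come out as `a++b`. The variant below leaves `$replace` out of the class so it is absorbed into the run. The name `clean_merged` is hypothetical (it does not come from the answers), and the sketch assumes `$replace` is not itself alphanumeric.

```php
<?php
// Hypothetical variant of clean(): only $skip is added to the allowed class,
// so a literal $replace in the input is merged into the surrounding run.
function clean_merged($string, $replace, $skip = "") {
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $escaped = preg_quote($skip, "/");

    // One or more characters that are neither alphanumeric nor listed in $skip.
    $pattern = '/[^A-Za-z0-9' . $escaped . ']+/';

    $result = preg_replace($pattern, $replace, $string);

    // trim() treats its second argument as a list of characters to strip.
    return trim($result, $replace);
}

echo clean_merged("a +b", "+"), "\n";                    // a+b
echo clean_merged("--The dog_jumped!!", "+", "_"), "\n"; // The+dog_jumped
```

Whether an existing replacement character should be merged or preserved depends on the input you expect, so treat this as one option.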
Samantha P

2 Answers


I'm not a "regex ninja" but here's how I would do it.

function clean($string, $replace){
    // Remove all "not OK" characters from the beginning and the end:
    $result = preg_replace('/^[^A-Za-z0-9]+/', '', $string);
    $result = preg_replace('/[^A-Za-z0-9]+$/', '', $result);

    // Replace all consecutive occurrences of "not OK" 
    // characters with the replacement:
    $result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);

    return $result;
}

I guess this could be simplified further, but when dealing with regexes, clarity and readability are often more important than being clever or writing super-optimal code.

Let's see how it works:

  • /^[^A-Za-z0-9]+/:
    • ^ matches the beginning of the string.
    • [^A-Za-z0-9] matches any single character that is not an ASCII letter or digit
    • + means "match one or more of the previous thing"
  • /[^A-Za-z0-9]+$/:
    • same thing as above, except $ anchors the match to the end of the string
  • /[^A-Za-z0-9]+/:
    • same thing as above, but without an anchor, so it matches a run anywhere in the string
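
To make the three patterns above concrete, here is a small sketch that applies them one after another to a made-up input (the string and the `$step` variable names are only illustrative):

```php
<?php
$input = "**The dog  jumped!!";

// Anchored at the start: only the leading run is removed.
$step1 = preg_replace('/^[^A-Za-z0-9]+/', '', $input);
echo $step1, "\n"; // The dog  jumped!!

// Anchored at the end: only the trailing run is removed.
$step2 = preg_replace('/[^A-Za-z0-9]+$/', '', $step1);
echo $step2, "\n"; // The dog  jumped

// No anchor: every remaining run is collapsed into one replacement.
$step3 = preg_replace('/[^A-Za-z0-9]+/', '+', $step2);
echo $step3, "\n"; // The+dog+jumped
```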

EDIT: OP is right that the first two can be replaced with a call to trim():

function clean($string, $replace){
    // Replace all consecutive occurrences of "not OK" 
    // characters with the replacement:
    $result = preg_replace('/[^A-Za-z0-9]+/', $replace, $string);

    return trim($result, $replace);
}
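
As a quick sanity check, this is roughly how the trimmed version behaves on the example input from the question. The function name `clean_with_trim` is hypothetical and only there so the snippet is self-contained:

```php
<?php
// Hypothetical stand-alone copy of the trimmed version, for checking only.
function clean_with_trim($string, $replace) {
    // Collapse every run of non-alphanumeric characters into one $replace.
    $result = preg_replace('/[^A-Za-z0-9]+/', $replace, $string);

    // trim() treats $replace as a list of characters to strip from both ends.
    return trim($result, $replace);
}

$input = '(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)@#%&)&^)@#)';
echo clean_with_trim($input, '+'), "\n"; // The+dog+jumped+over+the+log
```

Because trim() works on a character list, a multi-character `$replace` would have each of its characters stripped individually from the ends.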
Botond Balázs

I don't want to sound super-clever, but I would not call it regex-fu.

What you are doing is actually pretty much in the right direction, because you use preg_quote; many people are not even aware of that function.

However, it is probably used in the wrong place: you are quoting characters that go inside a character class, and a character class has similar but not identical quoting rules to the rest of a regex.
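
For what it's worth, here is a sketch of what preg_quote produces when its output is dropped into a character class. The `$skip` value and the sample string are made up for illustration. In this particular case the extra backslashes appear to be harmless, since PCRE reads a backslash followed by a non-alphanumeric character as that literal character:

```php
<?php
// Hypothetical "extra OK" characters supplied at run time.
$skip = "-_.";

// preg_quote() escapes "-" and "." here; "_" is left alone.
$escaped = preg_quote($skip, "/");
echo $escaped, "\n"; // \-_\.

// Inside [...] the escaped "-" is a literal hyphen, not a range operator.
$pattern = '/[^A-Za-z0-9' . $escaped . ']+/';
$cleaned = preg_replace($pattern, '+', 'file-name_v1.0 (final)');
echo $cleaned, "\n"; // file-name_v1.0+final+
```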

Additionally, regular expressions were designed with a case like yours in mind. That is probably the part where you are looking for a wizard, so let's look at some options for making your negated character class more compact (I leave the pattern-generation code out to make this more visible):

[^0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]

There are constructs like 0-9, A-Z and a-z that represent exactly that. As you can see, - is a special character inside a character class: it is not taken literally but denotes a range from one character to another:

[^0-9A-Za-z]

So that is already more compact and matches the same characters. There are also shorthands like \d and \w, which might be handy in your case (note that \w also matches the underscore). But I will stay with the first variant for now, because I think it makes what it does pretty visible.
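
A short sketch of how the shorthand differs from the explicit class when the input contains an underscore. The sample string is made up, and `[\W_]` is just one possible way to treat the underscore as "not OK" as well:

```php
<?php
$input = "snake_case and spaces";

// \W is "not a word character"; the underscore counts as a word character.
$a = preg_replace('/\W+/', '+', $input);
echo $a, "\n"; // snake_case+and+spaces

// The explicit class replaces the underscore too.
$b = preg_replace('/[^0-9A-Za-z]+/', '+', $input);
echo $b, "\n"; // snake+case+and+spaces

// Adding "_" next to \W inside a class gives the same result for this input.
$c = preg_replace('/[\W_]+/', '+', $input);
echo $c, "\n"; // snake+case+and+spaces
```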

The other part is the repetition. There is +, which means "one or more". You want to replace one or more consecutive characters that are not in the allowed set, so you add + after the part that should match one or more times (by default it is greedy, so if there are 5 such characters in a row, all 5 are taken, not 4):

[^0-9A-Za-z]+
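
A minimal illustration of what the quantifier changes, using a made-up input: without + each offending character is replaced on its own, while with + the whole run becomes a single replacement.

```php
<?php
$input = "a!!!b";

// Without the quantifier: one replacement per character.
$single = preg_replace('/[^0-9A-Za-z]/', '+', $input);
echo $single, "\n"; // a+++b

// With the quantifier: one replacement per run.
$run = preg_replace('/[^0-9A-Za-z]+/', '+', $input);
echo $run, "\n"; // a+b
```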

I hope this is helpful. Another step would be to also just drop the non-matching characters at the beginning and end, but it's early in the morning and I'm not that fluent with that.

hakre
  • Great explanation. For the asker, I recommend reading the book "Mastering Regular Expressions". It was a real eye-opener for me. – Botond Balázs Nov 18 '12 at 10:48
  • @BotondBalázs: Very true. As an online resource, I find http://www.regular-expressions.info/ not bad as well. Even the PHP manual on the regex syntax has improved nowadays; it was a little sparse in the past: http://www.php.net/manual/en/pcre.pattern.php – hakre Nov 18 '12 at 10:53
  • As an online (and free) alternative to RegexBuddy, I recommend http://gskinner.com/RegExr/ - though nothing beats RegexBuddy in terms of features :) – Botond Balázs Nov 18 '12 at 10:57
  • Indeed, a very helpful and thorough response. Thank You. About your comment about preg_quote, I need to use it because, as I left out in my question, I have to be able to add "okay" characters on the fly that might be syntax. How and where would it be appropriate to escape with preg_quote? – Samantha P Nov 18 '12 at 10:58
  • As written, `preg_quote` is good to know and make use of. I don't think it will give you any problems; I just wanted to note that in some edge cases it might not quote *exactly* as needed. But that does not have to mean it will introduce problems. – hakre Nov 18 '12 at 11:02
  • @hakre: Did a little looking. Just for future reference: is there an alternative to \W that doesn't include underscores? – Samantha P Nov 18 '12 at 11:39
  • @hakre: I think this is a great question. It should get more votes :) – Botond Balázs Nov 18 '12 at 11:46