
I need to split a string on any non-alphanumeric character except / and -. For example, in preg_split():

/[^a-zA-Z0-9\/\-]/

This works great, but now I want to split the string at all these points except when the characters are found in a URL (i.e. I want to keep the URL together). I consider a URL to be a whitespace-delimited substring that starts with http:// or https://. In other words:

My string. https://my-url.com?q=3 More strings.

Should get split into:

[0] My
[1] string
[2] https://my-url.com?q=3
[3] More
[4] strings

I've tried some naive approaches like /[^a-zA-Z0-9\/\-(https?\:\/\/.\s)]+/ but, unfortunately, I don't know how to do this outside a character class, which obviously is not giving me the results I want.

I am using PHP for now, and I'm hoping to just use preg_split() but I am open to better, more comprehensive ways than this.

Matt

1 Answer


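You can't just stuff things into the character class: everything inside it is treated as individual characters, so (https?\:\/\/.\s) merely adds the characters (, h, t, p, s, ?, :, /, ., whitespace and ) to the set. What you would want is a negative lookbehind that ensures there is no https?:// before your match (separated only by non-whitespace characters). But that lookbehind would need unbounded length, which PCRE (the engine behind PHP's preg_* functions) does not support; .NET is one of the few flavors that does. You could reverse the input, the pattern and the result to work around this, but that's a bit overkill. Just go from splitting to matching: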

preg_match_all('~https?://\S*|[a-zA-Z0-9/-]+~', $input, $matches);

Now $matches[0] will contain your desired array.
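As a quick check, here is a minimal sketch that runs the pattern against the sample string from the question. The variable name $input and the print_r() call are only for illustration.

```php
<?php
// Sample input taken from the question.
$input = 'My string. https://my-url.com?q=3 More strings.';

// First alternative: a whole URL (http:// or https:// up to the next whitespace).
// Second alternative: a run of letters, digits, "/" and "-".
preg_match_all('~https?://\S*|[a-zA-Z0-9/-]+~', $input, $matches);

// $matches[0] holds the full matches in order of appearance:
// My, string, https://my-url.com?q=3, More, strings
print_r($matches[0]);
```

The order of the alternatives matters: if the character class came first, it would consume "https" at the start of the URL before the URL alternative got a chance to match.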


Note that you can use pretty much any character as the delimiter, as long as it is not alphanumeric, a backslash or whitespace (I used ~ above). This comes in handy if you have loads of forward slashes, because you don't have to escape them. You also don't need to escape the hyphen if it's the last character in a character class; whether you do or not in that case is a matter of taste.
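To illustrate the point about delimiters and the trailing hyphen, the following sketch writes the same pattern three ways and compares the results. The variable names ($tilde, $slash, $hash) are made up for this example.

```php
<?php
$input = 'My string. https://my-url.com?q=3 More strings.';

// "~" as delimiter: forward slashes in the pattern need no escaping.
preg_match_all('~https?://\S*|[a-zA-Z0-9/-]+~', $input, $tilde);

// "/" as delimiter: every literal slash has to be written as "\/".
preg_match_all('/https?:\/\/\S*|[a-zA-Z0-9\/-]+/', $input, $slash);

// "#" as delimiter, with the trailing hyphen escaped (optional in that position).
preg_match_all('#https?://\S*|[a-zA-Z0-9/\-]+#', $input, $hash);

// All three variants produce the same list of matches.
var_dump($tilde[0] === $slash[0] && $slash[0] === $hash[0]); // bool(true)
```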

Martin Ender
    Woah! After 5+ years of PHP work I just learned a sublime truth: `preg_split()` breaks the string where the regex matches, and `preg_match_all()` breaks the string where the regex DOESN'T match. This is the function I'm looking for... and it's much simpler. Thanks. – Matt Apr 17 '13 at 19:11
    @Matt that's an interesting way to look at it. But it really gets interesting when you consider how captures work in both cases: captures in `preg_match_all` give you substrings of the other strings you get returned (ignoring lookarounds) and `preg_split` gives you substrings of the stuff that **isn't** returned (if you use `PREG_SPLIT_DELIM_CAPTURE`) ;) – Martin Ender Apr 17 '13 at 19:18
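A small sketch of the capture behavior described in the comment above, using a toy input ('a1-b2') and toy patterns that are not from the original thread:

```php
<?php
$input = 'a1-b2';

// preg_match_all: the capture group is a substring of each returned match.
preg_match_all('~[a-z](\d)~', $input, $m);
// $m[0] is ['a1', 'b2'] (full matches), $m[1] is ['1', '2'] (captures).

// preg_split: the capture comes from the delimiter, which is normally dropped.
// PREG_SPLIT_DELIM_CAPTURE puts the captured part back into the result.
$parts = preg_split('~(-)~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
// $parts is ['a1', '-', 'b2'].

print_r($m);
print_r($parts);
```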