Split string on non-alphanumerics in PHP? Is it possible with php's native function?

Question

I was trying to split a string on non-alphanumeric characters; simply put, I want to split it into words. The approach that immediately came to mind is to use regular expressions.

Example:
$string = 'php_php-php php'; $splitArr = preg_split('/[^a-z0-9]/i', $string);

But there are two problems that I see with this approach.

It is not a native php function, and is totally dependent on the PCRE Library running on server.
An equally important problem: what if a word contains punctuation?
Example:
$string = 'U.S.A-men\'s-vote'; $splitArr = preg_split('/[^a-z0-9]/i', $string);
This will split the string as [{U}{S}{A}{men}{s}{vote}]
But I want it as [{U.S.A}{men's}{vote}]

So my question is that:

How can we split them according to words?
Is there a possibility to do it with php native function or in some other way where we are not dependent?

Regards

What is your definition of a word? Is it allowed to contain periods? What about something like `this sentence.and this one too.`? And what about `I am sure this regex is a no-go but I'll use it anyway.` — LeonardChallis, Oct 24 '12 at 10:47
It depends on what you define as "word". For `U.S.A` to be a word, you'd need a non-space-padded stop mark to not be a word separator. So you could split on whitespaces, question marks, commas, colons, and so on, OR spaced stop marks. — LSerni, Oct 24 '12 at 10:47
It is possible. Iterate over the string (char by char) and apply your own rules whether the char belongs to a word or not. — Yoshi, Oct 24 '12 at 10:48
`preg_split` is not native? Please show me a PHP installation since the late 1920s that does not support `preg_split` — Lightness Races in Orbit, Oct 24 '12 at 10:52
So you're also not using `mysql`/`mysqli`/`PDO`, because these are *extensions*? What about `mb_*`? You just have to be realistic at some point... — deceze, Oct 24 '12 at 11:04
@LeonardChallis That is the real issue. I am building a search on top of the Google and Bing APIs. The APIs sometimes return URLs that we don't need: for example, a link to a page that contains the word I searched for, but not in its URL. I only want URLs that contain the search word in the URL itself. The problem is that if a user searches for USA and the URL contains U.S.A, or vice versa, the match fails. I thought there might be a solution where a word is predefined in PHP or some other library. I guess there is not. — Jehanzeb.Malik, Oct 24 '12 at 11:16
"word" is a predefined concept in PHP and other regex libraries: any collection of alphanumeric characters and underscores. In your case you have a non-standard definition that fits your particular problem, which is why there is nothing pre-made that does it. — dan1111, Oct 24 '12 at 11:24
@dan1111 Yes I agree with you. But try explaining that to a non-technical client. — Jehanzeb.Malik, Oct 24 '12 at 11:26

score 3 · Answer 1 · answered Oct 24 '12 at 10:47

Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.

Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:

preg_split('/[^a-z0-9.\']+/i', $string);

If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:

preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
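As a quick check, here is a minimal sketch that runs both patterns. The first input is the string from the question; the second sentence is made up for illustration and is not from the original post.

```php
<?php
// Variant 1: letters, digits, dots and apostrophes are word characters;
// any run of other characters acts as a delimiter.
$words = preg_split('/[^a-z0-9.\']+/i', "U.S.A-men's-vote");
// $words is ['U.S.A', "men's", 'vote']

// Variant 2: a dot followed by whitespace is also a delimiter,
// so a sentence-ending dot is dropped while the dots inside U.S.A are kept.
$sentence = "I like the U.S.A. It's great"; // illustrative input
$tokens = preg_split('/\.\s+|[^a-z0-9.\']+/i', $sentence);
// $tokens is ['I', 'like', 'the', 'U.S.A', "It's", 'great']
```

Note that a dot at the very end of the input is not followed by whitespace, so it would stay attached to the last token unless it is trimmed separately.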

Well, it *is* possible to have a PHP installation without `preg_*` functions enabled. In practice it just doesn't really happen. — deceze, Oct 24 '12 at 10:53

Mark Baker · Accepted Answer · 2012-10-24T11:03:46.203

3

Sounds like a case for str_word_count(), using the oft-forgotten value of 1 or 2 for the second argument so that it returns the words themselves, and a third argument to include hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word parts) as part of a word. Follow that with an array_walk() to trim those characters from the beginning and end of the resulting array values, so they are only kept when they're actually embedded in the "word".
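A minimal sketch of that idea, under a few assumptions: `split_words()` is a hypothetical helper name, `array_map()` is swapped in for `array_walk()` for brevity, and arrow functions need PHP 7.4 or later. Because `str_word_count()` already keeps hyphens and apostrophes inside a word by default, the sketch first turns hyphens into spaces so that the question's hyphen-delimited string splits the way the asker wants.

```php
<?php
// Hypothetical helper (not a PHP built-in): split text into words,
// keeping punctuation only when it is embedded in a word.
function split_words(string $text): array
{
    // Assumption: hyphens separate words in this input, so replace them
    // with spaces; str_word_count() would otherwise keep them in-word.
    $text = str_replace('-', ' ', $text);

    // Format 1 returns the words as a list. The third argument adds digits
    // and the full stop to the characters allowed inside a word.
    $words = str_word_count($text, 1, '0123456789.');

    // Trim full stops and apostrophes that sit only at the edges of a word.
    $words = array_map(fn (string $w): string => trim($w, ".'"), $words);

    // Drop entries that consisted of punctuation only.
    return array_values(array_filter($words, fn (string $w): bool => $w !== ''));
}

$a = split_words("U.S.A-men's-vote");       // ['U.S.A', "men's", 'vote']
$b = split_words('Hello world. Next one.'); // ['Hello', 'world', 'Next', 'one']
```

Using format 2 instead of 1 would key each word by its byte offset in the string, which can help if the original positions matter.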

edited Oct 24 '12 at 11:03

answered Oct 24 '12 at 10:58

Mark Baker


Thanks Mark. I think considering my situation this will give me the closest to best results. Not 100% accurate but almost there. – Jehanzeb.Malik Oct 24 '12 at 11:21

score 1 · Answer 3 · answered Oct 24 '12 at 10:59

As per my comment, you might want to try (add as many separators as needed)

$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).

So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling

they 're 'just friends'. Or that's what they say.

whereas a standalone "'re", or a run of words in which the first is left-quoted and the last is right-quoted (the first not being a known suffix such as 's, 're, 'll, 'd ...), can be handled at the application level.
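To see what is left for the application level, this short sketch applies the pattern above to the example sentence and shows the raw tokens it produces. Nothing is added here beyond the regex and the sentence from this answer.

```php
<?php
$string = "they 're 'just friends'. Or that's what they say.";

// Split on whitespace and common punctuation, or on a dot followed by whitespace.
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
// $splitArr is:
// ['they', "'re", "'just", "friends'", 'Or', "that's", 'what', 'they', 'say.']
```

The quote marks stay attached to `'just` and `friends'`, and the final `say.` keeps its dot because no whitespace follows it; those are the cases that need separate handling.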

score 0 · Answer 4 · answered Oct 24 '12 at 10:51

This is not a php-problem, but a logical one.

Words can be joined by a hyphen. Abbreviations can look like short sentences.

You can match your example directly with a solution that fits only this particular phrase, but you can't get a solution for all possible phrases. That would require neural-network-based content recognition.
