1

I am trying to split a string on non-alphanumeric characters; simply put, I want to split it into words. The first approach that came to mind was to use regular expressions.

Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);

But there are two problems that I see with this approach.

  1. It is not a native PHP function; it depends entirely on the PCRE library being available on the server.
  2. Equally important: what if a word contains punctuation?
    Example:
    $string = 'U.S.A-men\'s-vote';
    $splitArr = preg_split('/[^a-z0-9]/i', $string);

    This splits the string as [{U}{S}{A}{men}{s}{vote}],
    but I want [{U.S.A}{men's}{vote}].

So my questions are:

  • How can I split the string into words?
  • Is it possible to do this with a native PHP function, or in some other way that does not depend on an extension?

Regards

Jehanzeb.Malik
  • What is your definition of a word? Are periods allowed in it? What about something like `this sentence.and this one too.`? And what about `I am sure this regex is a no-go but I'll use it anyway.` – LeonardChallis Oct 24 '12 at 10:47
  • It depends on what you define as "word". For `U.S.A` to be a word, you'd need a non-space-padded stop mark to not be a word separator. So you could split on whitespaces, question marks, commas, colons, and so on, OR spaced stop marks. – LSerni Oct 24 '12 at 10:47
  • It is possible. Iterate over the string (char by char) and apply your own rules whether the char belongs to a word or not. – Yoshi Oct 24 '12 at 10:48
  • `preg_split` is not native? Please show me a PHP installation since the late 1920s that does not support `preg_split` – Lightness Races in Orbit Oct 24 '12 at 10:52
  • So you're also not using `mysql`/`mysqli`/`PDO`, because these are *extensions*? What about `mb_*`? You just have to be realistic at some point... – deceze Oct 24 '12 at 11:04
  • @LeonardChallis That is the real issue. I am building a search on top of the Google and Bing APIs. The APIs sometimes return URLs that we don't need, such as a link to a page that contains the search term in its content but not in its URL. I only want URLs that contain the search term in the URL itself. The problem is that if a user searches for USA and the URL contains U.S.A, or vice versa, the match fails. I thought there might be a solution where a word is pre-defined in PHP or some other library. I guess there is not. – Jehanzeb.Malik Oct 24 '12 at 11:16
  • "word" is a predefined concept in PHP and other regex libraries: any collection of alphanumeric characters and underscores. In your case you have a non-standard definition that fits your particular problem, which is why there is nothing pre-made that does it. – dan1111 Oct 24 '12 at 11:24
  • @dan1111 Yes I agree with you. But try explaining that to a non-technical client. – Jehanzeb.Malik Oct 24 '12 at 11:26

4 Answers

3

Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.

Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:

preg_split('/[^a-z0-9.\']+/i', $string);

If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:

preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
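
As a rough illustration of how the two patterns above differ, here is a minimal sketch. The sample sentence is an assumption made up for this demonstration (the question's string with a short sentence in front of it), and the variable names are hypothetical.

```php
<?php
// Hypothetical sample input: a sentence ending in ". " followed by the question's string.
$string = "Hello world. U.S.A-men's-vote";

// Pattern 1: dots and apostrophes always count as word characters.
$simple = preg_split('/[^a-z0-9.\']+/i', $string);

// Pattern 2: a dot followed by whitespace is additionally treated as a delimiter.
$contextual = preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);

print_r($simple);     // Hello, world., U.S.A, men's, vote
print_r($contextual); // Hello, world, U.S.A, men's, vote
```

With the first pattern the sentence-ending dot stays attached to "world."; the second pattern removes it while leaving the dots inside "U.S.A" untouched.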
Tim Pietzcker
  • Well, it *is* possible to have a PHP installation without `preg_*` functions enabled. In practice it just doesn't really happen. – deceze Oct 24 '12 at 10:53
3

Sounds like a case for str_word_count(), using the oft-forgotten value of 1 or 2 for the second argument (which makes it return the words as an array instead of a count) and a third argument listing the extra characters to treat as word parts, such as full stops (hyphens and apostrophes can be listed there too). Follow it with an array_walk() that trims those characters from the beginning and end of each resulting value, so they are only kept when they are actually embedded in the "word".
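
A minimal sketch of that idea follows. The sample sentence and variable names are assumptions for illustration. Only the full stop is passed as an extra character here, because str_word_count() already accepts apostrophes and hyphens inside words by default, which also means a hyphenated run such as "men's-vote" would come back as a single word.

```php
<?php
// Hypothetical sample sentence for illustration.
$string = "U.S.A. votes, isn't it?";

// Format 1 returns the words as an array; "." is added to the allowed word characters.
$words = str_word_count($string, 1, '.');

// Trim word-part characters that ended up at the edges, e.g. "U.S.A." becomes "U.S.A".
array_walk($words, function (&$word) {
    $word = trim($word, ".'-");
});

print_r($words); // U.S.A, votes, isn't, it
```

Note that digits are not treated as word characters by str_word_count() unless they are listed in the third argument as well.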

Mark Baker
  • Thanks Mark. I think considering my situation this will give me the closest to best results. Not 100% accurate but almost there. – Jehanzeb.Malik Oct 24 '12 at 11:21
1

As per my comment, you might want to try (add as many separators as needed)

$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).

So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling

they 're 'just friends'. Or that's what they say.

whereas a stray "'re", or a run of words in which the first is left-quoted and the last is right-quoted (the first not being a known contraction suffix such as 's, 're, 'll, 'd ...), can be handled at application level.
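
To make the point concrete, here is a small sketch that runs the split from this answer on the sample sentence and shows the tokens that would then need application-level handling. The variable names are hypothetical.

```php
<?php
// The sample sentence from the answer.
$string = "they 're 'just friends'. Or that's what they say.";

// Split on whitespace and common punctuation, or on a dot followed by whitespace.
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($splitArr);
// they, 're, 'just, friends', Or, that's, what, they, say.
```

The quotes stay attached to "'just" and "friends'", and the final "say." keeps its dot because no whitespace follows it; both are cases a later clean-up pass would have to deal with.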

LSerni
0

This is not a PHP problem, but a logical one.

Words can be joined by a hyphen. Abbreviations can look like short sentences.

You can match your example directly by writing a solution that fits only this particular phrase, but you can't get a solution for all possible phrases. That would require content recognition based on neural computing.

Ron