
I have a sentence:

$text = "word word, dr. word: a.sh. word a.k word?!..";

The special words are "dr.", "a.sh" and "a.k".

This is my code:

$text = "word word, dr. word: a.sh. word a.k word?!..";
$split = preg_split("/[^\w]([\s]+[^\w]|$)/", $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($split);

With this regular expression I get:

Array (
    [0] => word
    [1] => word
    [2] => dr
    [3] => word
    [4] => a.sh
    [5] => word
    [6] => a.k
    [7] => word
)

And I need:

Array (
    [0] => word
    [1] => word
    [2] => dr.   #<----- the period must stay here because "dr." is a special word
    [3] => word
    [4] => a.sh. #<----- the period must stay here because "a.sh" is a special word
    [5] => word
    [6] => a.k
    [7] => word
)

hippietrail
Guno

1 Answer


I think you're going about this backwards. Instead of trying to define a regular expression for what is not a word, define what is a word, and capture every character sequence that matches it.

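// Words that must be kept intact, including their periods.
$special_words = array("dr.", "a.sh.", "a.k");
// Escape regex metacharacters (and the '~' delimiter) in each special word.
array_walk($special_words, function(&$item, $key){ $item = preg_quote($item, '~'); });

// Match a special word or a run of word characters, never starting or ending mid-word.
$regex = '~(?<!\w)(' . implode('|', $special_words) . '|\w+)(?!\w)~';
$str = 'word word, dr. word: a.sh. word a.k word?!..';
preg_match_all($regex, $str, $matches);
var_dump($matches[0]); // full matches: word, word, dr., word, a.sh., word, a.k, word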

The key pieces here are the array of special words, the array_walk call, and the regular expression.

array_walk

This line, right after your array definition, walks through each of your special words and escapes all of the regex special characters (like . and ?), including the delimiter we're going to use later (~). That way, you can define whatever words you like without worrying about how they will affect the regular expression.
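To make the escaping step concrete, here is a minimal sketch of what the special words look like after quoting. It uses array_map rather than the array_walk call from the answer (a swapped-in alternative that returns a new array instead of modifying in place); the variable name $quoted is only for this illustration.

```php
<?php
// Special words from the answer above.
$special_words = array("dr.", "a.sh.", "a.k");

// preg_quote() escapes regex metacharacters; its second argument
// additionally escapes the delimiter we plan to use ('~').
$quoted = array_map(function ($word) {
    return preg_quote($word, '~');
}, $special_words);

print_r($quoted);
```

Each period comes back escaped (dr\., a\.sh\., a\.k), so it matches a literal period instead of "any character".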

Regular Expression.

The regex is actually pretty simple. Implode the special words using a | as glue, then add another pipe and your standard word definition (I chose \w+ because it makes the most sense to me). Surround that giant alternation with parentheses to group it. I also added a lookbehind and a lookahead to ensure we aren't stealing from the middle of a word. Because alternatives are tried left to right, the a in a.sh. won't be split off into its own word: the a.sh. special word captures it first. The exception is something like a.sh.e, where the lookahead rejects the special word (a word character follows it), so the text falls back to \w+ and matches as three separate words: a, sh and e.
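The difference between those two cases can be checked with a short sketch. The sample strings 'see a.sh. here' and 'see a.sh.e here' and the variable names are made up for this illustration; the pattern is built the same way as in the answer.

```php
<?php
$special_words = array("dr.", "a.sh.", "a.k");
$quoted = array_map(function ($w) { return preg_quote($w, '~'); }, $special_words);
$regex = '~(?<!\w)(' . implode('|', $quoted) . '|\w+)(?!\w)~';

// "a.sh." followed by a space: the special-word alternative wins.
preg_match_all($regex, 'see a.sh. here', $whole);

// "a.sh." followed by a word character: the lookahead rejects the
// special word, so the pieces are matched by \w+ instead.
preg_match_all($regex, 'see a.sh.e here', $pieces);

print_r($whole[0]);  // see, a.sh., here
print_r($pieces[0]); // see, a, sh, e, here
```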

Check it out.

FrankieTheKneeMan
  • It works! Thank you! Is it possible to speed up this code? I will check out your code. Thank you again – Guno Aug 08 '13 at 20:34
  • @Guno Speed it up? How long is your list of special words? How long is your string? If either is crazy long, you may wish to look into a lexer rather than a home spun regex solution. As is, it runs in less than a tenth of a second. – FrankieTheKneeMan Aug 08 '13 at 21:00
  • The word list will not be long, but I will call this function many times. Unfortunately this does not work when the text is in Georgian (UTF-8 Unicode) – Guno Aug 08 '13 at 21:11
  • There's no reason it can't. And if your text is going to be in Unicode, for _God's sake_ mention that in your original question. If you add the `u` modifier to the end of your expression, and set your locale appropriately, you can make it work. – FrankieTheKneeMan Aug 08 '13 at 21:28
  • For instance: Here it is with [gibberish in the Cyrillic character set](http://ideone.com/5Xg08x). I would have done it in actual Georgian gibberish, but I couldn't find a Georgian Lorem Ipsum generator. Sorry. – FrankieTheKneeMan Aug 08 '13 at 21:31
  • Now the characters look good, but special words do not work for Georgian words. For example, this: `$text = "საword ა.შ word, dr. word: a.sh. word a.k word?!.."; print_r(divide_a_sentence_into_words($text));` prints this: `Array ( [0] => საword [1] => ა [2] => შ [3] => word [4] => dr. [5] => word [6] => a.sh. [7] => word [8] => a.k [9] => word )` – Guno Aug 08 '13 at 21:42
  • This is wrong: `[1] => ა [2] => შ` – Guno Aug 08 '13 at 21:44
  • This is my function: `function divide_a_sentence_into_words($text){ $special_words = array("dr.", "a.sh.", "a.k", "ა.შ."); array_walk($special_words, function(&$item, $key){$item = preg_quote($item, '~');}); $regex = '~(?<!\w)(' . implode('|', $special_words) . '|\w+)(?!\w)~u'; preg_match_all($regex, $text, $matches); return $matches[0]; }` – Guno Aug 08 '13 at 21:51
  • IT WORKS FINE!!!!! Sorry... THANK YOU! I can't vote for you – Guno Aug 08 '13 at 21:59
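
For later readers, the Unicode case from the comments can be sketched roughly as below. The helper name split_into_words and the sample text are hypothetical. The sketch assumes the source file and the input are UTF-8 encoded, and that the `u` modifier makes \w match non-ASCII letters, which is consistent with "საword" coming out as one word in the output above.

```php
<?php
// Hypothetical helper, modelled on the function posted in the comments.
function split_into_words($text, array $special_words)
{
    // Escape regex metacharacters and the '~' delimiter in each word.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '~');
    }, $special_words);

    // The trailing "u" makes PCRE treat pattern and subject as UTF-8.
    $regex = '~(?<!\w)(' . implode('|', $quoted) . '|\w+)(?!\w)~u';

    preg_match_all($regex, $text, $matches);
    return $matches[0];
}

// The special word is listed with its trailing period, so the text
// must contain that period too for the word to be kept whole.
$words = split_into_words("საword ა.შ. word, dr. word", array("dr.", "ა.შ."));
print_r($words);
```

A special word only matches exactly as listed: with "ა.შ." in the list, "ა.შ" without the final period in the text would still be split into ა and შ.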