I have this text:

A man’s jacket is of green color. He – the biggest star in modern history – rides bikes very fast (230 km per hour). How is it possible?! What kind of bike is he using? The semi-automatic gear of his bike, which is quite expensive, significantly helps to reach that speed. Some (or maybe many) claim that he is the fastest in the world! “I saw him ride the bike!” Mr. John Deer speaks. “The speed he sets is 133.78 kilometers per hour,” which sounds incredible; sounds deceiving.

I want to have the following resulting array:

words[1] = "A"
words[2] = "man's"
words[3] = "jacket"
...
words[n+1] = "color"
words[n+2] = "."
words[n+3] = "He"
words[n+4] = "-"
words[n+5] = "the"
...

This array should contain every word and every punctuation mark as a separate element. Can this be done with a regular expression? Could anyone help me compose it? Thanks!

EDIT: As requested, here is my work so far. I'm processing the text with the following code, but I'd like to do the same with a regex:

$text = explode(' ', $this->rawText);
$marks = Array('.', ',', '?', '!', ':', ';', '-', '--', '...');
for ($i = 0, $j = 0; $i < sizeof($text); $i++, $j++) {
    $skip = false;
    // check whether the word contains a punctuation mark
    foreach ($marks as $value) {
        $markPosition = strpos($text[$i], $value);
        // if it does, separate the punctuation mark from the word
        if ($markPosition !== FALSE) {
            // check the position of the punctuation mark: if it's 0, it's probably a standalone mark such as a dash
            if ($markPosition === 0) {
                // add the separate mark to the array
                $words[$j] = new Word($j, $text[$i], 2, $this->phpMorphy);
            } else {
                $words[$j] = new Word($j, substr($text[$i], 0, strlen($text[$i]) - 1), 0, $this->phpMorphy);
                //add separate mark to array
                $punctMark = substr($text[$i], -1);
                $j += 1;
                $words[$j] = new Word($j, $punctMark, 1, $this->phpMorphy);
            }
            $skip = true;
            break;
        }
    }
    if (!$skip) {
        $words[$j] = new Word($j, $text[$i], 0, $this->phpMorphy);
    }
}
Max Koretskyi
  • You should post your attempt at solving the issue – AlexP Nov 04 '13 at 12:58
  • http://stackoverflow.com/questions/16137575/preg-split-regex-for-splitting-a-sentence-into-words-and-punctuation-marks – Patrick Geyer Nov 04 '13 at 12:58
  • Should a sentence ending in `?!` yield separate `?` and `!` elements, or a single `?!`? Should quotes (like `"` or `'`) be included? If `'` should be included, what about a case like the `man's` in your text? – Alma Do Nov 04 '13 at 13:00
  • You should at least try something and then ask it here. – Tomás Nov 04 '13 at 13:00
  • @AlmaDo, yes, `?` and `!` should be separated. Quotes like `"` should be included; the `'` quote should be omitted. – Max Koretskyi Nov 04 '13 at 14:13

2 Answers

The following will split on your specific text.

$words = preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?()-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $text);

See working demo
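
A note on how the pattern above works, plus one possible alternative. The pattern has three zero-width alternatives: `(?<=\s)` splits after every whitespace character (so the whitespace stays attached to the preceding element), `(?<=\w)(?=[.,:;!?()-])` splits between a word character and one of the listed marks, and `(?<=[.,!()?\x{201C}])(?=[^ ])` splits after one of those marks when a non-space character follows. If elements without whitespace are wanted, one option is to match tokens with preg_match_all instead of splitting. The sketch below is not the regex from this answer: the function name `tokenize` is hypothetical, and keeping `man’s`, `semi-automatic` and `133.78` whole is an assumption about the desired output.

```php
<?php
// Hypothetical helper (not part of the answer): match tokens instead of splitting.
function tokenize(string $text): array
{
    $pattern = '/\.{3}'                                     // an ASCII ellipsis as one token
        . '|[\p{L}\p{N}]+(?:[\'\x{2019}.-][\p{L}\p{N}]+)*'  // a word or number; inner ' ’ . - allowed (assumption)
        . '|[^\s\p{L}\p{N}]/u';                             // any other single non-space character
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(tokenize('How is it possible?! He rides (230 km per hour).'));
```

Whitespace, including chr(10), is skipped because none of the alternatives matches it. Further single-character marks need no change, since the last alternative accepts any character that is neither whitespace nor alphanumeric.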

hwnd
  • Thanks, that's almost correct. But `(`, `)`, and `"` should also be separate array elements, as should each character of `?!`. Every punctuation mark has to be placed in its own array element. – Max Koretskyi Nov 04 '13 at 14:36
  • That's fantastic, thank you very much! The only thing is that the array contains spaces, for example in the 8th and 29th elements. I can probably iterate over the array and remove the elements with spaces, but if you could improve your solution it would be great. Also, any chance you could elaborate on what exactly the regex you suggested does? – Max Koretskyi Nov 05 '13 at 05:42
  • Amazing, thanks! Can you also please tell me, if I needed to add some other punctuation marks to be separated, where in the regex would I put them? – Max Koretskyi Nov 05 '13 at 19:00
  • What do you want to add? – hwnd Nov 05 '13 at 20:02
  • Nothing right now, but I'd like to know how I could do this if I needed it :) – Max Koretskyi Nov 06 '13 at 05:56
  • I've thought about it a bit and would probably also like to add the colon (`:`) and the ellipsis (`...`) – Max Koretskyi Nov 06 '13 at 06:12
  • Show me example data. – hwnd Nov 06 '13 at 06:18
  • For example the sentence `I want the following items: butter, sugar, and flour...` – Max Koretskyi Nov 06 '13 at 07:13
  • Can you please help me? I need to include chr(10) in the regexp too. How can I do that? – Max Koretskyi Nov 11 '13 at 10:59

Try using preg_split. Put the punctuation marks of your choice inside the square brackets [ and ].

<?php
$str="A man’s jacket is of green color. He – the biggest star in modern history – rides bikes very fast (230 km per hour). How is it possible?! What kind of bike is he using? The semi-automatic gear of his bike, which is quite expensive, significantly helps to reach that speed. Some (or maybe many) claim that he is the fastest in the world! “I saw him ride the bike!” Mr. John Deer speaks. “The speed he sets is 133.78 kilometers per hour,” which sounds incredible; sounds deceiving.";

$keywords=preg_split("/[-,. ]/", $str);

print_r($keywords);

OUTPUT:

Array ( [0] => A [1] => man’s [2] => jacket [3] => is [4] => of [5] => green [6] => color [7] => [8] => He [9] => – [10] => the [11] => biggest [12] => star [13] => in [14] => modern [15] => history [16] => –

Message truncated to prevent abuse of resources ... Shankar ;)
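
Because the question asks for the punctuation marks to appear in the result, a variant of this idea is to capture the marks rather than discard them. The sketch below relies on PHP's `PREG_SPLIT_DELIM_CAPTURE` and `PREG_SPLIT_NO_EMPTY` flags; the function name `splitKeepingMarks` and the chosen set of marks are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: split on whitespace or on one listed mark, keeping the mark.
function splitKeepingMarks(string $text): array
{
    // The capturing group holds the marks to keep; add further marks inside [ ].
    // Whitespace (\s+) sits outside the group, so it is discarded.
    $pattern = '/([.,:;!?()])|\s+/u';
    return preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

print_r(splitKeepingMarks('green color. He rides (230 km per hour).'));
```

Note that this treats every dot as a mark, so a decimal such as 133.78 comes back as three elements (`133`, `.`, `78`).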

Shankar Narayana Damodaran
  • It seems that the dot should be in the output array, so splitting by it wouldn't make sense. Also don't feed the helpvampires – HamZa Nov 04 '13 at 13:47