I have a large string (about 8,000 characters) and I want to do two things:

  1. Detect sentence boundaries such as '.', and
  2. Keep every sentence at 600 characters or fewer.

I know that in some cases it will not be possible to satisfy both. In that case, the sentence should be split at a space.

This solution by ridgerunner for condition 1 worked like a charm (see the original link: http://goo.gl/PqI6d), but it often outputs sentences longer than 600 characters. Any ideas? Thanks in advance!

  • Check whether this regex is what you want: `/(?:[^.]{1,20}(?: |\.)|\w{20,}(?: |\.)?)/`. You can change `20` to `600` to fit your case. Test case: `This is a short sentence. This is a very very very very very very long long long long long long sentence. Andthisisaverylongwordwithoutspaces.` – nhahtdh Jul 09 '12 at 05:49

2 Answers

You would probably be better off matching the sentences rather than splitting the string. The regex for the match could look like the following:

(.{0,600}?\.)|(.{0,600}(?=\ ))

In short, you first look for the shortest possible string that ends in a period. If there is none, you look for the longest possible string that is followed by a space. The next match then picks up where the previous one left off.

Note that this is generic regex syntax; your PHP implementation may vary.
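
In PHP, the matching approach described above could be sketched with `preg_match_all`. This is only a sketch under a few assumptions: the function name `chunk_by_match` and the `$limit` parameter are made up for illustration, and the `s` flag is added so that `.` also matches newlines. Two caveats follow from the regex itself: a chunk ending in a period can be one character over the limit (the period comes after the `{0,limit}` run), and a tail that contains neither a period nor a following space is not matched at all.

```php
<?php
// Hypothetical helper: $limit stands in for the 600 used in the regex above.
function chunk_by_match(string $text, int $limit): array
{
    // First alternative: shortest run that ends in a period.
    // Second alternative: longest run that is followed by a space.
    $pattern = '/(.{0,' . $limit . '}?\.)|(.{0,' . $limit . '}(?=\ ))/s';

    // $matches[0] holds the full text of every match, in order.
    preg_match_all($pattern, $text, $matches);

    // Each match after the first may start with the space it resumed at,
    // so trim the chunks and drop any empty matches.
    $chunks = array_map('trim', $matches[0]);
    return array_values(array_filter($chunks, 'strlen'));
}

// A small limit keeps the example readable; the question would use 600.
$demo = 'Short one. This sentence is much longer than twenty characters. End.';
print_r(chunk_by_match($demo, 20));
```

With a limit of 20, the short sentences come back whole and the long one is cut at spaces into "This sentence is", "much longer than", and "twenty characters.".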

Arithmomaniac

Thanks, nhahtdh. Please check whether I'm missing something. Below is an excerpt from my string and the output I get using your suggestion.

<?php 
    $ptn = "/(?:[^.]{1,600}(?: |\.)|\w{600,}(?: |\.)?)/";
    $str = "Amblyopia occurs when the nerve pathway from one eye to the brain does not develop during childhood. This occurs because the abnormal eye sends a blurred image or the wrong image to the brain. This confuses the brain, and the brain may learn to ignore the image from the weaker eye. Strabismus is the most common cause of amblyopia. There is often a family history of this condition. The term \"lazy eye\" refers to amblyopia, which often occurs along with strabismus. However, amblyopia can occur without strabismus and people can have strabismus without amblyopia.First, any eye condition that is causing poor vision in the amblyopic eye (such as cataracts) needs to be corrected. Children with a refractive error (nearsightedness, farsightedness, or astigmatism) will need glasses. Next, a patch is placed on the normal eye. This forces the brain to recognize the image from the eye with amblyopia. Sometimes, drops are used to blur the vision of the normal eye instead of putting a patch on it. Children whose vision will not fully recover, and those with only good eye due to any disorder should wear glasses with protective polycarbonate lenses. Polycarbonate glasses are shatter- and scratch-resistant. Children who get treated before age 5 will usually recover almost completely normal vision, although they may continue to have problems with depth perception. Delaying treatment can result in permanent vision problems. After age 10, only a partial recovery of vision can be expected. Early recognition and treatment of the problem in children can help to prevent permanent visual loss. All children should have a complete eye examination at least once between ages 3 and 5. Special techniques are needed to measure visual acuity in a child who is too young to speak. Most eye care professionals can perform these techniques.";
    $result = preg_split($ptn, $str, -1, PREG_SPLIT_NO_EMPTY);
    print_r($result);
    ?>

Result: I need sentences from my string that are no longer than 600 characters, but I get this:

 Array
(
[0] => childhood.
[1] => brain.
[2] => eye.
[3] => amblyopia.
[4] => condition.
[5] => strabismus.
[6] => amblyopia.
[7] => corrected.
[8] => glasses.
[9] => eye.
[10] => amblyopia.
[11] => it.
[12] => lenses.
[13] => scratch-resistant.
[14] => perception.
[15] => problems.
[16] => expected.
[17] => loss.
[18] => 5.
[19] => speak.
[20] => techniques
)
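
One thing that may be worth checking here: `preg_split` returns the text between matches and throws the matches away, so a pattern written to match the chunks themselves is normally paired with `preg_match_all` instead. The sketch below uses the pattern from the comment with a small limit of 20 so the behavior is easy to follow. The helper names `sentence_pattern` and `split_sentences` are made up for illustration. Note that `[^.]{1,limit}` is followed by one more space or period, so a chunk can be one character over the limit before trimming.

```php
<?php
// Hypothetical helper: builds the pattern from the comment for a given limit.
function sentence_pattern(int $limit): string
{
    return '/(?:[^.]{1,' . $limit . '}(?: |\.)|\w{' . $limit . ',}(?: |\.)?)/';
}

// Hypothetical helper: collects what the pattern matches instead of splitting on it.
function split_sentences(string $text, int $limit): array
{
    preg_match_all(sentence_pattern($limit), $text, $matches);
    // Matches may carry a leading or trailing space; trim them.
    return array_map('trim', $matches[0]);
}

$text = 'First one. A sentence that runs past twenty characters.';

// Keeps the matched chunks.
print_r(split_sentences($text, 20));

// Treats every chunk as a delimiter; for this input nothing is left over.
print_r(preg_split(sentence_pattern(20), $text, -1, PREG_SPLIT_NO_EMPTY));
```

For this input, `split_sentences` returns "First one.", "A sentence that", "runs past twenty", and "characters.", while the `preg_split` call returns an empty array because the pattern consumes the whole string.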