2

I have been scratching my head on this but cannot work out a solution.

Let's say you have a text of 5000 characters. I would like to split it into blocks of fewer than 500 characters without breaking any sentence. For example, if the last sentence of a block starts at character 450 and ends at character 550, I would like to cut that block at 450 characters so that no sentence is broken.

Any idea how to achieve this please?

My goal is to save each block into an array so I can work on them separately.

I was thinking about using preg_split, summing the lengths of the resulting sentences, and, once the sum goes above 500 characters, dropping the last sentence added. But I find it difficult to separate the sentences without mistakes.

Any idea which preg_split pattern I should use to make sure every sentence is separated correctly?

I tried to use this tool but cannot get it to give me the right output: https://www.phpliveregex.com/#tab-preg-split

Thanks

Benny
  • 2
    Have you figured out how to split sentences in general? If so, please update that into your post. (There are quite a few edge cases with abbreviations etc. that will produce incorrect splits.) – Markus AO Mar 26 '22 at 12:59
  • Clearly, neither regex nor a direct string approach is the way to split a text into sentences: `IntlBreakIterator` is the class you need. – Casimir et Hippolyte Apr 21 '22 at 12:16
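
To make the comment above concrete, here is a minimal sketch of sentence splitting with `IntlBreakIterator`. It assumes the PHP intl extension is installed; the function name `icuSentences` and the default locale `en_US` are made up for illustration. Exactly where the breaks fall depends on the ICU rules for the locale, so abbreviations may still be split.

```php
<?php
// Hypothetical helper: split text into sentences with ICU's sentence break rules.
// Assumes the intl extension is loaded; the locale default is only an example.
function icuSentences(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part); // each part keeps its trailing whitespace
        if ($part !== '') {
            $sentences[] = $part;
        }
    }

    return $sentences;
}

print_r(icuSentences('One. Two! Three?'));
```

The resulting array can then be fed into whatever loop groups sentences into blocks.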

3 Answers

1

First of all: Thank you for the nice question!

This solution is not fully robust, and you will probably have to adjust it later, but it shows one possible way to achieve this.

Split your text into individual sentences and save each sentence as an element of an array. That way you can check the length of each sentence while iterating over the array. As long as the current sentence plus the text collected so far stays within the maximum block length, append the sentence to a temporary variable. As soon as the length of the temporary variable plus the length of the current sentence exceeds the maximum block length, store the temporary variable in a new array as a block and start the next block with the current sentence.

<?php
$txt = "111. 222 222. 333 333 333. 444 444 444 444. 555 555 555 555 555. 333 333 333. 222 222. 111.";

$length = 30;
$arr = explode(". ", $txt);
$b = [];
$tmp = '';

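foreach($arr as $k => $s) {
    // Does the sentence still fit into the current block?
    // (the ". " appended after the current sentence is not counted)
    if (strlen($s) + strlen($tmp) <= $length) {
        $tmp = $tmp . $s . '. ';
    } else {
        // Block is full: store it and start a new one with this sentence
        $b[] = $tmp;
        $tmp = $s . '. ';
    }

    // Last sentence: store the remaining block without the extra ". "
    if ((count($arr) - 1) === $k) {
        $b[] = substr($tmp, 0, -2);
    }
}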

print_r($arr);
print_r($b);

Output

// Sentence Array
Array
(
    [0] => 111
    [1] => 222 222
    [2] => 333 333 333
    [3] => 444 444 444 444
    [4] => 555 555 555 555 555
    [5] => 333 333 333
    [6] => 222 222
    [7] => 111.
)

// Your new Block Array
Array
(
    [0] => 111. 222 222. 333 333 333. 
    [1] => 444 444 444 444.
    [2] => 555 555 555 555 555.
    [3] => 333 333 333. 222 222. 111.
)

Maik Lowrey
  • 1
    Thank you so much for your effort, I will try your idea in the mid of the week(sick at the moment). But what I initially thought would be simple gave me some problems :-) I really did not think it would be so tricky! – Benny Mar 28 '22 at 09:39
  • @Benny Yes it is definitely not easy. The weak point will be to find an exact delimitation of the sentence. Here, a regex is definitely better than a simple explode. Because sentences can also end with `?`, `!` , `.` and `...`. That's why I noted in my answer that it doesn't run stably. Please keep me informed. I am also very interested :-) But first: Get well soon! – Maik Lowrey Mar 28 '22 at 09:49
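
As a rough illustration of the point about `?`, `!`, `.` and `...`, the sketch below splits on whitespace that directly follows one of those characters. The helper name `splitSentences` is hypothetical, and the pattern is an assumption rather than a complete rule: abbreviations such as "e.g. " would still be split in the wrong place.

```php
<?php
// Hypothetical helper: split on whitespace preceded by ., ! or ?
// The lookbehind keeps the punctuation attached to its sentence.
function splitSentences(string $text): array
{
    $parts = preg_split('/(?<=[.!?])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error (for example, invalid UTF-8 input)
    return $parts === false ? [] : $parts;
}

print_r(splitSentences('Is it done? Yes! Wait... Fine.'));
// Array ( [0] => Is it done? [1] => Yes! [2] => Wait... [3] => Fine. )
```

Because the punctuation stays in each element, there is no need to re-append `'. '` when joining sentences back into blocks.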
1

It seems easier to split by sentence first; then you can loop over the sentences and concatenate them until you reach your boundary.

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Basically here, I try to find everything before a `.`

$cleaned = array_filter(array_map('trim', $splited));

var_dump($cleaned);

This gives:

array(22) {
  [1]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [3]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [5]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [7]=>
  string(42) "Elementum facilisis leo vel fringilla est."
  [9]=>
  string(27) "Sem et tortor consequat id."
  [11]=>
  string(44) "Eleifend donec pretium vulputate sapien nec."
  [13]=>
  string(43) "Elit pellentesque habitant morbi tristique."
  [15]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [17]=>
  string(40) "Quis commodo odio aenean sed adipiscing."
  [19]=>
  string(53) "Id volutpat lacus laoreet non curabitur gravida arcu."
  [21]=>
  string(40) "Sit amet massa vitae tortor condimentum."
  [23]=>
  string(49) "Morbi blandit cursus risus at ultrices mi tempus."
  [25]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [27]=>
  string(38) "Urna et pharetra pharetra massa massa."
  [29]=>
  string(32) "Ut consequat semper viverra nam."
  [31]=>
  string(47) "Hac habitasse platea dictumst quisque sagittis."
  [33]=>
  string(46) "Commodo odio aenean sed adipiscing diam donec."
  [35]=>
  string(45) "Imperdiet proin fermentum leo vel orci porta."
  [37]=>
  string(40) "Quisque non tellus orci ac auctor augue."
  [39]=>
  string(37) "In cursus turpis massa tincidunt dui."
  [41]=>
  string(38) "Purus faucibus ornare suspendisse sed."
  [43]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}

Quick update for Maik ;)

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Basically here, I try to find everything before a `.`

$cleaned = array_filter(array_map('trim', $splited));

$lines = [];
$current = '';
$min = 50;

foreach ($cleaned as $sentence) {
  $current .= $sentence . ' '; // Trailing space so another sentence can be appended
  $len_current = strlen($current);

  if ($len_current >= $min) {
    array_push($lines, trim($current)); // As we add an extra space, we remove it when adding to the lines

    $current = '';
  }
}

// Keep any remaining sentences that never reached $min
if ($current !== '') {
  array_push($lines, trim($current));
}

var_dump($lines);

Looks like this

array(14) {
  [0]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [1]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [2]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [3]=>
  string(70) "Elementum facilisis leo vel fringilla est. Sem et tortor consequat id."
  [4]=>
  string(88) "Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique."
  [5]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [6]=>
  string(94) "Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu."
  [7]=>
  string(90) "Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus."
  [8]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [9]=>
  string(71) "Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam."
  [10]=>
  string(94) "Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec."
  [11]=>
  string(86) "Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue."
  [12]=>
  string(76) "In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed."
  [13]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
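
Note that the loop above closes a block once it reaches a minimum length, whereas the question asks for a maximum. A possible variant for the maximum case is sketched below; `packSentences` and `$max` are hypothetical names, lengths are counted in bytes with `strlen` (for multibyte text, `mb_strlen` from the mbstring extension would likely be the better choice), and a single sentence longer than `$max` is kept whole rather than cut.

```php
<?php
// Hypothetical helper: group sentences into blocks of at most $max bytes
// without ever cutting a sentence.
function packSentences(array $sentences, int $max): array
{
    $blocks = [];
    $current = '';

    foreach ($sentences as $sentence) {
        // What the block would look like with this sentence appended
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        if (strlen($candidate) <= $max) {
            $current = $candidate;
            continue;
        }

        // The sentence does not fit: close the current block, if any
        if ($current !== '') {
            $blocks[] = $current;
        }
        // Start a new block; it may exceed $max if the sentence alone is too long
        $current = $sentence;
    }

    // Store the last, still open block
    if ($current !== '') {
        $blocks[] = $current;
    }

    return $blocks;
}

print_r(packSentences(['111.', '222 222.', '333 333 333.', '444 444 444 444.'], 30));
// Array ( [0] => 111. 222 222. 333 333 333. [1] => 444 444 444 444. )
```

With the array from above this would be called as `packSentences($cleaned, 500)`.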
Joel
-1

I think you need this:

$string = "Hello world php is fun";
$array = explode(" ", $string);

Output:

Array ( [0] => Hello [1] => world [2] => php [3] => is [4] => fun )
RIZI
  • 1
    Can you explain how this answers the OP's question? – The fourth bird Mar 26 '22 at 12:35
  • What is OP'S I don't Understand you? – RIZI Mar 26 '22 at 12:39
  • OP is the Original Poster. The question is about splitting text into blocks of less than 500 characters with unbroken sentences. Your code splits on a space, which does not answer the question right? – The fourth bird Mar 26 '22 at 12:45
  • ok got it. Can OP's provide an example, please? – RIZI Mar 26 '22 at 12:47
  • 1
    Now that is a good question, and that should be added in the comment section under the question. There you might also ask other questions like for example updating the question with some code that the OP tried to improve the quality of the question. – The fourth bird Mar 26 '22 at 12:54