I want to find group of words without knowing them before although
with your function I need to provide them before anything
Try this:
mb_internal_encoding('UTF-8');
$joinedExcerpts = implode(".\n", $excerpts);
$sentences = preg_split('/[^\s|\pL]/umi', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);
$wordsSequencesCount = array();
foreach($sentences as $sentence) {
$words = array_map('mb_strtolower',
preg_split('/[^\pL+]/umi', $sentence, -1, PREG_SPLIT_NO_EMPTY));
foreach($words as $index => $word) {
$wordsSequence = '';
foreach(array_slice($words, $index) as $nextWord) {
$wordsSequence .= $wordsSequence ? (' ' . $nextWord) : $nextWord;
if( !isset($wordsSequencesCount[$wordsSequence]) ) {
$wordsSequencesCount[$wordsSequence] = 0;
}
++$wordsSequencesCount[$wordsSequence];
}
}
}
$ngramsCount = array_filter($wordsSequencesCount,
function($count) { return $count > 1; });
I'm assuming you only want repeated group of words.
The ouput of var_dump($ngramsCount);
is:
array (size=11)
'i' => int 3
'i love' => int 2
'love' => int 2
'cheap' => int 3
'cheap red' => int 3
'cheap red apples' => int 3
'red' => int 5
'red apples' => int 5
'apples' => int 6
'are' => int 2
'my' => int 2
The code could be optimized to, for instance, use less memory.