-1

I have a PHP array:

$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);

I would like to find all the n-grams in these lines to get a result like this:

  • cheap red apples: 3
  • red apples: 5
  • apples: 6

I tried imploding the array and parsing the result, but that doesn't work: the concatenation creates spurious n-grams that span unrelated strings.

How would you proceed?

Crisoforo Gaspar
mattspain
  • 1
    To proceed, I would look up n-gram algorithms, and then decide which would be appropriate to implement on this data set. First call: [wikipedia on N-grams](http://en.wikipedia.org/wiki/N-gram). – i alarmed alien Oct 19 '14 at 22:14
  • Thanks for your suggestion, this is what I did, but I needed any solution or at least concrete examples which would give me the final output I provided. – mattspain Oct 20 '14 at 11:42

4 Answers

7

I want to find groups of words without knowing them beforehand, whereas with your function I need to provide them first

Try this:

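mb_internal_encoding('UTF-8');

// Join with ".\n" so every excerpt ends with a sentence delimiter.
$joinedExcerpts = implode(".\n", $excerpts);
// Split into sentences on any run of characters that is neither whitespace nor a letter.
// (Inside a character class, "|" and "+" are literals, so they are not used there.)
$sentences = preg_split('/[^\s\pL]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

$wordsSequencesCount = array();
foreach ($sentences as $sentence) {
    // Lowercased words of the current sentence.
    $words = array_map('mb_strtolower',
                       preg_split('/[^\pL]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $index => $word) {
        $wordsSequence = '';
        // Grow the sequence one word at a time and count each sequence starting at $index.
        foreach (array_slice($words, $index) as $nextWord) {
            $wordsSequence .= ($wordsSequence !== '') ? (' ' . $nextWord) : $nextWord;
            if (!isset($wordsSequencesCount[$wordsSequence])) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}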

$ngramsCount = array_filter($wordsSequencesCount,
                            function($count) { return $count > 1; });

I'm assuming you only want repeated groups of words. The output of var_dump($ngramsCount); is:

array (size=11)
  'i' => int 3
  'i love' => int 2
  'love' => int 2
  'cheap' => int 3
  'cheap red' => int 3
  'cheap red apples' => int 3
  'red' => int 5
  'red apples' => int 5
  'apples' => int 6
  'are' => int 2
  'my' => int 2

The code could be optimized to, for instance, use less memory.
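One possible way to cut memory use, sketched here under assumptions that are not part of the original answer: cap the n-gram length and process each excerpt on its own instead of imploding first. The function name `countNgrams` and the `$maxWords` parameter are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: counts repeated word n-grams of at most $maxWords words.
function countNgrams(array $excerpts, $maxWords = 3)
{
    $counts = array();
    foreach ($excerpts as $excerpt) {
        // Split on punctuation first so n-grams never cross a sentence boundary.
        $sentences = preg_split('/[^\s\pL]+/u', $excerpt, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($sentences as $sentence) {
            $words = preg_split('/[^\pL]+/u', mb_strtolower($sentence, 'UTF-8'),
                                -1, PREG_SPLIT_NO_EMPTY);
            $total = count($words);
            for ($start = 0; $start < $total; $start++) {
                // Never build sequences longer than $maxWords.
                $limit = min($maxWords, $total - $start);
                for ($length = 1; $length <= $limit; $length++) {
                    $ngram = implode(' ', array_slice($words, $start, $length));
                    $counts[$ngram] = isset($counts[$ngram]) ? $counts[$ngram] + 1 : 1;
                }
            }
        }
    }
    // Keep only the n-grams that occur more than once.
    return array_filter($counts, function ($count) { return $count > 1; });
}

// The excerpts from the question.
$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);
$ngramsCount = countNgrams($excerpts, 3);
```

With a cap of three words, this should give the same eleven entries as above for the sample data, because none of the repeated sequences there is longer than three words. A larger data set would need a cap chosen to suit it.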

Pedro Amaral Couto
1

The code provided by Pedro Amaral Couto above is very good. Since I use it for French, I modified the regular expression as follows:

$sentences = preg_split('/[^\s\pL\'’-]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

This way, we can analyze the words containing hyphens and apostrophes ("est-ce que", "j'ai", etc.)
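The line above only changes the sentence split. The word-level pattern in the original code would still cut words at hyphens and apostrophes, so it probably needs the same characters. A minimal sketch, assuming such words should be kept whole (the sample sentence and variable names are illustrative only):

```php
<?php
// Illustrative French sample; not taken from the question.
$sentence = "Est-ce que j'ai raison";

// The hyphen is placed last in the class so it is read as a literal, not as a range.
$words = preg_split('/[^\pL\'’-]+/u', mb_strtolower($sentence, 'UTF-8'),
                    -1, PREG_SPLIT_NO_EMPTY);
// $words is now array('est-ce', 'que', "j'ai", 'raison')
```

Placing the hyphen last matters on newer PHP versions built against PCRE2, where a hyphen directly after `\pL` inside a class can be rejected as an invalid range.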

Community
0

Try this (using implode, since that's what you mentioned trying):

$ngrams = array(
    'cheap red apples',
    'red apples',
    'apples',
);

$joinedExcerpts = implode("\n", $excerpts);
$nGramsCount = array_fill_keys($ngrams, 0);
var_dump($ngrams, $joinedExcerpts);
foreach($ngrams as $ngram) {
    $regex = '/(?:^|[^\pL])(' . preg_quote($ngram, '/') . ')(?:$|[^\pL])/umi';
    $nGramsCount[$ngram] = preg_match_all($regex, $joinedExcerpts);
}
Pedro Amaral Couto
  • The point is: I want to find group of words without knowing them before although with your function I need to provide them before anything. Thanks anyway for the help. – mattspain Oct 20 '14 at 11:44
  • Sorry, I misunderstood the question. Should the groups of words "I", "I love" and "are" be considered n-grams, and should groups of words that are not repeated be ignored ("Do", "Do you", etc.)? – Pedro Amaral Couto Oct 20 '14 at 12:05
-1

Assuming you just want to count the number of occurrences of a string:

$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
foreach ($excerpts as $excerpt) {
    // Patterns need delimiters; the i flag also matches "Cheap" in the second excerpt.
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpt);
    $redAppleCount += preg_match_all('/red apples/i', $excerpt);
    $appleCount += preg_match_all('/apples/i', $excerpt);
}

preg_match_all returns the number of matches in a given string so you can just add the number of matches onto a counter.

See the PHP documentation for preg_match_all for more information.

Apologies if I misunderstood.

user1849060
  • 1
    I guess the OP probably wants to find all n-grams in any set of strings, not just those three in those particular strings. :\ – i alarmed alien Oct 19 '14 at 22:27
  • I want to find group of words without knowing them before, and unfortunately this doesn't meet my requirements. Thanks anyway for the help. – mattspain Oct 20 '14 at 11:41