I'm trying to sort an array of sentences into groups based on how similar the words in each sentence are. I want the user to be able to specify how strict or loose the grouping should be, so the K-means clustering algorithm looks like a good fit, since you can specify the number of groups. However, I can't find an example of it being used on sentences, although I believe it is possible.

Here is my code so far:

require 'vendor/autoload.php';

use Phpml\Clustering\KMeans;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;

// Define the sentences
$sentences = [
    "This is the first sentence.",
    "The second sentence is here.",
    "Here is the third sentence.",
    "And the fourth sentence is here too.",
    "This is a different sentence.",
    "Another unique sentence here."
];

// Tokenize the sentences
$tokenizer = new WhitespaceTokenizer();
$tokenizedSentences = [];
foreach ($sentences as $sentence) {
    $tokens = $tokenizer->tokenize($sentence);
    $tokenizedSentences[] = $tokens;
}

// Vectorize the sentences
$vectorizer = new TokenCountVectorizer($tokenizer);
$vectorizedSentences = $vectorizer->fitTransform($tokenizedSentences);

// Perform K-means clustering
$kmeans = new KMeans(2);
$clusters = $kmeans->cluster($vectorizedSentences);

// Output the clusters
foreach ($clusters as $clusterId => $cluster) {
    echo "Cluster " . ($clusterId + 1) . ":\n";
    foreach ($cluster as $index) {
        echo "- " . $sentences[$index] . "\n";
    }
    echo "\n";
}

I can't figure out how to create the vectorized sentences, because the one example I've found seems to use a function that has never existed in TokenCountVectorizer.php:

https://gitlab.com/php-ai/php-ml/-/blob/master/src/FeatureExtraction/TokenCountVectorizer.php?ref_type=heads
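
To make clear what I mean by a vectorized sentence: as far as I understand it, each sentence should become a row of word counts over a vocabulary shared by all sentences, and those rows are what K-means would cluster. Below is a minimal plain-PHP sketch of that representation. The `countVectors` helper is my own illustration and not part of php-ml, and the lowercasing and punctuation stripping are assumptions about how the tokens should be normalised.

```php
<?php
// Hypothetical helper (not part of php-ml): builds token-count vectors by hand.
function countVectors(array $sentences): array
{
    $vocabulary = []; // token => column index
    $tokenized = [];

    foreach ($sentences as $sentence) {
        // Lowercase, then split on anything that is not a letter or digit.
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
        $tokenized[] = $tokens;

        // Assign the next free column to every token not seen before.
        foreach ($tokens as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }

    // One row per sentence, one column per vocabulary entry.
    $vectors = [];
    foreach ($tokenized as $tokens) {
        $vector = array_fill(0, count($vocabulary), 0);
        foreach ($tokens as $token) {
            $vector[$vocabulary[$token]]++;
        }
        $vectors[] = $vector;
    }

    return [$vocabulary, $vectors];
}

[$vocabulary, $vectors] = countVectors([
    "This is the first sentence.",
    "The second sentence is here.",
]);
```

With these two sentences the vocabulary has seven entries, so each sentence becomes a seven-element array of counts. That is the shape of input I'm trying to get out of TokenCountVectorizer.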

Mr J