I'm trying to sort an array of sentences into groups based on how similar the words in each sentence are. I want the user to be able to specify how strict or loose the grouping should be, so K-means clustering looks like a good fit because you can specify the number of groups. However, I can't find an example of it being used for sentences, although I believe it is possible.
Here is my code so far:
require 'vendor/autoload.php';
use Phpml\Clustering\KMeans;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
// Define the sentences
$sentences = [
    "This is the first sentence.",
    "The second sentence is here.",
    "Here is the third sentence.",
    "And the fourth sentence is here too.",
    "This is a different sentence.",
    "Another unique sentence here."
];
// Tokenize the sentences
$tokenizer = new WhitespaceTokenizer();
$tokenizedSentences = [];
foreach ($sentences as $sentence) {
    $tokens = $tokenizer->tokenize($sentence);
    $tokenizedSentences[] = $tokens;
}
// Vectorize the sentences
$vectorizer = new TokenCountVectorizer($tokenizer);
$vectorizedSentences = $vectorizer->fitTransform($tokenizedSentences);
// Perform K-means clustering
$kmeans = new KMeans(2);
$clusters = $kmeans->cluster($vectorizedSentences);
// Output the clusters
foreach ($clusters as $clusterId => $cluster) {
    echo "Cluster " . ($clusterId + 1) . ":\n";
    foreach ($cluster as $index) {
        echo "- " . $sentences[$index] . "\n";
    }
    echo "\n";
}
I can't quite figure out how to create the vectorized sentences, because the one example I've found seems to use a function that has never existed in TokenCountVectorizer.php.
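For reference, here is a minimal sketch of how the vectorizing step might be written. It rests on three assumptions about the installed php-ml version that are worth checking against the library source: that TokenCountVectorizer has separate fit() and transform() methods rather than a combined call, that transform() takes the array by reference and overwrites it with count vectors, and that KMeans::cluster() keeps the input array keys on the points it returns. The raw sentence strings are passed in directly, since the vectorizer is constructed with a tokenizer and so should not need pre-tokenized input.

```php
<?php
require 'vendor/autoload.php';

use Phpml\Clustering\KMeans;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;

$sentences = [
    "This is the first sentence.",
    "The second sentence is here.",
    "Here is the third sentence.",
    "And the fourth sentence is here too.",
    "This is a different sentence.",
    "Another unique sentence here."
];

// Assumption: transform() overwrites its argument in place,
// so work on a copy and keep $sentences intact for the output.
$samples = $sentences;

$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
$vectorizer->fit($samples);       // build the vocabulary from the raw strings
$vectorizer->transform($samples); // each string becomes a token-count vector

// The number of clusters is the value the user would control.
$kmeans = new KMeans(2);
$clusters = $kmeans->cluster($samples);

foreach ($clusters as $clusterId => $cluster) {
    echo "Cluster " . ($clusterId + 1) . ":\n";
    // Assumption: each cluster is keyed by the original sample index,
    // and the values are the count vectors rather than indexes.
    foreach (array_keys($cluster) as $index) {
        echo "- " . $sentences[$index] . "\n";
    }
    echo "\n";
}
```

One thing to keep in mind with this sketch: WhitespaceTokenizer only splits on whitespace, so "sentence." and "sentence" count as different tokens, as do "Here" and "here". Lowercasing the sentences and stripping punctuation before vectorizing would probably give more meaningful groups.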