I am new to text classification and am building some proofs of concept in PHP to better understand the ML concepts involved. I started from the example below and added one more short text to "reinforce" one of my labels (categories), in this case Japan:
<?php
include_once './vendor/autoload.php';
//source: https://www.softnix.co.th/2018/08/19/naive-bays-text-classification-with-php/
use Phpml\Classification\NaiveBayes;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
use Phpml\Tokenization\WordTokenizer;
use Phpml\FeatureExtraction\TfIdfTransformer;
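// Training texts; $arr_label below holds one label per text, in the same order.
$arr_text = [
    "London bridge is falling down",
    "japan samurai Universal Studio spider man",
    "china beijing",
    "thai Chiangmai",
    "Universal Studio Hollywood",
    "2020 Olympic games" // extra sample added to reinforce the Japan label
];
$arr_label = [
    "London", "Japan", "China", "Thailand", "USA", "Japan"
];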
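// Tokenize each text into words and build the vocabulary from the training texts.
$tokenize = new WordTokenizer();
$vectorizer = new TokenCountVectorizer($tokenize);
$vectorizer->fit($arr_text);
$vocabulary = $vectorizer->getVocabulary();
// transform() modifies its argument in place, so work on a copy of the texts.
$arr_transform = $arr_text;
$vectorizer->transform($arr_transform);
// Re-weight the raw token counts with TF-IDF.
$transformer = new TfIdfTransformer($arr_transform);
$transformer->transform($arr_transform);
// Train on the TF-IDF vectors, one label per sample.
$classifier = new NaiveBayes();
$classifier->train($arr_transform, $arr_label);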
$arr_testset = [
    'Hello Chiangmai I am Siam',
    'I want to go Universal Studio',
    'I want to go Universal Studio because I want to watch spider man',
    'Sonic in 2020'
];
$vectorizer->transform($arr_testset);
$transformer->transform($arr_testset);
$result = $classifier->predict($arr_testset);
var_dump($result);
The problem is that, after adding Japan a second time to the array of labels, the result was:
array (size=4)
  0 => string 'Japan' (length=5)
  1 => string 'Japan' (length=5)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)
But I was expecting:
array (size=4)
  0 => string 'Thailand' (length=8)
  1 => string 'USA' (length=3)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)
So, how do I add new samples to an existing label?
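To make clear what I mean by "adding a sample to a label", here is a minimal sketch in plain PHP (no php-ml involved). The helper `addSample()` and the variables `$texts`, `$labels`, and `$perLabel` are names I made up for illustration only. It just shows my understanding that the texts and labels are two parallel arrays, so a label is repeated once for every text that belongs to it:

```php
<?php
// Hypothetical helper (not part of php-ml): append one training text and its
// label together so the two parallel arrays never get out of step.
function addSample(array &$texts, array &$labels, string $text, string $label): void
{
    $texts[] = $text;
    $labels[] = $label;
}

$texts = ["japan samurai Universal Studio spider man", "china beijing"];
$labels = ["Japan", "China"];

// Second sample for the already existing "Japan" label.
addSample($texts, $labels, "2020 Olympic games", "Japan");

// Count how many samples each label has now.
$perLabel = array_count_values($labels);
print_r($perLabel);
```

If this is the right way to pair samples with labels, then my arrays above should already be correct, and the question is really why the second Japan sample changes every prediction.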