
I have an array containing many millions of words. I need to build an associative array whose keys are incorrect (misspelled) variants of these words and whose values are the corresponding correct words. An incorrect variant must not coincide with any correct word in the array, and the incorrect variants must not coincide with each other. I need these generated variants in order to correct misspelled words written in Cyrillic script (neither Russian nor English). As an example, take the words "apple" and "lost". Here is an array of correct words from which the incorrect variants are generated:

<?php
$correct_words = array(
   "apple",
   "lost",
   "lot",
   "microsoft"
); 
?>

I want the result to look like this:

<?php
$incorrect_variant_words = array(
    "aple"=>"apple",
    "lst"=>"lost",
    "lt"=>"lot",
    "microsot"=>"microsoft",
    "microsft"=>"microsoft",
    "microoft"=>"microsoft",
    "micrsoft"=>"microsoft",
    "micosoft"=>"microsoft",
    "mirosoft"=>"microsoft",
    "mcrosoft"=>"microsoft"
);
?>

I want to correct misspelled words. Please advise, or tell me if a solution to this task already exists. Google Translate, for example, implements such a function. How can I solve this without the Pspell PHP extension? Please help me with this difficult task. Below I also add an array of correctly spelled words to use as the correct values.

<?php

$array = array(

  "миёнаҳои",
  "луғатҳои",
  "онандроҷ",
  "ганҷинаи",
  "ҷамъиятӣ",
  "иҷтимоии",
  "муҳаммад",
  "рӯзмарра",
  "ҳамзабон",
  "забонҳои",
  "ҳамчунин",
  "фарҳанге",
  "феҳристи",
  "зардуштӣ",
  "таркибҳо",
  "ибораҳои",
  "калимаҳо",
  "фарҳанги",
  "тобишҳои",
  "намунаҳо",
  "нусхаҳои",
  "фирдавсӣ",
  "ҳуруфоти",
  "мутобиқи",
  "тақрибан",
  "алоҳидаи",
  "тоисломӣ",
  "паҳлавик",
  "классикӣ",
  "мӯътабар",
  "қадамҳои",
  "баргаҳои"

);

?>

Thank you in advance

John
  • Why is there only one variant of "apple" but seven of Microsoft? – Andreas Oct 03 '17 at 17:14
  • What about the word "list"? Will that too have "lst" as a variant? How do you tell them apart? – Andreas Oct 03 '17 at 17:15
  • I forgot the "apple" variants "appe" and "ale". Yes, you are right: the word "list" can also have the incorrect variant "lst". I do not yet know what to do about such collisions. – John Oct 03 '17 at 17:23

1 Answer


Iterate over the array of correct words, use similar_text to compare each one to the input value, and return the word with the highest match percentage. Basic concept:

$correct_words = array(
   "apple",
   "lost",
   "lot",
   "microsoft"
);
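$input = 'lst';
$result = null; // stays null if no word shares any characters with the input
$match = 0;     // best similarity percentage seen so far
foreach ($correct_words as $correct) {
    // similar_text() writes the similarity percentage into $percent
    similar_text($correct, $input, $percent);
    if ($percent > $match) {
        $result = $correct;
        $match = $percent;
    }
}
echo $result;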

Output is lost
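The same loop can be wrapped in a reusable function with a cut-off, so that an input with no reasonable match is left alone instead of being forced onto some dictionary word. This is only a sketch: the name closestWord() and the $minPercent parameter (with its default of 50) are assumptions for illustration, not part of PHP.

```php
<?php
// Hypothetical helper: returns the dictionary word most similar to $input,
// or null when no word reaches $minPercent (an assumed, tunable threshold).
function closestWord(string $input, array $dictionary, float $minPercent = 50.0): ?string
{
    $result = null;
    $best = -1.0; // best similarity percentage seen so far
    foreach ($dictionary as $correct) {
        similar_text($correct, $input, $percent);
        if ($percent >= $minPercent && $percent > $best) {
            $result = $correct;
            $best = $percent;
        }
    }
    return $result;
}

$correct_words = array("apple", "lost", "lot", "microsoft");
$word = closestWord('lst', $correct_words);       // "lost" (about 85.7% similar)
$none = closestWord('lst', $correct_words, 90.0); // null, nothing reaches 90%
```

Returning null lets the caller decide whether to keep the original word or flag it for review.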

Edit to add result of your query

$correct_words = array(
   "тоҷик",
   "тоҷикӣ",
   "тоҷики"
);
$input = array("тоҷикӣ", "тоҷики", "точик", "точикӣ", "точики", "тоики", "тоикӣ", "тоҷӣкӣ", "тҷикӣ", "тчики", "тҷӣкӣ", "тчик");
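foreach ($input as $in) {
    $result = $in; // fall back to the input itself if nothing matches
    $match = 0;    // reset the best percentage for every input word
    foreach ($correct_words as $correct) {
        similar_text($correct, $in, $percent);
        if ($percent > $match) {
            $result = $correct;
            $match = $percent;
        }
    }
    echo "$in is corrected to $result\r\n";
}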

Result is:

тоҷикӣ is corrected to тоҷикӣ
тоҷики is corrected to тоҷики
точик is corrected to тоҷик
точикӣ is corrected to тоҷикӣ
точики is corrected to тоҷики
тоики is corrected to тоҷики
тоикӣ is corrected to тоҷикӣ
тоҷӣкӣ is corrected to тоҷикӣ
тҷикӣ is corrected to тоҷикӣ
тчики is corrected to тоҷики
тҷӣкӣ is corrected to тоҷикӣ
тчик is corrected to тоҷик
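One caveat worth checking on your own data: similar_text() and PHP's built-in levenshtein() compare bytes, not characters, and each Cyrillic letter takes two bytes in UTF-8, so the percentages are computed over bytes. A minimal sketch of a character-level alternative follows. Note that it swaps the similarity percentage for edit distance (Levenshtein), where lower means closer; mbLevenshtein() is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: Levenshtein distance counted in UTF-8 characters, not bytes.
function mbLevenshtein(string $a, string $b): int
{
    // Split both strings into arrays of characters; the /u flag enables UTF-8 mode.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);

    // $prev[$j] holds the distance between the processed prefix of $s and the first $j chars of $t.
    $prev = range(0, count($t));
    foreach ($s as $i => $ch) {
        $curr = array($i + 1);
        foreach ($t as $j => $ch2) {
            $cost = ($ch === $ch2) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // deletion
                $curr[$j] + 1,     // insertion
                $prev[$j] + $cost  // substitution (or match)
            );
        }
        $prev = $curr;
    }
    return $prev[count($t)];
}

$distance = mbLevenshtein('точик', 'тоҷик'); // 1: one letter substituted
```

To pick a correction with it, choose the dictionary word with the smallest distance and reject the match when the distance is too large for the word's length.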
miknik
  • And what if three words are very similar? For example, the three words "тоҷики", "тоҷикӣ" and "тоҷик" are all correct. Does this rule still work correctly if these words are entered in an incorrect form? – John Oct 03 '17 at 17:32
  • Give me some incorrect entries to try and I'll run them through and tell you what it spits out – miknik Oct 03 '17 at 17:37
  • тоҷикӣ тоҷики точик точикӣ точики тоики тоикӣ тоҷӣкӣ тҷикӣ тчики тҷӣкӣ тчик – John Oct 03 '17 at 17:40
  • How can a whole text be corrected? $text = "тоҷикӣ ман ба марс тоҷики бо лахзаи точик дарёфт карда точикӣ зада ба точики назди ӯ тоики онҳо тоикӣ бисёр давад тоҷӣкӣ шумо то тҷикӣ ба назди тчики мо пеши тҷӣкӣ назар кунед тчик точико тоҷико"; – John Oct 03 '17 at 17:45
  • Updated my answer with the output from your list – miknik Oct 03 '17 at 17:53
  • How can the full text be fixed? For example, the text can contain periods, commas, other characters and numbers, and the original format should not be lost. $text = "тоҷикӣ ман ба марс тоҷики! бо лахзаи точик, дарёфт карда точикӣ зада. ба точики назди... ӯ тоики онҳо тоикӣ бисёр? давад тоҷӣкӣ шумо то тҷикӣ ба назди тчики мо пеши тҷӣкӣ назар кунед тчик точико тоҷико" – John Oct 03 '17 at 17:59
  • Are you only matching for those 3 words, or do you have a bigger array of correct words to use? You can explode your string into an array of single words, use some regex to ignore but preserve punctuation and numbers and then iterate through the array checking each word against the list. Create another array and add either the original word or your replacement to the new array and then implode it at the end. Your results will be influenced heavily by how many correct words you have in your dictionary and the thresholds you set for replacing a word or leaving it. – miknik Oct 03 '17 at 19:07
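The approach described in the last comment could be sketched roughly as follows. Instead of explode/implode it uses preg_replace_callback, which rewrites only the runs of letters and leaves punctuation, digits and spacing exactly where they were. The function name correctText() and the 80% default threshold are assumptions for illustration; the right threshold depends on your dictionary.

```php
<?php
// Hypothetical helper: replaces each word in $text with its closest dictionary
// word when the similarity reaches $minPercent; everything else is left untouched.
function correctText(string $text, array $dictionary, float $minPercent = 80.0): string
{
    // [\p{L}\p{M}]+ matches runs of letters and combining marks in any script.
    return preg_replace_callback(
        '/[\p{L}\p{M}]+/u',
        function (array $m) use ($dictionary, $minPercent) {
            $word = $m[0];
            $best = $word;       // keep the original unless a match clears the threshold
            $bestPercent = -1.0;
            foreach ($dictionary as $correct) {
                similar_text($correct, $word, $percent);
                if ($percent >= $minPercent && $percent > $bestPercent) {
                    $best = $correct;
                    $bestPercent = $percent;
                }
            }
            return $best;
        },
        $text
    );
}

$correct_words = array("apple", "lost", "lot", "microsoft");
$fixed = correctText("I lst my aple, twice!", $correct_words);
// $fixed is "I lost my apple, twice!"
```

Scanning the whole dictionary for every word is linear in the dictionary size, so with many millions of correct words you would likely need an index or a pre-filter (for example by word length or first letter) in front of this loop.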