I want to detect the language used on a web page. Here I guess it based on whether words from the text appear in a keyword list.

I got this script from http://www.kangsigit.com/2017/08/php.deteksi-bahasa.html

The code works by matching words against the "INDONESIAN" and "ENGLISH" keyword lists. If one of the keywords appears in the text, that language is detected.

The code:

$tulisan = "Hari ini saya dapat senyum oleh suatu hal";
 function Bahasa($tulisan, $terjemahkan) {
      $bahasa_pilihan = array('INDONESIAN','ENGLISH');
      $katakunci['INDONESIAN'] = array ('cinta', 'marah', 'sayang', 'benci', 'senyum', 'peluk');
      $katakunci['ENGLISH'] = array ('the', 'and', 'have', 'for', 'with', 'you');
      $tulisan = preg_replace("/[^A-Za-z]/", ' ', $tulisan);
      foreach ($bahasa_pilihan as $bahasa) {
        $kalkulasi[$bahasa]=0;
      }
      for ($i = 0; $i < 6; $i++) {
        foreach ($bahasa_pilihan as $bahasa) {
          $kalkulasi[$bahasa] = $kalkulasi[$bahasa] +
            substr_count($tulisan, ' ' . $katakunci[$bahasa][$i] . ' ');
        }
      }
      $max = max($kalkulasi);
      $maxs = array_keys($kalkulasi, $max);
      if (count($maxs) == 1) {
        $pemenang = $maxs[0];
        $pertamax = 0;
        foreach ($bahasa_pilihan as $bahasa) {
          if ($bahasa <> $pemenang) {
            if ($kalkulasi[$bahasa]>$pertamax) {
              $pertamax = $kalkulasi[$bahasa];
            }
          }
        }
        if (($pertamax / $max) < 0.1) {
          return $pemenang;
        }
      }
      return $terjemahkan;
    }
 echo Bahasa($tulisan, $terjemahkan);

But there is a problem. If keywords from both the INDONESIAN and ENGLISH lists appear in the text, the script fails.

For example, change the input like this:

$tulisan = "Hari ini saya dapat senyum oleh suatu hal, you know?";

The two words "senyum" and "you" come from different keyword lists, and this produces an error.

Is there a way to fix it?

UPDATE:

If the text contains two Indonesian keywords and only one English keyword, then Indonesian should be the winner. But the code above does not work as I expected.

For example:

$tulisan = "Hari ini saya cinta dan dapat senyum oleh suatu hal, you know?";

There are two Indonesian keywords (cinta and senyum).

There is one English keyword (you).

So the detected language should be INDONESIAN.

mickmackusa
GeeJhon

2 Answers

I think you need to do it like this:

<?php

$tulisan = "Hari ini saya dapat senyum oleh suatu hal";

function Bahasa($tulisan) {
  $bahasa_pilihan = array('INDONESIAN','ENGLISH');
  $katakunci['INDONESIAN'] = array ('cinta', 'marah', 'sayang', 'benci', 'senyum', 'peluk');
  $katakunci['ENGLISH'] = array ('the', 'and', 'have', 'for', 'with', 'you');

  // Split on spaces only, so punctuation stays attached to words (e.g. "hal,").
  $exploded_string = explode(' ',$tulisan);
  $indonesian_counter = 0;
  $english_counter = 0;

  // Count how many words appear in each keyword list.
  foreach($exploded_string as $string){
     if(in_array($string, $katakunci['INDONESIAN'])){
       $indonesian_counter +=1;
     }
     if(in_array($string, $katakunci['ENGLISH'])){
       $english_counter +=1;
     }
  }
  if($indonesian_counter > $english_counter){
    echo "The given string has more Indonesian words";echo PHP_EOL;
  }
  if($english_counter > $indonesian_counter){
    echo "The given string has more English words";echo PHP_EOL;
  }
  if($english_counter == $indonesian_counter){
    echo "The given string has a tie between languages";echo PHP_EOL;
  }

}

Bahasa($tulisan);

Output: https://eval.in/842143 or https://eval.in/842145 (case-insensitive)

Note: if you want to make the search case-insensitive, use:

if(in_array(strtolower($string), array_map("strtolower",$katakunci['INDONESIAN']))){

And the same for English:

if(in_array(strtolower($string), array_map("strtolower",$katakunci['ENGLISH']))){
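
The two case-insensitive checks above can also be folded into a single pass. The following is only a sketch, not part of the original answer: the helper name countKeywords() and its signature are made up for illustration, and it assumes the keyword lists are already lowercase and contain only ASCII letters. It lowercases the text once and splits on non-letters, so words with attached punctuation such as "hal," or "know?" are still compared cleanly.

```php
<?php
// Hypothetical helper (not from the answer): count keyword hits per language.
// Assumes every keyword is lowercase and made of ASCII letters only.
function countKeywords(string $text, array $keywords): array
{
    // Lowercase once, then split on anything that is not a letter,
    // so "Senyum," and "you?" become "senyum" and "you".
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($keywords as $language => $list) {
        // array_intersect keeps every element of $words that occurs in $list,
        // repeats included, so a keyword used twice is counted twice.
        $counts[$language] = count(array_intersect($words, $list));
    }
    return $counts;
}

$keywords = [
    'INDONESIAN' => ['cinta', 'marah', 'sayang', 'benci', 'senyum', 'peluk'],
    'ENGLISH'    => ['the', 'and', 'have', 'for', 'with', 'you'],
];

$counts = countKeywords(
    'Hari ini saya cinta dan dapat senyum oleh suatu hal, you know?',
    $keywords
);
print_r($counts); // INDONESIAN => 2, ENGLISH => 1
```

Returning the counts instead of echoing a message leaves the decision (winner, tie, no match) to the caller.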
Alive to die - Anant

This is an optimized method that maintains your search words from each language as an array.

It uses the power of preg_match_all() with a pattern including word boundaries, alternatives, and a case-insensitive flag.

This method is very well suited for your case because you will not need to prepare your string using preg_replace() or strtolower().

The condition is built for speed: because && short-circuits, if the search for English keywords returns 0, the search for Indonesian keywords is never run. In other words, when there are no English words, only two function calls happen before the return (preg_match_all() once and implode() once). When there are one or more English words in $tulisan, the same two functions are each called one more time.

preg_match_all() is the perfect function for this task because it removes the need for any looping, it can be set to case-insensitive, and it returns the number of matches that it finds.
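
As a quick illustration of that return value, here is a minimal sketch using the English keyword pattern on a made-up test sentence (the sentence is not from the question). It shows the two properties relied on here: the i flag makes "You" and "you" equivalent, and \b stops a keyword from matching inside a longer word.

```php
<?php
$pattern = '/\b(?:the|and|have|for|with|you)\b/i';

// "You" and "you" both match because of the i flag.
// "your" and "young" do not match: \b requires a word boundary right after "you".
$count = preg_match_all($pattern, 'You know, your young friend and you');

echo $count; // 3  ("You", "and", "you")
```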

function Bahasa($tulisan){
    $katakunci['INDONESIAN'] = array ('cinta', 'marah', 'sayang', 'benci', 'senyum', 'peluk');
    $katakunci['ENGLISH'] = array ('the', 'and', 'have', 'for', 'with', 'you');
    if(($eng=preg_match_all('/\b(?:'.implode('|',$katakunci['ENGLISH']).')\b/i',$tulisan)) && $eng>preg_match_all('/\b(?:'.implode('|',$katakunci['INDONESIAN']).')\b/i',$tulisan)){
        return 'English';  // if English > 0 AND English is greater than Indonesian
    }else{
        return "Indonesian";  // if English == 0 OR Indonesian >= English
    }
}

These are some calls and outputs: (Demo)

$tulisan = "Hari ini saya dapat senyum oleh suatu hal, you know?";
echo Bahasa($tulisan);  // Indonesian  (because senyum x1, you x1)

$tulisan = "Hari ini saya dapat senyum oleh suatu hal?";
echo Bahasa($tulisan);  // Indonesian  (because no English)

$tulisan = "You know, hari ini saya dapat senyum oleh suatu hal, you know?";
echo Bahasa($tulisan);  // English  (because senyum x1, you x2)

Now if you are happy/comfortable dealing directly with the pattern expression, you can improve efficiency and brevity like this:

function Bahasa($tulisan){
    if(($eng=preg_match_all('/\b(?:the|and|have|for|with|you)\b/i',$tulisan)) && $eng>preg_match_all('/\b(?:cinta|marah|sayang|benci|senyum|peluk)\b/i',$tulisan)){
        return 'English';  // if English > 0 AND English is greater than Indonesian
    }else{
        return "Indonesian";  // if English == 0 OR Indonesian >= English
    }
}
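
Both versions above return Indonesian when the counts are tied or when no keyword is found at all. If you need to tell those cases apart, or want to add more languages later, one possible variation is sketched below. The function name detectLanguage(), its signature, and the choice to return null for "undecided" are assumptions for illustration, not part of the method above. It also assumes $keywords is not empty.

```php
<?php
// Hypothetical function name and signature, for illustration only.
// Returns the language with the most keyword hits, or null when undecided.
function detectLanguage(string $text, array $keywords): ?string
{
    $counts = [];
    foreach ($keywords as $language => $list) {
        // preg_quote guards against keywords containing regex metacharacters.
        $quoted = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $list);
        $counts[$language] = preg_match_all('/\b(?:' . implode('|', $quoted) . ')\b/i', $text);
    }

    $max = max($counts);
    $winners = array_keys($counts, $max);

    // Undecided: no keyword found, or several languages share the top score.
    if ($max === 0 || count($winners) > 1) {
        return null;
    }
    return $winners[0];
}

$keywords = [
    'INDONESIAN' => ['cinta', 'marah', 'sayang', 'benci', 'senyum', 'peluk'],
    'ENGLISH'    => ['the', 'and', 'have', 'for', 'with', 'you'],
];

// cinta + senyum (2) beats you (1)
echo detectLanguage('Hari ini saya cinta dan dapat senyum oleh suatu hal, you know?', $keywords); // INDONESIAN
```

Unlike the short-circuit version, this always runs one preg_match_all() per language, so it trades a little speed for an explicit tie result.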
mickmackusa
  • @GeeJhon Please have another look at my answer. I have taken the time to optimize my answer and explain my method. – mickmackusa Aug 07 '17 at 08:00