I'm trying to generate slugs to store in my DB. My locales include English, several European languages, and Japanese.

Digits (\d) and word characters (\w) are allowed, European characters are transliterated to ASCII, and Japanese characters are left untouched. Periods, plus signs and dashes (-) are kept. Leading and trailing whitespace is removed, and whitespace in between is replaced by a dash.

Here is my code so far (please feel free to improve it given the conditions above, as my regex-fu is currently white-belt tier):

function ToSlug($string, $separator='-') {
    $url = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $url = preg_replace('/[^\d\w一-龠ぁ-ゔァ-ヴー々〆〤.+ -]/', '', $url);
    $url = strtolower($url);
    $url = preg_replace('/[ ' . $separator . ']+/', $separator, $url);
    return $url;
}

I'm testing this function, but my Japanese characters are not getting through: they are simply replaced by ''. I suspect the //IGNORE is what strips them, but I need it there, or else my German and French transliterations will not work. Any ideas on how I can fix this?

EDIT: I'm not sure whether the Japanese kanji range covers all of Simplified Chinese, but I'm going to need that and Korean as well. If anyone knows the regex offhand, please let me know; it will save me some time searching. Thanks.

tiffanyhwang

1 Answer

Note: I am not familiar with the Japanese writing system.

Looking at the function, the iconv call appears to be what removes all the Japanese characters, since they have no ASCII equivalent. Instead of using iconv to transliterate, it may be easier to just create a function that does it:

function _toSlugTransliterate($string) {
    // Lowercase equivalents found at:
    // https://github.com/kohana/core/blob/3.3/master/utf8/transliterate_to_ascii.php
    $lower = [
        'à'=>'a','ô'=>'o','ď'=>'d','ḟ'=>'f','ë'=>'e','š'=>'s','ơ'=>'o',
        'ß'=>'ss','ă'=>'a','ř'=>'r','ț'=>'t','ň'=>'n','ā'=>'a','ķ'=>'k',
        'ŝ'=>'s','ỳ'=>'y','ņ'=>'n','ĺ'=>'l','ħ'=>'h','ṗ'=>'p','ó'=>'o',
        'ú'=>'u','ě'=>'e','é'=>'e','ç'=>'c','ẁ'=>'w','ċ'=>'c','õ'=>'o',
        'ṡ'=>'s','ø'=>'o','ģ'=>'g','ŧ'=>'t','ș'=>'s','ė'=>'e','ĉ'=>'c',
        'ś'=>'s','î'=>'i','ű'=>'u','ć'=>'c','ę'=>'e','ŵ'=>'w','ṫ'=>'t',
        'ū'=>'u','č'=>'c','ö'=>'o','è'=>'e','ŷ'=>'y','ą'=>'a','ł'=>'l',
        'ų'=>'u','ů'=>'u','ş'=>'s','ğ'=>'g','ļ'=>'l','ƒ'=>'f','ž'=>'z',
        'ẃ'=>'w','ḃ'=>'b','å'=>'a','ì'=>'i','ï'=>'i','ḋ'=>'d','ť'=>'t',
        'ŗ'=>'r','ä'=>'a','í'=>'i','ŕ'=>'r','ê'=>'e','ü'=>'u','ò'=>'o',
        'ē'=>'e','ñ'=>'n','ń'=>'n','ĥ'=>'h','ĝ'=>'g','đ'=>'d','ĵ'=>'j',
        'ÿ'=>'y','ũ'=>'u','ŭ'=>'u','ư'=>'u','ţ'=>'t','ý'=>'y','ő'=>'o',
        'â'=>'a','ľ'=>'l','ẅ'=>'w','ż'=>'z','ī'=>'i','ã'=>'a','ġ'=>'g',
        'ṁ'=>'m','ō'=>'o','ĩ'=>'i','ù'=>'u','į'=>'i','ź'=>'z','á'=>'a',
        'û'=>'u','þ'=>'th','ð'=>'dh','æ'=>'ae','µ'=>'u','ĕ'=>'e','ı'=>'i',
    ];
    return str_replace(array_keys($lower), array_values($lower), $string);
}

So, with some modifications, it could look something like this:

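function toSlug($string, $separator = '-') {
    // Work around this, since it drops the Japanese characters...
    #$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);

    // Lowercase first: the transliteration table only has lowercase keys,
    // so 'É' would otherwise be stripped by the regex below.
    // strtolower() is not multibyte-aware, so use the mb_* version.
    $string = mb_strtolower($string, 'UTF-8');
    $string = _toSlugTransliterate($string);

    // Remove unwanted chars + trim leading/trailing whitespace
    // I got the character ranges from the following URL:
    // https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters#10508813
    $regex = '/[^一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤.+ -]|^\s+|\s+$/u';
    $string = preg_replace($regex, '', $string);

    // Collapse runs of spaces and separators into a single separator.
    // preg_quote() keeps a separator such as '.' or '+' from being read as regex syntax.
    $quoted = preg_quote($separator, '/');
    $string = preg_replace("/(?: |{$quoted})+/u", $separator, $string);
    return $string;
}
$x = '   æøå!this.ís-a test-ゔヴ ーァ   ';
echo toSlug($x);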

In regex you can use Unicode "scripts" to match letters of various languages (in PHP this requires the /u modifier). There is no "Japanese" script, but there are Hiragana, Katakana and Han. As I have no idea how Japanese is written, or how one would best combine these, I am not even going to try.

Using these scripts, however, would be done something like this:

'/[\p{Hiragana}\p{Katakana}\p{Han}]+/u'
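
As a rough sketch of how the script properties might replace the hand-written ranges, and also cover the Chinese and Korean mentioned in the question's edit: the helper name stripToSlugChars is made up for this example, and the exact character set is an assumption you would want to adjust. \p{Han} covers the ideographs shared by Chinese and Japanese, and \p{Hangul} covers Korean. The long vowel mark ー is listed explicitly because, depending on the PCRE version, it may not be matched by \p{Katakana}.

```php
<?php
// Hypothetical helper (not part of the functions above): strips every
// character that is not an ASCII letter/digit, a CJK character, or one of
// the punctuation marks the question wants to keep.
function stripToSlugChars(string $s): string {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8;
    // without it the \p{...} classes will not work on multibyte input.
    $pattern = '/[^a-zA-Z0-9\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}ー.+ -]/u';
    return preg_replace($pattern, '', $s);
}

echo stripToSlugChars('tokyo 東京! ひらがな カタカナ 한국어?'), "\n";
// prints: tokyo 東京 ひらがな カタカナ 한국어
```

This only does the "remove unwanted characters" step; transliteration, lowercasing and the separator replacement would still happen around it as in toSlug above.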
Sverri M. Olsen
    Works perfectly. I'm processing hundreds of thousands of records so the performance might have suffered a little. – tiffanyhwang Feb 03 '14 at 09:19