I have an array in which each item contains a first and a last name:

$input = [
  [
    'firstName' => 'foo',
    'lastName' => 'bar',
  ]
];

Most of the names are in the Latin alphabet, but some are written in Chinese.

How would I sort this list of names using PHP?

I'm also curious about convention. I know that in languages using the Latin alphabet, sometimes the first name comes first, and at other times the last. I'm curious if this situation is similar in Mandarin, or if one is usually preferred over the other.

And lastly I'm curious if there's a difference between sorting of names and sorting of words, like in a dictionary.

Evert
  • A basic notion to all sorting is that you have to have elements that can be compared. I'm not seeing that if the data contains latin characters and chinese characters. – BigScar Apr 30 '15 at 20:09
  • It doesn't seem that Chinese even has a hard set of rules for the order of characters: http://www.cantonese.sheik.co.uk/phorum/read.php?1,122672,122681 – Jeremy Harris Apr 30 '15 at 20:11
  • @BigScar one option is that if a list of names contains mixed latin and chinese names, we just pick one of those two and display it first. I'm more concerned about properly sorting the Chinese names amongst themselves. Curious if there's some best practices out there. – Evert Apr 30 '15 at 20:17
  • For Chinese/Korean/Japanese, you always do family name first. In the example of Kim Jong-il, Kim is the family name and Jong-il is the given name. We run into an issue of Japanese and Chinese both using hanzi/kanji for names, and I believe both languages sort names differently. – Muhammad Abdul-Rahim Apr 30 '15 at 20:45
  • I did some research on Japanese sorting, @Evert, and it's very non-trivial because kanji can be pronounced differently depending on context. Many sites in Japan, like Amazon, ask the user not only to put their name in kanji, but also in kana. Kana can be sorted easily since it's a 1-to-1 for pronunciation. Kanji can't. 淳子 can be Junko, Atsuko, Kiyoko, Akiko... How does Chinese Amazon look? Do they have a Chinese Amazon? – Muhammad Abdul-Rahim May 06 '15 at 19:01

1 Answer

Really interesting question! Each character has a Unicode code point, and a simple sort can go by that. Since Latin letters are in the ASCII range, those names always come first. PHP's asort function compares strings byte by byte, which for UTF-8 strings gives the same order as comparing code points. Here is an input to consider:

$input = [
    [
        "firstName" => "一",
        "lastName"  => "風"
    ],
    [
        "firstName" => "이",
        "lastName"  => "정윤"
    ],
    [
        "firstName" => "Mari",
        "lastName"  => "M"
    ],
    [
        "firstName" => "三",
        "lastName"  => "火"
    ],
];

Let's summarize what I expect to see, assuming we sort by first name:

  • Latin name first (Mari M)
  • Hanzi/kanji/hangeul names next. I don't know what the values of these names are, so we have to find out.

Let's convert the first character of the first names to something numeric. Again, we are using Unicode for this conversion:

  • 一 is 0x4E00
  • 이 is 0xC774
  • M is 0x004D
  • 三 is 0x4E09
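The values above can be checked with a short script. This is only a sketch: `firstCodePoint` is a hypothetical helper name, and it assumes non-empty UTF-8 strings and the mbstring extension (`mb_ord` needs PHP 7.2 or later).

```php
<?php
// Hypothetical helper: returns the code point of the first character
// of a non-empty UTF-8 string, formatted like "0x4E00".
function firstCodePoint(string $s): string
{
    // Take the first character (not the first byte) of the string.
    $first = mb_substr($s, 0, 1, 'UTF-8');

    // mb_ord() gives the code point as an integer; print it as hex.
    return sprintf('0x%04X', mb_ord($first, 'UTF-8'));
}

foreach (['一', '이', 'Mari', '三'] as $name) {
    echo $name, ' => ', firstCodePoint($name), PHP_EOL;
}
```

Only the first character is inspected here, which is enough for this input because no two first names share a leading character.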

As such, I expect to see, in order:

  • M (0x004D)
  • 一 (0x4E00)
  • 三 (0x4E09)
  • 이 (0xC774)

Here is my code, using asort:

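// Build one "first last" display string per person.
$nameByFirst = [];
foreach( $input as $i )
{
    $nameByFirst[] = $i["firstName"]." ".$i["lastName"];
}
// Sort the strings by value; asort keeps the original keys.
asort($nameByFirst);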

And my printing method:

$i = 1;
foreach( $nameByFirst as $name )
{
    echo $i.'.  '.$name."<br>";
    $i++;
}

And my output:

  1. Mari M
  2. 一 風
  3. 三 火
  4. 이 정윤

My results, as you can see above, are in order. Latin first, then hanzi/kanji, then hangeul. Unicode is the closest I believe we can get to an easy sort, so I like to go by that. I'm not 100% sure on how Unicode assigned values to hanzi/kanji/hangeul, but I'm willing to trust the order they provided, especially because of how simple it is.
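If the original records should stay intact rather than being flattened into strings, the same byte-wise ordering can be applied with usort. This is a minimal sketch, not a complete solution: `sortByFirstName` is a hypothetical helper, and it assumes every value is valid UTF-8, in which case strcmp's byte order matches code point order.

```php
<?php
$input = [
    ['firstName' => '一',   'lastName' => '風'],
    ['firstName' => '이',   'lastName' => '정윤'],
    ['firstName' => 'Mari', 'lastName' => 'M'],
    ['firstName' => '三',   'lastName' => '火'],
];

// Hypothetical helper: returns a copy of the records sorted by first name,
// using the last name to break ties. strcmp() compares raw bytes.
function sortByFirstName(array $people): array
{
    usort($people, function (array $a, array $b): int {
        // The short ternary falls through to the last name only when
        // the first names compare as equal (strcmp returned 0).
        return strcmp($a['firstName'], $b['firstName'])
            ?: strcmp($a['lastName'], $b['lastName']);
    });

    return $people;
}

foreach (sortByFirstName($input) as $person) {
    echo $person['firstName'], ' ', $person['lastName'], PHP_EOL;
}
```

Swapping the two strcmp calls would sort by family name first, which may suit Chinese, Korean and Japanese names better.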

Muhammad Abdul-Rahim
  • asort generally sorts based on byte values, not Unicode code points, so I know this solution can't be complete. Many languages have modifiers in various Unicode normalization forms, which messes with the order. Even for Latin names you could consider it incorrect if a last or first name was spelled without a capital. – Evert Apr 30 '15 at 20:41
  • Fair point. For Latin scripts the concept of capital letters exists, but not so much for other languages. I see the point though, since each language has its own means of ordering alphabetically. I'll think about this some more. Here's related reading though: http://stackoverflow.com/questions/5698226/sort-for-japanese – Muhammad Abdul-Rahim Apr 30 '15 at 20:43
  • This exists: http://php.net/manual/en/collator.construct.php But the drawback is that I need to feed it the language for sorting. This is hard because my use-case is basically a massive address book for which I don't know the locales in advance =) – Evert Apr 30 '15 at 20:51
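For reference, the Collator approach from the last comment might look roughly like the sketch below. It assumes the intl extension is installed. `sortNamesWithCollator` is a hypothetical helper, and using the language-neutral `'root'` locale when the real locale is unknown is an assumption made for illustration, not a recommendation from the thread.

```php
<?php
// Hypothetical helper: sorts display names with ICU collation rules.
// Assumes the intl extension; 'root' selects ICU's language-neutral default order.
function sortNamesWithCollator(array $names, string $locale = 'root'): array
{
    $collator = new Collator($locale);
    $collator->sort($names); // sorts the array in place by collation order

    return $names;
}

if (class_exists('Collator')) {
    // Collation ignores case at the primary level: Alice, bob, Zed.
    print_r(sortNamesWithCollator(['bob', 'Alice', 'Zed']));
}

// For comparison, the byte-based sort() puts "bob" after "Zed".
$plain = ['bob', 'Alice', 'Zed'];
sort($plain);
print_r($plain);
```

A specific locale such as `'zh'` could be passed as the second argument instead. Which order that produces for Chinese names depends on the ICU data installed, so it would need checking against real data.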