
What is the best way to maintain a function that collates strings identically in both Perl and Java? Here is the sample function in Perl:

sub compare_strs
{
    my ( $str1, $str2 ) = @_;
    # Treat vars as strings by quoting. 
    # Possibly incorrect/irrelevant approach. 
    return ("$str1" cmp "$str2");
}

Concerns here are:

  • The string can contain Chinese/Japanese characters. The Perl code above cannot be depended upon to give the expected result.
  • How does one guarantee that both Perl and Java implementations can perform string collations in an identical manner?

syker
    • It’s really really really hard to understand what you are asking here. Please give examples, and include what your concerns are. – tchrist Jul 26 '13 at 20:08
    • I'm not sure I understand the question. Are you asking how to write code in Perl and Java that orders Unicode strings, and guarantees the same ordering in the Perl and Java implementations? – amon Jul 26 '13 at 20:09
    • Why are you quoting those variables, BTW? And what do you mean ***may*** contain Unicode characters anyway? Anyway, you should be using a collation module if you want identical collation. You should not be doing bitwise equivalence or order: that really doesn’t make sense in Unicode. Use Unicode::Collate in Perl. – tchrist Jul 26 '13 at 20:27
    • You haven’t said what your idea of “the expected result” is, so it’s hard to answer. I assure you that Perl will give the expected result once your expectations are correctly established. I assume you want the Japanese sorted according to JIS X 0208, but which of the six Chinese collations do you want, and how do you plan to mix those? See the Unicode::Collate::Locale module. – tchrist Jul 26 '13 at 21:19
    • @tchrist The expected result was just to easily say if one string is less than another and have it work for any type of character set. Let me put some research into these modules. – syker Jul 26 '13 at 22:35
    • But what does it mean for one string to be “less than” another string? “Less than” is a property of numbers, and you are talking about a string. – tchrist Jul 26 '13 at 22:42
    • An example use case of why this might be helpful: binary searching strings. For example this [module](http://search.cpan.org/~davido/List-BinarySearch-0.11/lib/List/BinarySearch.pm#UNICODE_SUPPORT) uses Unicode::Collate for string comparison. Investigating how Java's [Collator](http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html) class can play into binary search as well. – syker Jul 26 '13 at 23:38

    1 Answer


    For Perl, don't use the cmp operator. Instead, you should be using the Unicode::Collate module:

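    use strict;
    use warnings;
    use Unicode::Collate;

    # Build the collator once and reuse it for every comparison.
    my $Collator = Unicode::Collate->new;

    sub compare_strs
    {
        my ( $str1, $str2 ) = @_;
        # Quotes kept from the question; they only force stringification.
        return $Collator->cmp("$str1", "$str2");
    }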
    

    If you're worried about normalization (e.g., order of combining marks), you can also use the Unicode::Normalize module.

    In Java, use the Collator class, as described in the tutorial on comparing strings. For normalization, see the tutorial on normalizing text. The java.text.Normalizer class was introduced in Java 1.6 (Collator itself has been in java.text since JDK 1.1); if you need normalization on earlier versions of Java, you will need to use something like the ICU libraries.
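
    As a rough Java counterpart to the Perl function above, here is a minimal sketch. The class name CollationSketch, the method name compareStrs, and the choice of Locale.ROOT are assumptions made for illustration, not part of the original answer. It normalizes both strings to NFC and then compares them with java.text.Collator. It does not by itself guarantee the same order as Unicode::Collate for every input, so it is worth checking both sides against your own sample data.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the name and the use of Locale.ROOT are assumptions.
final class CollationSketch {
    // Locale-neutral collator, created once and reused.
    private static final Collator COLLATOR = Collator.getInstance(Locale.ROOT);

    // Returns a negative, zero, or positive value, like Perl's cmp.
    static int compareStrs(String a, String b) {
        // Normalize first so canonically equivalent strings compare as equal.
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return COLLATOR.compare(na, nb);
    }

    public static void main(String[] args) {
        // Collation puts "apple" before "Banana"; code-unit order does not.
        System.out.println(compareStrs("apple", "Banana") < 0);    // true
        System.out.println("apple".compareTo("Banana") < 0);       // false
        // Precomposed and decomposed forms of the same letter compare as equal.
        System.out.println(compareStrs("\u00e9", "e\u0301") == 0); // true
    }
}
```

    The second line of output shows why a plain String.compareTo (or Perl's cmp) is not a substitute: it compares code units, so every uppercase ASCII letter sorts before every lowercase one.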

    Using the appropriate tools as described above should ensure that both environments behave according to the Unicode collation algorithm (and hence compatibly with one another).
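
    For the binary-search use case raised in the question's comments, the same Collator can serve as the comparator for both sorting and searching. The sketch below is only an illustration; CollatedSearchSketch and find are hypothetical names. The one real requirement is that the list was sorted with the same collator that is used for the search.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration only.
final class CollatedSearchSketch {
    // Collator implements Comparator<Object>, so it can be passed directly.
    static int find(List<String> sortedList, String key, Collator collator) {
        return Collections.binarySearch(sortedList, key, collator);
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<String> words =
            new ArrayList<>(Arrays.asList("pear", "Banana", "apple", "cherry"));
        // Sort with the collator so the search sees a consistently ordered list.
        Collections.sort(words, collator);
        System.out.println(words);
        // A non-negative result is the index of the key; a negative one means "not found".
        System.out.println(find(words, "cherry", collator));
        System.out.println(find(words, "kiwi", collator));
    }
}
```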

    Ted Hopp
    • What do the double quotes buy you? – tchrist Jul 26 '13 at 20:46
    • @tchrist - Most of the time (actually, almost all of the time) it gets you nothing. However, there are rare cases where explicitly stringifying a variable might make a difference. See [this thread](http://stackoverflow.com/a/9158770/535871). – Ted Hopp Jul 26 '13 at 20:56
    • I know what it does. I wanted to understand why you thought they were a good idea. – tchrist Jul 26 '13 at 21:08
    • @tchrist - If I was coding instead of copy-and-paste from OP's original question, I wouldn't have used the quotes. _"Why do you think they are a good idea"_ (I don't) and _"What do they buy you"_ aren't exactly the same question. :) – Ted Hopp Jul 26 '13 at 21:13
    • @tchrist - Yikes. I just looked at your profile. Didn't realize who I was talking to! – Ted Hopp Jul 26 '13 at 21:15
    • Unicode::Collate (and the possibly preferable Unicode::Collate::Locale add-on module) already take normalization into account. – tchrist Jul 26 '13 at 22:43