22

Suppose I have a two-character String that should represent an ISO 639 language code or an ISO 3166 country code.

The Locale class has two methods, getISOLanguages and getISOCountries, that return an array of String with all the ISO languages and ISO countries, respectively.

To check whether a specific String is a valid ISO language or ISO country, I have to look inside those arrays for a matching String. I can do that with a search (e.g. Arrays.binarySearch or the Apache Commons ArrayUtils.contains).

The question is: is there any utility (e.g. from the Guava or Apache Commons libraries) that provides a cleaner way, e.g. a function that returns a boolean to validate a String as a valid ISO 639 language or ISO 3166 country?

For instance:

public static boolean isValidISOLanguage(String s)
public static boolean isValidISOCountry(String s)
mat_boy
  • Remember to check your string's length before you search the array (this or other way) – Dariusz Apr 10 '13 at 08:59
  • @Dariusz: I'm not sure I'd bother - at least if doing a hash lookup. Unless you expect to be given huge strings which would take a long time to hash, it seems like complexity for no proven significant benefit. – Jon Skeet Apr 10 '13 at 09:02
  • @JonSkeet Please, can you clarify? – mat_boy Apr 10 '13 at 09:22
  • @mat_boy: Clarify what, exactly? Which bit is unclear? – Jon Skeet Apr 10 '13 at 09:25
  • @JonSkeet Why in your opinion "_it seems like complexity for no proven significant benefit_"... – mat_boy Apr 10 '13 at 09:27
  • @mat_boy: Well exactly that: it makes the code more complex, and there would only be significant benefit if you were given lots of invalid strings which take a long time to look up. I suspect that for most applications that wouldn't be the case. – Jon Skeet Apr 10 '13 at 09:28
  • Well, maybe you are right! However, I added to your functions a check for `Pattern.matches("[a-z]+", s)` and `Pattern.matches("[A-Z]+", s)` just to be sure that the Strings contain, respectively, only lowercase and only uppercase letters. I want to throw an exception to give feedback about why the provided String is invalid. – mat_boy Apr 10 '13 at 09:32
  • @mat_boy Matching these strings against a regex may take more time than a HashSet search. If there is a chance of your strings being longer than 2 chars, check for length. Then do a hash-based search. – Dariusz Apr 10 '13 at 09:57
  • @Dariusz Thank you! Now I have a method that accepts a String and first checks isValidISO...(). If it is not valid, I then check the length and then the Pattern, to eventually throw an Exception that gives feedback to the user. Am I right? – mat_boy Apr 10 '13 at 10:07
  • What happens after calling isValidISO() is up to you - whatever you want to report to the user is your choice. I would probably just say "invalid country code", but more information is usually better:) Just make sure that the message is clear. – Dariusz Apr 10 '13 at 10:10

2 Answers

36

I wouldn't bother using either a binary search or any third party libraries - HashSet is fine for this:

import java.util.Locale;
import java.util.Set;

public final class IsoUtil {
    private static final Set<String> ISO_LANGUAGES = Set.of(Locale.getISOLanguages());
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private IsoUtil() {}

    public static boolean isValidISOLanguage(String s) {
        return ISO_LANGUAGES.contains(s);
    }

    public static boolean isValidISOCountry(String s) {
        return ISO_COUNTRIES.contains(s);
    }
}

You could check for the string length first, but I'm not sure I'd bother - at least not unless you want to protect yourself against performance attacks where you're given enormous strings which would take a long time to hash.
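If you do want the length check, it could look roughly like the sketch below. The class name `GuardedIsoUtil` is hypothetical, and the sketch assumes Java 9+ for `Set.of`. Note that sets created by `Set.of` throw a NullPointerException from `contains(null)`, so the null guard is doing real work here too.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical variant of the utility above with cheap guards before the lookup.
final class GuardedIsoUtil {
    private static final Set<String> ISO_LANGUAGES = Set.of(Locale.getISOLanguages());
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private GuardedIsoUtil() {}

    // Rejects null and anything that is not exactly two characters before hashing.
    static boolean isValidISOLanguage(String s) {
        return s != null && s.length() == 2 && ISO_LANGUAGES.contains(s);
    }

    static boolean isValidISOCountry(String s) {
        return s != null && s.length() == 2 && ISO_COUNTRIES.contains(s);
    }
}
```

With this shape, an oversized input is rejected by `length()` alone and never gets hashed.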

EDIT: If you do want to use a 3rd party library, ICU4J is the most likely contender - but that may well have a more up-to-date list than the ones supported by Locale, so you would want to move to use ICU4J everywhere, probably.

Laurent
Jon Skeet
  • I usually prefer third-party libraries (like Guava and Apache Commons) because they are frequently improved, while I cannot review my own code continuously: it is better to change the library version than to reread thousands of lines of code. However, I really appreciate your answer. Thank you! – mat_boy Apr 10 '13 at 09:03
  • @mat_boy: How would you expect this code to change over time? It's already delegating to the JDK to find the actual list of countries and languages... – Jon Skeet Apr 10 '13 at 09:04
  • Well, it is not about this code, it is a matter of principle :) Moreover, if I have already imported a library, I usually prefer to use the methods from that library to make the code more readable. – mat_boy Apr 10 '13 at 09:05
  • @mat_boy: Okay, in that case, I suspect the answer is just "no", at least on the Guava side. It's possible Apache Commons has something, but given that it would be a pretty thin wrapper, I wouldn't *expect* it. If any third party library is appropriate here, it would be icu4j – Jon Skeet Apr 10 '13 at 09:08
  • Please add one `)` before the `;` in the third and fifth lines. Thank you! – mat_boy Apr 10 '13 at 09:09
  • @mat_boy If you're already using Guava, you can use [`ImmutableSet`](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/ImmutableSet.html) which is a perfect use case for static final constants, plus the code is less cluttered: `private static final Set<String> ISO_LANGUAGES = ImmutableSet.copyOf(Locale.getISOLanguages());` – Grzegorz Rożniecki Apr 10 '13 at 09:35
  • @Xaerxess Yes, I'm using it! Thank you! – mat_boy Apr 10 '13 at 09:36
  • This will be slower than binarySearch() and will use a lot of memory. – Sergey Ponomarev Dec 14 '18 at 14:03
  • @stokito: It's a hash set - why do you expect that to be slow? – Jon Skeet Dec 14 '18 at 14:12
  • Because an array is friendlier to the CPU cache. In most situations even a linear search can be faster, especially if the country is close to the beginning. Sorting countries by population might be quite an efficient optimization, but it would be a very speculative one. You can write a JMH benchmark, but I'm pretty sure that here complexity theory doesn't match the hardware. – Sergey Ponomarev Dec 14 '18 at 22:17
  • BTW, in JDK 9 a method that returns a Set was added to Locale – Sergey Ponomarev Dec 14 '18 at 22:31
  • @stokito: I think we'll have to agree to disagree. Rather than relying on undocumented behavior, I'd prefer to just use a set (using the Java 9 call now). I certainly wouldn't start trying to microoptimize based on *assumptions* before even knowing whether this is even significant in terms of performance. The memory usage will be tiny compared with the rest of almost any realistic application, and I'd be *astonished* if this were slow enough to be noticeable at all unless you're doing *nothing* but checking ISO country codes - which seems unlikely to me. – Jon Skeet Dec 15 '18 at 08:13
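For reference, the Java 9 call mentioned in the last comments is the `Locale.getISOCountries(Locale.IsoCountryCode)` overload, which returns a `Set<String>` directly. A minimal sketch of how it might be used follows; the class name `IsoCountrySets` is hypothetical, and the null guard is there because the sketch does not assume how the returned set treats `contains(null)`.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper built on the Set-returning overload added in Java 9.
final class IsoCountrySets {
    // PART1_ALPHA2 selects the two-letter ISO 3166-1 codes.
    private static final Set<String> ALPHA2 =
            Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2);

    private IsoCountrySets() {}

    static boolean isValidISOCountry(String s) {
        return s != null && ALPHA2.contains(s);
    }
}
```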
1

As far as I know there is no such method in any library, but you can at least declare it yourself like this:

import static java.util.Arrays.binarySearch;
import java.util.Locale;

/**
 * Validator of country code.
 * Uses binary search over array of sorted country codes.
 * Country code has two ASCII letters so we need at least two bytes to represent the code.
 * Two bytes are represented in Java by short type. This is useful for us because we can use Arrays.binarySearch(short[] a, short needle)
 * Each country code is converted to short via countryCodeNeedle() function.
 *
 * Average speed of the method is 246.058 ops/ms, which is about half the speed of a lookup in a HashSet (523.678 ops/ms).
 * Complexity is O(log(N)) instead of O(1) for HashSet.
 * But it consumes only 520 bytes of RAM to keep the list of country codes instead of 22064 (> 21 Kb) to hold HashSet of country codes.
 */
public class CountryValidator {
  /** Sorted array of country codes converted to short */
  private static final short[] COUNTRIES_SHORT = initShortArray(Locale.getISOCountries());

  public static boolean isValidCountryCode(String countryCode) {
    if (countryCode == null || countryCode.length() != 2 || countryCodeIsNotAlphaUppercase(countryCode)) {
      return false;
    }
    short needle = countryCodeNeedle(countryCode);
    return binarySearch(COUNTRIES_SHORT, needle) >= 0;
  }

  private static boolean countryCodeIsNotAlphaUppercase(String countryCode) {
    char c1 = countryCode.charAt(0);
    if (c1 < 'A' || c1 > 'Z') {
      return true;
    }
    char c2 = countryCode.charAt(1);
    return c2 < 'A' || c2 > 'Z';
  }

  /**
   * A country code has two ASCII letters, so two bytes are enough to represent it.
   * Two bytes fit into Java's short type, so we convert the two letters of the country code to a short.
   * We could use something like:
   * short val = (short)((hi << 8) | lo);
   * But String.hashCode() already does something very similar (31 * first + second for a two-character string).
   * More importantly, each String object caches its hash code once it has been computed,
   * so converting a two-letter country code to a short is cheap.
   * We can rely on String's hash code because its formula is specified in the String.hashCode() Javadoc.
   **/
  private static short countryCodeNeedle(String countryCode) {
    return (short) countryCode.hashCode();
  }

  private static short[] initShortArray(String[] isoCountries) {
    short[] countriesShortArray = new short[isoCountries.length];
    for (int i = 0; i < isoCountries.length; i++) {
      String isoCountry = isoCountries[i];
      countriesShortArray[i] = countryCodeNeedle(isoCountry);
    }
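    // Binary search needs a sorted array, and the Locale Javadoc does not promise sorted output.
    java.util.Arrays.sort(countriesShortArray);
    return countriesShortArray;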
  }
}

Locale.getISOCountries() creates a new array on every call, so we store the result in a static field to avoid unnecessary allocations. At the same time, a HashSet or TreeSet consumes a lot of memory, so this validator uses a binary search over an array instead. This is a trade-off between speed and memory.
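The explicit packing mentioned in the Javadoc, `(short)((hi << 8) | lo)`, can also be used directly instead of the hash code. The sketch below is one possible shape, not the code that was benchmarked above: the names `PackedCountryLookup` and `pack` are made up for illustration, and the array is sorted explicitly rather than assuming the JDK returns the codes in order.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical variant: packs the two letters into a short instead of using hashCode().
final class PackedCountryLookup {
    private static final short[] PACKED = packAndSort(Locale.getISOCountries());

    private PackedCountryLookup() {}

    // High byte = first letter, low byte = second letter; callers must pass two ASCII letters.
    static short pack(String code) {
        return (short) ((code.charAt(0) << 8) | code.charAt(1));
    }

    private static short[] packAndSort(String[] codes) {
        short[] packed = new short[codes.length];
        for (int i = 0; i < codes.length; i++) {
            packed[i] = pack(codes[i]);
        }
        Arrays.sort(packed); // binary search requires sorted input
        return packed;
    }

    private static boolean isUpperAscii(char c) {
        return c >= 'A' && c <= 'Z';
    }

    static boolean isValidCountryCode(String code) {
        // The letter check also keeps non-ASCII chars from aliasing a valid code after the cast.
        if (code == null || code.length() != 2
                || !isUpperAscii(code.charAt(0)) || !isUpperAscii(code.charAt(1))) {
            return false;
        }
        return Arrays.binarySearch(PACKED, pack(code)) >= 0;
    }
}
```

As a worked value, "US" packs to 85 * 256 + 83 = 21843, which fits comfortably in a short.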

Sergey Ponomarev
  • I see no guarantee in the documentation that the value returned by `Locale.getISOCountries()` is sorted, which is required for a binary search to work. You could sort it first of course, but that should be part of the answer. – Jon Skeet Dec 14 '18 at 14:14
  • Fair point, but we can be sure that it will always be sorted. And yes, the javadoc should clearly state this. This is a good candidate for a pull request to the JDK – Sergey Ponomarev Dec 14 '18 at 22:30
  • Wow, I really, really wouldn't just trust existing behaviour like that. I'd definitely sort it. But then, I'd use a `HashSet` as per my answer anyway, at which point it doesn't matter. – Jon Skeet Dec 15 '18 at 08:11