Detect non Latin characters with regex Pattern in Java

Question

I think "Latin characters" is the term I mean, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I'm expecting the following results:

"abcDE 123";  // Yes, this should match
"!@#$%^&*";   // Yes, this should match
"aaàààäää";   // Yes, this should match
"ベビードラ";   // No, this shouldn't match
"";  // No, this shouldn't match

My understanding is that the built-in \p{IsLatin} property simply detects whether any of the characters are Latin. I want to detect whether any of the characters are not Latin.

Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");

I think you actually want to check if a string is ASCII, `Pattern LatinPattern = Pattern.compile("\\p{ASCII}");`. See https://ideone.com/FeTIiT — Wiktor Stribiżew, Jan 07 '21 at 22:17
Yeah, maybe ASCII is what I mean. The pattern you posted tests if any characters are ASCII. I'm interested in detecting if any characters are non ASCII. — XtevensChannel, Jan 07 '21 at 22:23
The question contradicts itself. It says that `"abcDE 123"` should match, but then complains that `\p{IsLatin}` matches and says you "want to detect if any characters are not Latin". So which is it? Do you want the regex to match or not to match, when string contains non-latin characters? Please **edit** the question and clarify what result you expect from the regex. — Andreas, Jan 07 '21 at 22:34
*FYI:* `\p{IsLatin}` matches Latin characters. The reverse test `\P{IsLatin}`, using uppercase `P`, matches anything that `\p{IsLatin}` doesn't, i.e. all non-Latin characters. `\P{IsLatin}` is the same as `[^\p{IsLatin}]`. — Andreas, Jan 07 '21 at 22:36
I'm expecting the regex to match if all characters are Latin characters, and to not match if any characters are non Latin characters. Is that possible? — XtevensChannel, Jan 07 '21 at 22:38
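
A minimal sketch of the `\P{IsLatin}` negation mentioned in the comments above. The class and the helper `containsNonLatin` are hypothetical names, not from the thread. It suggests why the negated script test alone does not give the results asked for: spaces and digits belong to the Unicode script "Common", so they count as non-Latin too.

```java
import java.util.regex.Pattern;

public class NonLatinFind {
    // Uppercase P negates the property: one character whose script is not Latin.
    static final Pattern NON_LATIN = Pattern.compile("\\P{IsLatin}");

    // Hypothetical helper: true if at least one character is outside the Latin script.
    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonLatin("abcDE"));      // false: letters only
        System.out.println(containsNonLatin("abcDE 123"));  // true: space and digits are script "Common"
        // Katakana sample from the question, written with escapes
        System.out.println(containsNonLatin("\u30D9\u30D3\u30FC\u30C9\u30E9")); // true
    }
}
```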

Andreas · Accepted Answer · 2021-01-07T23:14:10.117

TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$

You want a regex that matches if the string consists of:

Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")

Easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:

\p{Print} - A printable character: [\p{Graph}\x20]
- \p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
  - \p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
    - \p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
      - \p{Lower} - A lower-case alphabetic character: [a-z]
      - \p{Upper} - An upper-case alphabetic character: [A-Z]
    - \p{Digit} - A decimal digit: [0-9]
  - \p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- \x20 - A space character

Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.

The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.
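
The claimed equivalence between \p{Print} and [\p{ASCII}&&\P{Cntrl}] can be checked by brute force. This is only a sketch, assuming the default Pattern flags (no UNICODE_CHARACTER_CLASS); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class PrintClassCheck {
    static final Pattern PRINT = Pattern.compile("\\p{Print}");
    static final Pattern ASCII_NOT_CNTRL = Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");

    // Counts the code points on which the two character classes disagree.
    static int countDifferences() {
        int differences = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            String s = new String(Character.toChars(cp));
            boolean inPrint = PRINT.matcher(s).matches();
            boolean inAsciiNotCntrl = ASCII_NOT_CNTRL.matcher(s).matches();
            if (inPrint != inAsciiNotCntrl) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println("differences: " + countDifferences());
    }
}
```

If the two classes are equivalent, as stated above, the count is 0.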

So, regex is: ^[\p{Print}\p{IsLatin}]*$

Test

Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");

String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "" };
for (String input : inputs) {
    System.out.print("\"" + input + "\": ");
    Matcher matcher = latinPattern.matcher(input);
    if (! matcher.find()) {
        System.out.println("is NON latin");
    } else {
        System.out.println("is latin");
    }
}

Output

"abcDE 123": is latin
"!@#$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"": is NON latin

wait...you sure the emojis are non latin? Seriously, this answer is beautiful. — aran, Jan 07 '21 at 23:13
@aran Yeah, I'm sure, and both OP and the Unicode "script" property agree too. — Andreas, Jan 07 '21 at 23:16
@aran "Input Symbol For Latin Capital Letters" ([U+1F520](https://www.compart.com/en/unicode/U+1F520)) is not a ["Latin" Script](https://www.compart.com/en/unicode/scripts/Latn) character. — Andreas, Jan 07 '21 at 23:26

score 1 · Answer 2 · answered Jan 07 '21 at 23:52

The Latin Unicode blocks in the range U+0000–U+024F are:

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F

So, the answer is either

Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F

Note that the underscores are removed from the Unicode block names in the Java patterns.
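
Block-based and script-based tests do not accept exactly the same characters. The sketch below compares this block pattern with the `\p{Print}`/`\p{IsLatin}` character class on three sample characters. The class name, method names, and sample characters are illustrative choices, not from the thread.

```java
import java.util.regex.Pattern;

public class ScriptVsBlock {
    // Script-based class: printable ASCII plus anything in the Latin script.
    static final Pattern BY_SCRIPT = Pattern.compile("[\\p{Print}\\p{IsLatin}]*");
    // Block-based class: the four blocks covering U+0000 to U+024F.
    static final Pattern BY_BLOCK = Pattern.compile(
            "[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+");

    static boolean byScript(String s) { return BY_SCRIPT.matcher(s).matches(); }
    static boolean byBlock(String s)  { return BY_BLOCK.matcher(s).matches(); }

    public static void main(String[] args) {
        // U+00E0 a with grave, U+00F7 division sign, U+1E01 a with ring below
        int[] samples = { 0x00E0, 0x00F7, 0x1E01 };
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X script=%s block=%s byScript=%b byBlock=%b%n",
                    cp, Character.UnicodeScript.of(cp), Character.UnicodeBlock.of(cp),
                    byScript(s), byBlock(s));
        }
    }
}
```

The division sign sits in the Latin-1 Supplement block but has script "Common", so only the block pattern accepts it. U+1E01 is a Latin-script letter in the Latin Extended Additional block, beyond U+024F, so only the script pattern accepts it. Which behavior is preferable depends on what "Latin" should mean for the input.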

See the Java demo:

List<String> strs = Arrays.asList(
        "abcDE 123",  // Yes, this should match
        "!@#$%^&*",   // Yes, this should match
        "aaàààäää",   // Yes, this should match
        "ベビードラ",  // No, this shouldn't match
        "");          // No, this shouldn't match
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
    Matcher matcher = LatinPattern.matcher(str);
    if (!matcher.find()) {
        System.out.println(str + " => is NON Latin");
        //return;
    } else {
        System.out.println(str + " => is Latin");
    }
}

Note: if you replace .find() with .matches(), you can throw away ^ and $ in the pattern.
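
As a sketch of that note, here is the range-based pattern without anchors, relying on matches() to require that the whole input matches. The class name and the helper `isLatin` are hypothetical.

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    // No ^ or $ here: matches() only succeeds if the entire input matches.
    static final Pattern LATIN_RANGE = Pattern.compile("[\\x00-\\x{024F}]+");

    // Hypothetical helper: true if every character lies in U+0000 to U+024F.
    static boolean isLatin(String s) {
        return LATIN_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin("abcDE 123"));   // true
        System.out.println(isLatin("!@#$%^&*"));    // true
        // "aa" followed by a-grave and a-diaeresis letters, written with escapes
        System.out.println(isLatin("aa\u00E0\u00E0\u00E0\u00E4\u00E4\u00E4")); // true
        // Katakana sample from the question
        System.out.println(isLatin("\u30D9\u30D3\u30FC\u30C9\u30E9"));         // false
    }
}
```

With find() and no anchors, the same pattern would report a match as soon as any single character in the range appears, which is the behavior the question was trying to avoid.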

Output:

abcDE 123 => is Latin
!@#$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
 => is NON Latin
