How can I use Java Regex for Turkish characters to UTF-8?

Question

I'm trying to do regex operations in Java, but I'm having trouble when I search in Turkish text. For example:

Search Text = "Ahmet Yıldırım" or "Esin AYDEMİR" 

// The search term comes from part of an e-mail address (e.g. yildirim@example.com); I try to find it in the name.
Regex Strings = "yildirim" or "aydemir".

The searched text changes dynamically, so how can I solve this with a Java regex pattern? Alternatively, how do I convert Turkish characters (e.g. AYDEMİR to AYDEMIR, or Yıldırım to Yildirim)?

Sorry about my grammar mistakes!

OK, but how do I convert "yildirim" to "y[iı]ld[ıi]r[ıi]m" dynamically? For any text, how can I detect these characters ("İÖÜŞÇĞıöüşçğ") and convert them to ("IOUSCGiouscg")? – Junior Develepor, Aug 20 '15 at 12:21

nhahtdh · Accepted Answer · 2015-08-20T13:16:51.387

10

Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags:

Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

Demo on ideone

By default, Pattern.CASE_INSENSITIVE only matches case-insensitively for characters in the US-ASCII character set. Pattern.UNICODE_CASE modifies this behavior so that case-insensitive matching applies to all Unicode characters.

Do note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive manner. Therefore, ı, i, I, İ are considered the same character.
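As a rough sketch of the point above, the following compares the two flag combinations on the names from the question. The class and method names are made up for illustration, and the Turkish letters are written as Unicode escapes only so that the source file's encoding does not matter.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical.
final class TurkishCaseInsensitiveDemo {

    // Case-insensitive for all Unicode letters (culture-insensitive folding).
    static boolean findUnicode(String text, String term) {
        Pattern p = Pattern.compile(term, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(text).find();
    }

    // Case-insensitive for US-ASCII letters only (the default behavior).
    static boolean findAsciiOnly(String text, String term) {
        Pattern p = Pattern.compile(term, Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // "Ahmet Yildirim" spelled with dotless i (U+0131)
        String name1 = "Ahmet Y\u0131ld\u0131r\u0131m";
        // "Esin AYDEMIR" spelled with capital dotted I (U+0130)
        String name2 = "Esin AYDEM\u0130R";

        System.out.println(findUnicode(name1, "yildirim"));
        System.out.println(findUnicode(name2, "aydemir"));
        System.out.println(findAsciiOnly(name1, "yildirim"));
    }
}
```

With both flags set, the dotted and dotless forms of i are folded together, so the ASCII search terms find the Turkish spellings; with CASE_INSENSITIVE alone they do not.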

Depending on your use case, you might want to use Pattern.LITERAL to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
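Since the search term in the question changes dynamically, quoting it is probably the safer default. A minimal sketch, assuming the whole term should be treated as literal text (the helper name literalIgnoreCase is hypothetical):

```java
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical.
final class QuotedSearchDemo {

    // Builds a pattern that treats `term` as plain text, so characters such as
    // '.' or '+' in user-supplied input are not interpreted as metacharacters.
    static Pattern literalIgnoreCase(String term) {
        return Pattern.compile(Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    public static void main(String[] args) {
        Pattern p = literalIgnoreCase("a.yildirim");
        // "A.Yildirim" spelled with dotless i (U+0131)
        System.out.println(p.matcher("A.Y\u0131ld\u0131r\u0131m").find());
        // The '.' is literal here, so an arbitrary character in its place does not match.
        System.out.println(p.matcher("axyildirim").find());
    }
}
```

Passing Pattern.LITERAL as an extra flag instead of calling Pattern.quote() should give the same result when the entire pattern is literal.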

edited Aug 20 '15 at 13:16

answered Aug 20 '15 at 12:31

nhahtdh

Do you know of a way to achieve the same behavior in .NET? Just curious. – Wiktor Stribiżew Aug 20 '15 at 12:47
@stribizhev: That's a good question. I thought `IgnoreCase | CultureInvariant` would work, but it turns out that it doesn't. You might want to ask a new question? (I also want to know the answer) – nhahtdh Aug 20 '15 at 13:00
@stribizhev: Actually, in .NET, we can solve this problem by setting the appropriate culture (in this case, Turkish) - though it means that you must at least know the language of the input before you process it. – nhahtdh Aug 20 '15 at 13:07
I also checked `CultureInvariant` flag at first. I will study this when I have some time. – Wiktor Stribiżew Aug 20 '15 at 13:18

score 8 · Answer 2 · edited May 23 '17 at 12:25

The question in your comment is more complicated than the original one.

You can use

string=Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");

to convert "İÖÜŞÇĞıöüşçğ" to "IOUSCGıouscg", which is already sufficient for a case-insensitive match, as pointed out by nhahtdh. If you want to perform a case-sensitive match, you have to add a .replace('ı', 'i') so that ı matches i.
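Put together, the conversion might look like the sketch below. The class and method names are hypothetical, and the Turkish letters are written as Unicode escapes so that the result does not depend on the source file's encoding.

```java
import java.text.Normalizer;

// Illustrative sketch only; class and method names are hypothetical.
final class TurkishFoldDemo {

    // Decomposes each letter into base letter + combining marks (NFD),
    // then removes the combining marks (general category Mn).
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    }

    // Dotless i (U+0131) has no decomposition, so it is replaced explicitly.
    static String toAsciiLetters(String s) {
        return stripMarks(s).replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        // The twelve letters from the question, in the same order
        String turkish = "\u0130\u00D6\u00DC\u015E\u00C7\u011E\u0131\u00F6\u00FC\u015F\u00E7\u011F";
        System.out.println(toAsciiLetters(turkish));                    // IOUSCGiouscg
        System.out.println(toAsciiLetters("AYDEM\u0130R"));             // AYDEMIR
        System.out.println(toAsciiLetters("Y\u0131ld\u0131r\u0131m"));  // Yildirim
    }
}
```

Note that this strips combining marks from every script, not only Turkish letters, which may or may not be what you want for other input.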

edited May 23 '17 at 12:25

Community

answered Aug 20 '15 at 12:55

Holger

Thanks for your advice. I think I had a little trouble explaining my problem, but I did solve it with this answer. – Junior Develepor Aug 20 '15 at 13:08
@Holger: How do you get a link to a comment? Thanks. – Sabuncu Aug 21 '15 at 13:46
@Sabuncu: right-click on the date/time right beside the user name and select “copy link location”. – Holger Aug 21 '15 at 14:31

score 0 · Answer 3 · answered May 30 '19 at 14:43

I am using this pattern.

public static boolean isAlphaNumericWithWhiteSpace(String text) {
    // \p{L} already covers ı and n, so listing them again is redundant but harmless.
    return text != null && text.matches("^[\\p{L}\\p{N}ın\\s]*$");
}

\p{L} matches a single code point in the category "letter".

\p{N} matches any kind of numeric character in any script.

score -1 · Answer 4 · edited Aug 27 '18 at 16:49

A GitHub gist for replacing Turkish characters: https://gist.github.com/onuryilmaz/6034569

In Java, string.matches(".*[İÖÜŞÇĞıöüşçğ].*") will check whether the String contains any of these Turkish characters.
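A containment check like this can also be sketched with Matcher.find(), which looks for the character class anywhere in the input and so needs no surrounding wildcards. The class and method names below are hypothetical; the character class holds the same twelve letters, written as Unicode escapes.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical.
final class TurkishLetterCheck {

    // The twelve letters listed above. Some of them (for example the o and u
    // with diaeresis) also occur in other languages, so this does not detect
    // the Turkish language itself, only the presence of these letters.
    private static final Pattern LISTED_LETTERS = Pattern.compile(
            "[\u0130\u00D6\u00DC\u015E\u00C7\u011E\u0131\u00F6\u00FC\u015F\u00E7\u011F]");

    static boolean containsListedLetter(String s) {
        return s != null && LISTED_LETTERS.matcher(s).find();
    }

    public static void main(String[] args) {
        // "Esin AYDEMIR" spelled with capital dotted I (U+0130)
        System.out.println(containsListedLetter("Esin AYDEM\u0130R"));
        System.out.println(containsListedLetter("yildirim"));
    }
}
```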

edited Aug 27 '18 at 16:49

Robert

answered Aug 27 '18 at 16:00

Srinibas Rao

Can you edit your answer by surrounding the code snippet in back-ticks? That would improve readability. – Jose Aug 27 '18 at 16:38
