Extract any Unicode string occurrence within a string using preg_match

Question

I have this kind of string

sample İletişim form:: aşağıdaki formu

What I'm aiming for is to extract every word that contains a Unicode/non-ASCII character, using PHP's preg_match or preg_match_all.

So I'm expecting only two results, the words İletişim and aşağıdaki:

Array
(
    [0] => İletişim 
    [1] => aşağıdaki
)

I just can't come up with the regular expression, as I'm not good at them. Any help is welcome.

Thank you so much.

score 1 · Accepted Answer · edited May 23 '17 at 11:57


I think the beginning of the solution you want is here: How do I detect non-ASCII characters in a string?

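Using preg_match_all(), you could do something like this:

$string = 'sample İletişim form:: aşağıdaki formu';
// Match any run of non-whitespace that contains at least one byte outside \x20-\x7f.
preg_match_all('/[^\s]*[^\x20-\x7f]+[^\s]*/', $string, $matches);
print_r($matches);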
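
The pattern above works on raw bytes. As a hedged variation, assuming the input is valid UTF-8, the `u` modifier makes the character class match whole code points instead. The helper name `extractNonAsciiWords` is hypothetical, not something from the linked question:

```php
<?php
// Hypothetical helper: returns every whitespace-delimited token that
// contains at least one character outside the ASCII range.
function extractNonAsciiWords(string $text): array
{
    // With /u, pattern and subject are treated as UTF-8, so [^\x00-\x7F]
    // matches one whole non-ASCII code point; \S* on each side pulls in
    // the rest of the token.
    if (preg_match_all('/\S*[^\x00-\x7F]\S*/u', $text, $matches) === false) {
        return []; // assumption: treat a regex failure (e.g. invalid UTF-8) as "no matches"
    }
    return $matches[0];
}

print_r(extractNonAsciiWords('sample İletişim form:: aşağıdaki formu'));
```

For the sample string this should give only İletişim and aşağıdaki. Note that punctuation attached to a word is kept, because the token boundary is whitespace.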

Or, without preg_match, you can use the function mb_detect_encoding() to test the encoding of the string. In your case, you could use it this way:

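// Split on spaces and keep only the words that are not pure ASCII.
$matches = array_values(array_filter(explode(' ', $string), function ($item) {
    // In strict mode, mb_detect_encoding() returns false when $item is not valid ASCII.
    return !mb_detect_encoding($item, 'ASCII', true);
}));
print_r($matches);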

But the last one is a bit warped ^^

answered Jun 05 '13 at 09:42 by Lebugg · edited May 23 '17 at 11:57 by Community

I tested the code and it returns only the non-ASCII characters, not the whole word that contains them. It might still be a step toward what I want. Thank you anyway – Kenneth P. Jun 05 '13 at 09:53

I found one that works well. Try it with preg_match_all(): `'/[^\s]*[^\x20-\x7f]+[^\s]*/'` – Lebugg Jun 05 '13 at 10:05

score 1 · Answer 2 · answered Jun 05 '13 at 11:26

You can use Unicode properties:

$string = 'sample İletişim form:: aşağıdaki formu';
preg_match_all("/(\pL+)/u", $string, $matches); 
print_r($matches);

output:

Array
(
    [0] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

    [1] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

)

This one also extracts the other words, which don't contain any non-ASCII character. But thanks for contributing, I appreciate it :) – Kenneth P., Jun 05 '13 at 12:26
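
One possible way to keep the Unicode-property approach while dropping the pure-ASCII words is to require at least one non-ASCII letter inside each match. This is only a sketch under the assumption that the input is valid UTF-8; the helper name `extractNonAsciiLetterWords` is hypothetical:

```php
<?php
// Hypothetical helper: returns the letter-only words that contain
// at least one non-ASCII letter.
function extractNonAsciiLetterWords(string $text): array
{
    // \pL*                leading letters, possibly none
    // (?![\x00-\x7F])\pL  one letter that is not in the ASCII range
    // \pL*                the remaining letters of the word
    $pattern = '/\pL*(?![\x00-\x7F])\pL\pL*/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // assumption: treat a regex failure (e.g. invalid UTF-8) as "no matches"
    }
    return $matches[0];
}

print_r(extractNonAsciiLetterWords('sample İletişim form:: aşağıdaki formu'));
```

For the sample string this should leave only İletişim and aşağıdaki. Because the match is built from `\pL` alone, punctuation and non-letter symbols never end up in the result.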
