I have a string like this:

sample İletişim form:: aşağıdaki formu

My aim is to extract the words that contain a Unicode/non-ASCII character, using PHP's preg_match or preg_match_all.

So I'm expecting a result of two words only, İletişim and aşağıdaki:

Array
(
    [0] => İletişim 
    [1] => aşağıdaki
)

I can't come up with the regular expression myself, as I'm not good at them. Any help is welcome.

Thank you so much.

Kenneth P.

2 Answers

1

I think a starting point for the solution you want is here: How do I detect non-ASCII characters in a string?

Using preg_match_all(), you could do something like this:

preg_match_all('/[^\s]*[^\x20-\x7f]+[^\s]*/', $string, $matches);
print_r($matches);
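
One caveat about the pattern above: `[^\x20-\x7f]` also matches ASCII control characters such as tabs and newlines, and without the `u` modifier the pattern works on bytes rather than on characters. A minimal sketch of a stricter variant, assuming the input is valid UTF-8 (the function name `wordsWithNonAscii` is made up for illustration and is not part of the original answer):

```php
<?php
// Hypothetical helper, for illustration only.
// Returns every whitespace-delimited token that contains at least one
// character outside the ASCII range. Assumes $string is valid UTF-8.
function wordsWithNonAscii(string $string): array
{
    // \S*            any run of non-whitespace characters
    // [^\x00-\x7F]   one character outside ASCII (u modifier: UTF-8 aware)
    // \S*            the rest of the token
    preg_match_all('/\S*[^\x00-\x7F]\S*/u', $string, $matches);

    // Fall back to an empty list if the match failed (e.g. invalid UTF-8).
    return $matches[0] ?? [];
}

print_r(wordsWithNonAscii('sample İletişim form:: aşağıdaki formu'));
```

For the sample string this should return only İletişim and aşağıdaki. Tokens are split on whitespace here, so punctuation attached to a word (as in `form::`) would be kept with it.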

Or, without a regular expression, you can use the function mb_detect_encoding() to test whether each word is pure ASCII. In your case, you could use it this way:

$matches = array_filter(explode(' ', $string), function($item) {
    return !mb_detect_encoding($item, 'ASCII', TRUE);
});
print_r($matches);

But the last one is a bit of a hack ^^

Lebugg
  • I tested the code and it returns only the non-ASCII characters, not the whole word that contains them. It may still be a step toward what I want. Thank you anyway – Kenneth P. Jun 05 '13 at 09:53
  • I found one that works well. Try it with preg_match_all(): `'/[^\s]*[^\x20-\x7f]+[^\s]*/'`; – Lebugg Jun 05 '13 at 10:05
1

You can use Unicode properties:

$string = 'sample İletişim form:: aşağıdaki formu';
preg_match_all("/(\pL+)/u", $string, $matches); 
print_r($matches);

output:

Array
(
    [0] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

    [1] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

)
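
If only the words that contain a non-ASCII letter are wanted, the matches from `\pL+` could be filtered afterwards. A minimal sketch, assuming UTF-8 input; `nonAsciiWords` is a hypothetical name chosen for illustration:

```php
<?php
// Hypothetical helper, for illustration only.
// Assumes $string is valid UTF-8.
function nonAsciiWords(string $string): array
{
    // Collect every run of Unicode letters.
    preg_match_all('/\pL+/u', $string, $matches);

    // Keep only the words that contain a byte outside the ASCII range.
    // In UTF-8, every non-ASCII character is encoded with bytes >= 0x80,
    // so a byte-wise check (no u modifier) is enough here.
    $words = array_filter($matches[0] ?? [], function ($word) {
        return preg_match('/[^\x00-\x7F]/', $word) === 1;
    });

    // Re-index so the keys run 0, 1, 2, ...
    return array_values($words);
}

print_r(nonAsciiWords('sample İletişim form:: aşağıdaki formu'));
```

For the sample string this should leave just İletişim and aşağıdaki. Because `\pL+` matches letters only, digits and punctuation are never part of a returned word.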
Toto
  • This one also extracts the other words, which don't contain any non-ASCII characters. But thanks for contributing. Appreciate it :) – Kenneth P. Jun 05 '13 at 12:26