I have some XML files with figure spaces in them, and I need to remove those with PHP. The UTF-8 byte sequence for these is e2 80 a9. If I'm not mistaken, PHP does not seem to like multi-byte UTF-8 characters like this one; so far, at least, I've been unable to find a way to delete the figure spaces with functions like preg_replace.

Does anybody have any tips or, even better, a solution to this problem?

Jeroen Beerstra

2 Answers

Have you tried preg_replace('/\x{2007}/u', '', $stringWithFigureSpaces);?

U+2007 is the Unicode code point for the FIGURE SPACE.

Please see my answer on a similar Unicode regex topic with PHP, which includes information about the \x{FFFF} syntax.

Regarding your comment that this does not work: the following works perfectly on my machine:

$ php -a
Interactive shell

php > $str = "a\xe2\x80\x87b";  // \xe2\x80\x87 is the FIGURE SPACE
php > echo preg_replace('/\x{2007}/u', '_', $str); // \x{2007} is the PCRE unicode codepoint notation for the U+2007 codepoint
a_b

What's your PHP version? Are you sure the character is a FIGURE SPACE at all? Can you run the following snippet on your string?

for ($i = 0; $i < strlen($str); $i++) {
    printf('%x ', ord($str[$i]));
}

On my test string this outputs

61 e2 80 87 62
a  |U+2007|  b
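
The byte dump above can be taken one step further by decoding the UTF-8 bytes into code points, which makes it easier to look the character up by name. The following is only a minimal sketch, assuming PHP 7 or later; the helper names `utf8_char_to_codepoint` and `codepoints` are made up for this illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: decode ONE UTF-8 encoded character into its code point.
function utf8_char_to_codepoint(string $ch): int
{
    $bytes = array_values(unpack('C*', $ch));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0]; // plain ASCII
    }
    // The lead byte carries n length-marker bits plus a 0 bit; mask them off.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    // Every continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: list all code points of a UTF-8 string as "U+XXXX".
function codepoints(string $str): array
{
    // The /u modifier makes PCRE split on characters instead of bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $out = [];
    foreach ($chars as $ch) {
        $out[] = sprintf('U+%04X', utf8_char_to_codepoint($ch));
    }
    return $out;
}

echo implode(' ', codepoints("a\xe2\x80\x87b")), "\n"; // U+0061 U+2007 U+0062
```

If the middle entry is anything other than U+2007, the character is not a FIGURE SPACE and the pattern has to be adjusted to the code point that is actually printed.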

EDIT after OP comment:

\xe2\x80\xa9 is a PARAGRAPH SEPARATOR, which is Unicode code point U+2029, so your code should be preg_replace('/\x{2029}/u', '', $stringWithUglyCharacter);
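
If the files might contain both characters, a character class handles them in one pass. This is a rough sketch rather than a definitive solution: the function name `strip_separators` is hypothetical, and it assumes the input is valid UTF-8 (with the /u modifier, preg_replace returns null for invalid input, which the sketch turns into an exception).

```php
<?php
// Hypothetical helper: remove FIGURE SPACE and PARAGRAPH SEPARATOR characters.
function strip_separators(string $str): string
{
    // \x{2007} = FIGURE SPACE, \x{2029} = PARAGRAPH SEPARATOR.
    // The /u modifier switches PCRE to UTF-8 mode so \x{...} matches code points.
    $result = preg_replace('/[\x{2007}\x{2029}]/u', '', $str);
    if ($result === null) {
        // preg_replace returns null on failure, e.g. malformed UTF-8 input.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

echo strip_separators("a\xe2\x80\xa9b\xe2\x80\x87c"), "\n"; // abc
```

Further code points can be added to the character class in the same \x{...} notation.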

Stefan Gehrig

Maybe the mb_convert_encoding function can help.

turbod