
I have some text files that contain non-ASCII characters. I want to remove them but keep the formatting characters, such as newlines.

I tried

$description = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $description);

However, that appeared to strip out newlines and other formatting. It also had problems with some Hebrew text, converting this

משפטים נוספים מהמומחה. נסו ותהנו! חג חנוכה שמח **************************************** חדש - האפליקציה היחידה שאומרת לך מה מצב הסוללה שלך ** NEW to version 1.1 - the expert talks!!! *

to this

1.4 :", ..."" ..."" 50 ..." . , . ! **************************************** - ** NEW to version 1.1 - the expert talks!!! *

kitenski

2 Answers


That's not replacing non-ASCII characters. ASCII characters are in the range 0-127, and your pattern also removes \x00-\x1F, the control characters, which include newlines and tabs. So basically what you're trying to do is write a regex to convert one character set to another, which is a lot harder than just stripping out some of the characters.

As for what you want to do, I think you want the iconv function... You'll need to know the input encoding, but once you do you can then tell it to ignore non-representable characters:

$text = iconv('UTF-8', 'ASCII//IGNORE', $text);

You could also use ISO-8859-1, or any other target character set you want.
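As a rough sketch of the iconv approach (the sample string is made up for illustration, and how //IGNORE behaves depends on the iconv implementation PHP was built against, so the result is checked for false rather than assumed):

```php
<?php
// Hypothetical sample: a Hebrew line followed by an English line, both UTF-8.
$text = "\u{05D7}\u{05D2} \u{05E9}\u{05DE}\u{05D7}\nNEW to version 1.1\n";

// //IGNORE asks iconv to drop characters that have no ASCII equivalent.
// On some platforms iconv may still raise a notice and return false.
$ascii = iconv('UTF-8', 'ASCII//IGNORE', $text);

if ($ascii === false) {
    echo "iconv could not convert the string on this platform\n";
} else {
    // Newlines are ASCII (0x0A), so they pass through the conversion.
    var_dump($ascii);
}
```

If the call does return false on your system, a regex with the u flag is an alternative that does not depend on the iconv library.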

ircmaxell
  • No, he's not trying to convert character sets. He's trying to remove characters outside the ASCII range from a UTF-8 string. Of course, your solution works because ASCII is a subset of UTF-8. With ISO-8859-1, he'd get non-ASCII characters and he could no longer use the string with functions that expect UTF-8. – Artefacto Aug 23 '10 at 17:17

What you're doing won't work because you're treating a UTF-8 string as if it were a single-byte encoding. You are actually removing portions of characters. You must add the u flag to the regular expression to activate UTF-8 mode.
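To illustrate the difference, here is a small sketch using a single Hebrew letter (shin, U+05E9, chosen only as an example). Without the u flag the pattern is matched byte by byte; with it, the pattern is matched per code point:

```php
<?php
// The letter shin (U+05E9), written out as its two UTF-8 bytes.
$letter = "\xD7\xA9";

echo strlen($letter), "\n";   // 2: strlen counts bytes, not characters
echo bin2hex($letter), "\n";  // d7a9

// Without the u flag, [\x80-\xFF] matches each byte separately.
$byteMatches = preg_match_all('/[\x80-\xFF]/', $letter);
echo $byteMatches, "\n";      // 2

// With the u flag, the subject is read as UTF-8: one code point outside ASCII.
$charMatches = preg_match_all('/[^\x{0000}-\x{007F}]/u', $letter);
echo $charMatches, "\n";      // 1
```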

Since you want to leave only the control characters and the other ASCII range characters, you have to replace all the others with ''. So:

$description = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $description);

which gives for your input:

. ! ********************* - * NEW to version 1.1 - the expert talks!!! *
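
A minimal way to check that newlines survive, assuming a made-up two-line sample rather than the original file. Note that with the u flag preg_replace returns null when the input is not valid UTF-8, so that case is handled explicitly:

```php
<?php
// Hypothetical sample: three Hebrew words, then an English line.
$description = "\u{05D7}\u{05D2} \u{05D7}\u{05E0}\u{05D5}\u{05DB}\u{05D4} \u{05E9}\u{05DE}\u{05D7}\n"
    . "** NEW to version 1.1 - the expert talks!!! *\n";

// Remove every code point above U+007F; ASCII, including \n and \t, is kept.
$clean = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $description);

if ($clean === null) {
    // With the u flag, preg_replace returns null if the input is not valid UTF-8.
    echo "Input is not valid UTF-8\n";
} else {
    var_dump($clean); // the two spaces between the Hebrew words and both newlines remain
}
```

If you get null here, the file is probably not UTF-8, and the input encoding needs to be sorted out first.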
Artefacto
  • thanks, but when I just tried that it gave me this as the output 1.4 : ", ..." " ..." " 50 ..." . , . ! **************************************** - ** NEW to version 1.1 – kitenski Aug 23 '10 at 21:23