
I have read some other questions and tried their answers, but none of them worked. What I get is, for example, this:

Μήπως θα έπρεπε να � ...

and I can't remove that weird question mark. I am fetching the content of an RSS feed that is declared as <?xml version="1.0" encoding="UTF-8"?>, and its content is in Greek.

Is there any way to fix this?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<div><?php
    $entry->description = strip_tags($entry->description);
    echo mb_substr($entry->description, 0, 490);
?> ...</div>
EnexoOnoma
  • What is `$entry`? Could the issue be the encoding used to store the description text? – Abdullah Jibaly Jul 10 '11 at 05:20
  • I have updated my question. It gets the content of a feed. – EnexoOnoma Jul 10 '11 at 05:23
  • The "funny question mark" is a real character, called the REPLACEMENT CHARACTER. It probably got added to the data because the stream from your feed was not legal UTF-8, that is, it could not be decoded. Can you show us the content of the string $entry, as Abdullah suggests? Preferably as a byte sequence, not a char sequence? And are you sure the original feed data was encoded in UTF-8? – Ray Toal Jul 10 '11 at 05:23
  • Do you get the same encoding error if you don't use `mb_substr`? – Abdullah Jibaly Jul 10 '11 at 05:25
  • When I echo it without mb_substr I don't get the question mark. This is a feed I use: http://feeds.feedburner.com/blogspot/hyMBI – EnexoOnoma Jul 10 '11 at 05:38

3 Answers


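This is the answer: pass the encoding explicitly as the fourth argument, so that mb_substr counts UTF-8 characters instead of falling back to the internal encoding: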

mb_substr($entry->description, 0, 490, "UTF-8");
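To see why the explicit argument matters, here is a minimal sketch. It assumes the script file is saved as UTF-8 and uses a hypothetical `$description` string in place of `$entry->description`. Passing 'ISO-8859-1' imitates a setup where the internal encoding is a single-byte one, which is presumably what happened in the question:

```php
<?php
// Hypothetical sample text standing in for $entry->description.
$description = 'Μήπως θα έπρεπε να';

// With a single-byte encoding, mb_substr counts bytes, so the cut can land
// in the middle of a two-byte Greek character.
$byteCut = mb_substr($description, 0, 9, 'ISO-8859-1');

// With the real encoding named, mb_substr counts characters.
$charCut = mb_substr($description, 0, 9, 'UTF-8');

// A string cut mid-character is no longer valid UTF-8; the browser shows
// the broken bytes as the replacement character.
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump(mb_check_encoding($charCut, 'UTF-8'));
```

Calling mb_internal_encoding('UTF-8') once at the top of the script should have the same effect for every later mb_* call that omits the encoding argument.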
EnexoOnoma

I believe the issue is with your encoding. You're outputting UTF-8, but the browser cannot interpret one of the byte sequences. The question mark symbol, as I have known it in the past, is actually generated by the browser, so there is nothing to search and replace. It's about fixing your encoding or removing the undecodable characters from the string before outputting it.

If you have access to the source of the data, you may want to check the DB settings to make sure it's encoded properly. If not, you'll have to find some way to convert the data using PHP, which is not an easy task.

Perhaps:

mb_convert_encoding($string, "UTF-8");
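With only two arguments, mb_convert_encoding converts from the internal encoding, so it may be safer to name the source encoding as the third argument. A rough sketch, assuming the data arrives as ISO-8859-7 (a hypothetical choice for illustration; the real source encoding has to be checked) and that the script file is saved as UTF-8:

```php
<?php
// Hypothetical input: Greek text stored as ISO-8859-7 (one byte per letter).
// It is built from a UTF-8 literal here so the example is self-contained.
$utf8Original = 'ΕΛΛΑΔΑ';
$legacyBytes  = mb_convert_encoding($utf8Original, 'ISO-8859-7', 'UTF-8');

// Convert to UTF-8, stating the source encoding explicitly instead of
// relying on the internal-encoding default.
$converted = mb_convert_encoding($legacyBytes, 'UTF-8', 'ISO-8859-7');

var_dump(strlen($legacyBytes));          // byte length in ISO-8859-7
var_dump(strlen($converted));            // byte length in UTF-8
var_dump($converted === $utf8Original);  // round trip preserved the text
```

If the source encoding is guessed wrong, the conversion still succeeds but produces the wrong characters, so this only helps once the real encoding of the data is known.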
espradley
  • +1 Looks like you pointed the OP in the right direction with the "UTF-8" argument; not sure why someone would downvote this. – Abdullah Jibaly Jul 11 '11 at 04:33
  • Thank you espradley. If I could upvote this 7000 times, I would. I have escaped charset jail. This works for fixing things at template level. – Tom Feb 26 '15 at 19:02

Have you tried these seemingly redundant multibyte-safe string functions, which are not in the PHP core?

http://code.google.com/p/mbfunctions/

It appears they offer an mb_strip_tags() function like this:

// Note: mb_preg_replace() is not a PHP built-in; it is assumed here to be
// provided by the same mbfunctions library.
if (! function_exists('mb_strip_tags'))
{
   function mb_strip_tags($document, $repl = '') {
      $search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
                      '@<[\/\!]*?[^<>]*?>@si',           // Strip out HTML tags
                      '@<style[^>]*?>.*?</style>@siU',   // Strip style tags properly
                      '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
      );
      // Apply every pattern in order, replacing each match with $repl
      $text = mb_preg_replace($search, $repl, $document);
      return $text;
   }
}
AlienWebguy