Issues with UTF8 text decoding

Question

I’m working on extracting the content of the title tag from web pages. The problem is that some titles need to be UTF-8 decoded once to display correctly, while others need it twice. Two examples are the title of http://nestekaltimontie.com/ and the title of http://www.pizzaexpresscafe.fi/: the first needs decoding twice and the second needs it once. My question is: how do I know how many times I need to apply UTF-8 decoding for the text to display correctly? Or is there a proper way to display the title text of both websites correctly? I have tried several of the methods mentioned on Stack Overflow for decoding and encoding, such as Encoding::toUTF8(), mb_internal_encoding("UTF-8"), iconv, and utf8_encode, but none of them work with my examples. My code to extract the title is as follows:

mb_internal_encoding("UTF-8");
require_once("simple_html_dom.php");

function gettitle($link)
{
    // file_get_html() is provided by simple_html_dom.php
    $html = file_get_html($link);
    $dom  = new DOMDocument;
    $dom->loadHTML($html);
    var_dump($dom);
    $xpath   = new DOMXPath($dom);
    $entries = $xpath->query('//html/head/title');
    $title   = '';
    // If several title nodes match, the last one wins
    foreach ($entries as $entry) {
        $title = $entry->nodeValue;
    }
    echo utf8_decode($title);
}

Are you sure you want to use ISO-8859-1 characters in a UTF-8 encoded document? — Droa, Aug 25 '14 at 09:01
@Droa, although the charset of Pizzamaster is ISO-8859-1, I got UTF-8 when using mb_detect_encoding(). The problem is that the websites whose titles I need to extract are a mix of both, and I don't know the proper way to handle this. I'm not a PHP expert, so I need some help and advice. — Nan, Aug 25 '14 at 09:10
Well, from what I can see, you told the document to treat all input as UTF-8, and then further down you take a UTF-8 string and convert it to ISO-8859-1 (Latin-1). This should be OK if your output is set to ISO-8859-1; however, all other text on your page will show up as weird symbols. — Droa, Aug 25 '14 at 09:13
As far as I know, mb_internal_encoding does not work across imported documents; it might be because you are using an unconverted ISO-8859-1 string in your simple_html_dom.php document. — Droa, Aug 25 '14 at 09:20
The above code works just fine for the Pizza Express website, but for the Neste website it displays (Neste Kaltimontie, TolosenmÃƒÂ¤ki ja Sortavalankatu) unless I double decode with utf8_decode(utf8_decode($title));. Is there any way to handle both cases? What I need is to display both titles correctly. Thanks! — Nan, Aug 25 '14 at 09:28
You could try using mb_detect_encoding on $title; it will return the encoding name of your variable. However, it sounds like the $title variable is not UTF-8 at all. — Droa, Aug 25 '14 at 09:50
Tried it again just now, both gave UTF-8 with detect encoding! — Nan, Aug 25 '14 at 10:07
This will help convert your text to UTF-8, even if it is already UTF-8, without it becoming garbage text: https://github.com/neitanod/forceutf8 — Wajdy Essam, Aug 25 '14 at 12:43
@Wajdy, I tried this one but it doesn't work either. Both force and fix give incorrect output for the Neste website. I need to apply string similarity matching after extraction, so having the title in its correct form is mandatory for correct matching results. — Nan, Aug 26 '14 at 07:44
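
The "decode once or twice" symptom discussed in the comments can be approached by undoing one encoding layer at a time and stopping as soon as a further step would no longer produce valid UTF-8. The following is only a minimal sketch under assumptions: fix_double_utf8() and its $maxRounds parameter are hypothetical names, not part of simple_html_dom or forceutf8; the extra layers are assumed to come from text being misread as ISO-8859-1 (a Windows-1252 misreading would need a different mapping); and the mbstring extension is assumed to be available. It uses mb_convert_encoding() in place of utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not from any library): removes extra UTF-8 encoding
// layers, assuming each layer came from misreading UTF-8 bytes as ISO-8859-1.
function fix_double_utf8(string $text, int $maxRounds = 3): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Pure ASCII cannot be double-encoded, so there is nothing to undo.
        if (!preg_match('/[\x80-\xFF]/', $text)) {
            break;
        }
        // Undo one layer: map each UTF-8 code point back to a single byte.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Re-apply the layer to verify that nothing was lost on the way.
        $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');
        // Stop if the step was lossy or the result is no longer valid UTF-8;
        // in that case $text is taken to be the correctly encoded string.
        if ($roundTrip !== $text || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

// Usage: "Tolosenmäki" with one extra encoding layer, written as raw bytes.
$doubleEncoded = "Tolosenm\xC3\x83\xC2\xA4ki";
echo fix_double_utf8($doubleEncoded), PHP_EOL;
```

With this approach, correctly encoded titles pass through unchanged and over-encoded ones are unwrapped, so the caller does not need to know the number of layers in advance. It remains a heuristic: a title that legitimately contains a sequence such as "Ã¤" would be altered as well.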

0 Answers