I'm working on extracting the content of the title tag from webpages. The problem is that some titles need UTF-8 decoding once to display correctly, while others need it twice! Two examples are the title of (http://nestekaltimontie.com/) and the title of (http://www.pizzaexpresscafe.fi/). The first one needs decoding twice and the second needs decoding only once. My question is: how do I know how many times I need to apply UTF-8 decoding for the text to display correctly? Or is there a proper way to correctly display the title text of both websites? I have tried some of the decoding and encoding methods mentioned on Stack Overflow, such as Encoding::toUTF8(), mb_internal_encoding("UTF-8"), iconv, and utf8_encode, but none of them work with my examples. My code to extract the title is as follows:
mb_internal_encoding("UTF-8");
require_once("simple_html_dom.php");
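function gettitle($link)
{
    // Fetch the page with simple_html_dom, then parse it again with DOMDocument
    $html = file_get_html($link);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    var_dump($dom);
    // Select the <title> element inside <head>
    $xpath = new DOMXPath($dom);
    $entries = $xpath->query('//html/head/title');
    foreach ($entries as $entry) {
        $title = $entry->nodeValue;
    }
    // Apply one round of UTF-8 decoding before printing
    echo utf8_decode($title);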