0

I'm trying to fetch this page with cURL and save the result as an HTML page. I used this code:

        $url = "https://web.archive.org/web/20160202021236/http://www.mpshopfashion.com";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects such as 301
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0');
        $html = curl_exec($ch); // raw bytes, still in the server's encoding

The HTML page that is created looks correct when I open it in a browser, but when I open it in an editor, I see text like this:

à¤Ã×èͧ»ÃдѺῪÑè¹ à¤Ã×èͧ»ÃдѺῪÑè¹à¡ÒËÅÕ ÊÃéÍÂ¤Í ÊÃéÍ¢éÍÁ×Í µèÒ§ËÙ ¢Ò»ÅÕ¡-¢ÒÂÊè§

Instead of this:

เครื่องประดับแฟชั่น เครื่องประดับแฟชั่นเกาหลี สร้อยคอ สร้อยข้อมือ ต่างหู ขายปลีก-ขายส่ง
mohamed

2 Answers

1

Web sites typically declare their encoding in HTTP headers. Please note Content-Type in this screenshot from Firefox Developer Tools:

Firefox Developer Tools

TIS-620 is apparently a common legacy encoding used in Thailand (although UTF-8 has by now largely superseded such legacy encodings).

Your editor should have a setting to select the encoding, as well as access to the appropriate fonts and, of course, support for that specific encoding. Here's a screenshot from RJ TextEd:

RJ TextEd

As a fallback option (after all, HTTP headers do not exist outside HTTP), HTML provides <meta> tags as an alternative way to declare the encoding:

<meta http-equiv="Content-Type" content="text/html; charset=windows-874"/>

In this case, we can see it doesn't even match the HTTP headers.

Once more, it's up to the specific editor you are using (which you haven't disclosed) whether it implements logic to check meta tags and identify the encoding. There's simply no universal, one-size-fits-all solution that works automagically in every editor.
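
As a rough illustration of the header-first, meta-tag-second lookup described above, here is a minimal PHP sketch. The helper names `charsetFromContentType()` and `charsetFromMeta()` are made up for this example, and the `TIS-620` header value is an assumption based on the answer text, since the screenshot itself is not reproduced here. In a real cURL script, the header value could come from `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type value.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: fall back to a <meta> declaration in the markup.
// Matches both <meta charset="..."> and the http-equiv form.
function charsetFromMeta(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'\/>;]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Assumed sample values standing in for a real response.
$header = 'text/html; charset=TIS-620';
$html   = '<meta http-equiv="Content-Type" content="text/html; charset=windows-874"/>';

// Prefer the HTTP header and only fall back to the <meta> tag.
$charset = charsetFromContentType($header) ?? charsetFromMeta($html);
echo $charset, PHP_EOL;
```

The header is checked first because an HTTP-level charset takes precedence over a `<meta>` declaration when a browser decodes a page.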

Álvaro González
  • Is there a way to open the HTML and find the correct encoding? I need to be able to parse this HTML file and retrieve all the hrefs so I can parse them as well. Example of an href: https://web.archive.org/web/20160201173728/http://www.mpshopfashion.com/shop/14/สร้อยคอแฟชั่น.html – mohamed Jun 19 '17 at 10:46
  • I've edited my answer. But if you are seeking a solution that works with all text editors in the world, you're out of luck. When IT pioneers invented files and file systems in the mid-20th century, computers were little more than big calculators and Unicode was years away from being invented. – Álvaro González Jun 19 '17 at 11:10
  • Thanks for your help. Is there a way to convert à¤Ã×èͧ»ÃдѺῪÑè¹ à¤Ã×èͧ»ÃдѺῪÑè¹à¡ÒËÅÕ ÊÃéÍÂ¤Í ÊÃéÍ¢éÍÁ×Í µèÒ§ËÙ ¢Ò»ÅÕ¡-¢ÒÂÊè§ to เครื่องประดับแฟชั่น เครื่องประดับแฟชั่นเกาหลี สร้อยคอ สร้อยข้อมือ ต่างหู ขายปลีก-ขายส่ง with php? – mohamed Jun 19 '17 at 11:17
  • Have you signed a non-disclosure agreement that prevents you from telling us what your text editor is? – Álvaro González Jun 19 '17 at 11:18
  • It's Notepad++, but the problem for me is that I want to get the correct encoding without telling the editor which encoding I'm using. That means I need to fix it with PHP. I think I should use the function `mb_convert_encoding()`, but I'm not sure about the parameters. – mohamed Jun 19 '17 at 11:34
  • Please read the manual page for `mb_convert_encoding()` first; you'll learn it's basically useless. The reason is that a computer cannot read letters; it only reads binary data. How can you look at `0001010010` and decide whether it's an `ñ` in encoding A or an `ü` in encoding B? Plus, how do you plan to integrate PHP with your editor? – Álvaro González Jun 19 '17 at 11:37
  • I don't plan to integrate PHP with my editor. I want to apply the correct encoding with PHP (just after scraping the page) before writing the result to the HTML file that I will create. When I copy and paste the text "เครื่องประดับแฟชั่น" into the editor, the encoding doesn't change, and I want to see the same result after I write the file with PHP. I'm just asking which encoding I should apply with PHP to get the correct result. – mohamed Jun 19 '17 at 11:51
  • First, my apologies: I misread `mb_convert_encoding()`. It's a perfectly valid tool to convert between encodings (many people want to use `mb_detect_encoding()` to auto-detect the encoding, something the function does not really do). Secondly, if you want to normalise to a common encoding, UTF-8 is the only sensible one in 2017, though if you absolutely need to boost editor auto-detection and bandwidth/disk is not an issue, you could play with UTF-16 with BOM. Last but not least, at this point I honestly don't think I've understood your question, so I hope there's at least something useful in my answer. – Álvaro González Jun 19 '17 at 12:03
  • Thanks for your help, and sorry if I was not clear (maybe because my English is not perfect). What I need is to be able to convert à¤Ã×èͧ»ÃдѺῪÑè¹ à¤Ã×èͧ»ÃдѺῪÑè¹à¡ÒËÅÕ ÊÃéÍÂ¤Í ÊÃéÍ¢éÍÁ×Í µèÒ§ËÙ ¢Ò»ÅÕ¡-¢ÒÂÊè§ to เครื่องประดับแฟชั่น เครื่องประดับแฟชั่นเกาหลี สร้อยคอ สร้อยข้อมือ ต่างหู ขายปลีก-ขายส่ง using PHP. I was able to do that for other pages in Chinese, Hebrew and Arabic, but I'm having problems with Thai. Thanks again – mohamed Jun 19 '17 at 18:12
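
Regarding the conversion asked about in the comments above, here is a minimal sketch. It assumes the page really is TIS-620, as the HTTP header suggests, and that the `iconv` extension is available with TIS-620 support (this depends on the underlying iconv library). It uses `iconv()` rather than `mb_convert_encoding()` because mbstring's list of supported encodings does not appear to include TIS-620; that is worth checking with `mb_list_encodings()` on your own installation. The `$raw` bytes are a hand-written stand-in for what `curl_exec()` would return: they are the seven bytes behind the first garbled word in the question.

```php
<?php
// Stand-in for the scraped response: the TIS-620 bytes of the first
// Thai word in the question (assumed sample, not fetched from the site).
$raw = "\xE0\xA4\xC3\xD7\xE8\xCD\xA7";

// iconv() returns false when it cannot convert, so check the result.
$utf8 = iconv('TIS-620', 'UTF-8', $raw);
if ($utf8 === false) {
    throw new RuntimeException('iconv could not convert from TIS-620');
}

echo $utf8, PHP_EOL;
```

In the original script this would amount to running `$html` through `iconv()` before writing the file. The `<meta>` tag inside the saved markup would still declare `windows-874`, so it may need to be changed to `UTF-8` as well.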
0

It's probably caused by bad encoding settings on the website, or even in the curl request. How about using a wrapper for curl, which is really hard to configure correctly?

I can recommend using Guzzle for this.

https://github.com/guzzle/guzzle