How can I detect the encoding of a string which can be Japanese, Chinese or English and convert it to UTF-8 for display?

Question

On a PHP website, I fetch emails over IMAP and save them in a database.

I also want to display some of them. That mailbox receives a lot of English mail, but also Japanese and Chinese.

My problem with the following code is that I can't detect every charset. If I arrange the order of the array so that Chinese characters come out right, the result becomes wrong for other charsets.

<?php
$subject = "板ｲﾃ淌"; // can be japanese
$subject = "这间面积70平"; // can be chinese
$subject = "This string can have latin1 chars also";

// Method of the class that holds the raw subject bytes in $this->object_message.
function get_subject($subject)
{
    // Candidate encodings; mb_detect_encoding() tries them in this order.
    $encs = array();
    $encs[] = "Big5";
    $encs[] = "euc-kr";
    $encs[] = "EUC-CN";
    $encs[] = "GB2312";
    $encs[] = "ISO-8859-1";
    $encs[] = "GBK";
    $encs[] = "CP936";
    $encs[] = "ASCII";
    $encs[] = "JIS";
    $encs[] = "UTF-8";
    $encs[] = "EUC-JP";
    $encs[] = "SJIS";
    $encs[] = "latin1";
    // Guess the encoding from the bytes alone, then convert to UTF-8.
    $encoding = mb_detect_encoding($this->object_message, $encs);
    $subject = mb_convert_encoding($this->object_message, 'UTF-8', $encoding);
    // Converts the UTF-8 result back to a single-byte charset.
    $subject = iconv('utf-8', 'ISO-8859-2', $subject);
    return $subject;
}
?>

Just for the record, the string that you labelled "Chinese" is in fact Japanese... – Passerby, Apr 04 '13 at 02:22
Sorry for that, I just copy/pasted the first examples I found. Fixed. – Asenar, Apr 04 '13 at 21:46

score 2 · Accepted Answer · answered Apr 04 '13 at 09:57

If you can't display them, you can't put them into the database correctly either.

You can't reliably detect which encoding a byte string is in just by looking at the bytes. UTF-8 is the exception, because its multi-byte sequences follow strict, distinctive patterns. Looking at the bytes is all `mb_detect_encoding` does, so it is useless for anything but choosing between a very small number of encodings with mutually exclusive properties.
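
To illustrate the point, here is a minimal sketch. The sample bytes and variable names are assumptions made up for this example, not taken from the question. The same four bytes are well-formed in more than one legacy encoding, and only the strict UTF-8 check rejects them:

```php
<?php
// "你好" as a GB2312/EUC-CN sender would encode it (illustrative sample bytes).
$bytes = "\xC4\xE3\xBA\xC3";

// Ask mbstring whether the bytes are well-formed in each candidate encoding.
$candidates = ['UTF-8', 'EUC-CN', 'ISO-8859-1'];
$validIn = [];
foreach ($candidates as $candidate) {
    if (mb_check_encoding($bytes, $candidate)) {
        $validIn[] = $candidate;
    }
}

// Both legacy encodings accept the bytes, so validity alone cannot tell
// you which one the sender used; UTF-8 is the only candidate ruled out.
echo implode(', ', $validIn), "\n";

// Converting with the wrong guess still "succeeds" and silently yields mojibake.
$right = mb_convert_encoding($bytes, 'UTF-8', 'EUC-CN');     // the intended text
$wrong = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // Latin-1 mojibake
echo $right, "\n", $wrong, "\n";
```

This is why reordering the candidate list only moves the failure from one language to another: the first encoding that accepts the bytes wins, whether or not it is the right one.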

When you receive the email, you should read the charset declared in its headers and use that charset to convert the data to UTF-8. Do not convert to ISO-8859-2: it is a small single-byte charset that cannot represent Japanese or Chinese, so you will lose most characters.
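
A minimal sketch of that approach follows. The helper names `subject_to_utf8` and `body_to_utf8`, the ISO-8859-1 fallback and the sample values are assumptions for illustration only. It uses the charset named in the subject's encoded-word or in the `charset` parameter of `Content-Type` instead of guessing:

```php
<?php
/**
 * Hypothetical helper: decode a raw Subject header value to UTF-8, using
 * the charset named inside each MIME encoded-word (=?charset?B?...?=).
 */
function subject_to_utf8(string $rawSubject): string
{
    $decoded = iconv_mime_decode($rawSubject, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');
    // Sketch-level error handling: fall back to the raw value on failure.
    return $decoded === false ? $rawSubject : $decoded;
}

/**
 * Hypothetical helper: convert a text body part to UTF-8 using the charset
 * parameter of its Content-Type header value. Assumes the transfer encoding
 * (base64 / quoted-printable) has already been undone.
 */
function body_to_utf8(string $body, string $contentType, string $fallback = 'ISO-8859-1'): string
{
    // The fallback for a missing charset parameter is an assumption of this sketch.
    $charset = $fallback;
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    // iconv() returns false for an unknown charset or undecodable input.
    $converted = @iconv($charset, 'UTF-8', $body);
    return $converted === false ? $body : $converted;
}

// Illustrative values: the same text sent as a GB2312 subject and body.
echo subject_to_utf8('=?GB2312?B?xOO6ww==?='), "\n";
echo body_to_utf8("\xC4\xE3\xBA\xC3", 'text/plain; charset="GB2312"'), "\n";
```

Store the converted UTF-8 text in the database and display it as is; no further conversion is needed on output.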

You could use a PHP email parser that returns the email contents in UTF-8.

Esailija

I first need to construct the "raw content", because I get the mail from an external source over an IMAP connection. Something like http://stackoverflow.com/questions/10293614/how-to-get-raw-email-data-with-the-imap-extension, which I should adapt for my case (I'm pasting it here in case anyone has the same problem). – Asenar Apr 06 '13 at 13:03
@Asenar What is the problem? get the raw content -> pass it to the library -> get parsed UTF-8 content. – Esailija Apr 06 '13 at 13:23
How to get the raw content is the problem. That "PHP email parser" works for a simple HTML or text body, but there are problems when there are attachments. – Asenar Apr 07 '13 at 22:25
@Asenar ok so look for a library that does attachment, I am not your personal google slave :) – Esailija Apr 17 '13 at 07:31
Sorry for the notification, I just thought I had to uncheck it because it didn't answer the question. I think I will contribute to that lib later to fix what's wrong. – Asenar Apr 17 '13 at 07:38
@Asenar The exact library in my answer is not the answer, but the principle is. The principle is to read the encoding from the email headers instead of guessing/detecting it, which is **impossible**. Libraries implement this principle. I linked to one library as the first result from Google; just because it doesn't work for your special needs doesn't make my answer invalid, because my answer's point is not that specific library at all. – Esailija Apr 17 '13 at 07:40
I unchecked it not because of the attachment problems, I fixed those already. That library is just incomplete for handling subjects properly. But thank you for your help anyway; even if it doesn't exactly answer the question, I re-accepted your answer. It was just to make things clear for the next people who read it. – Asenar Apr 17 '13 at 08:02
@Asenar The only value in my answer is basically that `mb_detect_encoding` is a worthless piece of crap. That is the answer to your question, the library is just a bonus. :P – Esailija Apr 17 '13 at 08:07
