I'm trying to parse some mails with Perl, and it's not easy. Here is an example:

From: abc@def.de
Content-Type: multipart/mixed;
        boundary="----_=_NextPart_001_01CBE273.65A0E7AA"
To: ghi@def.de

This is a multi-part message in MIME format.

------_=_NextPart_001_01CBE273.65A0E7AA
Content-Type: multipart/alternative;
        boundary="----_=_NextPart_002_01CBE273.65A0E7AA"


------_=_NextPart_002_01CBE273.65A0E7AA
Content-Type: text/plain;
        charset="UTF-8"
Content-Transfer-Encoding: base64

[base64-content]
------_=_NextPart_002_01CBE273.65A0E7AA
Content-Type: text/html;
        charset="UTF-8"
Content-Transfer-Encoding: base64

[base64-content]
------_=_NextPart_002_01CBE273.65A0E7AA--
------_=_NextPart_001_01CBE273.65A0E7AA
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit

X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="----_=_NextPart_003_01CBE272.13692C80"
From: bla@bla.de
To: xxx@xxx.de

This is a multi-part message in MIME format.

------_=_NextPart_003_01CBE272.13692C80
Content-Type: multipart/alternative;
        boundary="----_=_NextPart_004_01CBE272.13692C80"


------_=_NextPart_004_01CBE272.13692C80
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

=20

Viele Gr=FC=DFe

------_=_NextPart_004_01CBE272.13692C80
Content-Type: text/html;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html>...</html>
------_=_NextPart_004_01CBE272.13692C80--
------_=_NextPart_003_01CBE272.13692C80
Content-Type: application/x-zip-compressed;
        name="abc.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
        filename="abc.zip"

[base64-content]

------_=_NextPart_003_01CBE272.13692C80--
------_=_NextPart_001_01CBE273.65A0E7AA--

This mail was sent from Outlook with another message attached. As you can see, it is a very complex mail with many different content types (text/plain, text/html, message/rfc822, application/x-zip-compressed). The message/rfc822 part is the problem. I've written a script in Perl 5.8 (Debian Squeeze) to parse this message with MIME::Parser:

use strict;
use warnings;
use MIME::Parser;

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not in temp files
my $top_entity = $parser->parse(\*STDIN);
my $plain_body = "";
my $html_body  = "";
foreach my $part ($top_entity->parts_DFS) {
    my $content_type = $part->effective_type;
    my $body         = $part->bodyhandle;
    next unless $body;         # skip parts without a body handle
    if ($content_type eq 'text/plain') {
        $plain_body .= "\n" if $plain_body ne '';
        $plain_body .= $body->as_string;
    } elsif ($content_type eq 'text/html') {
        $html_body .= "\n" if $html_body ne '';
        $html_body .= $body->as_string;
    }
}
# parsing of attachment comes later
print $plain_body;

The first message part (base64 content) contains German umlauts, which are shown correctly on STDOUT. The nested message/rfc822 message is parsed by MIME::Parser automatically and is merged with the top-level body into one entity. As you can see, this nested message also contains German umlauts, encoded as quoted-printable, but these are not shown correctly on STDOUT. When I call

utf8::encode($plain_body);

before print, the quoted-printable umlauts are shown correctly, but the base64-encoded ones are not. I've been trying for hours to extract the message/rfc822 part separately and convert its encoding, but nothing helps. Can anyone help?

Regards

rabudde

1 Answer

Assuming that your console displays UTF-8, this makes sense. Without any conversion, the decoded bytes are printed as they are, so the UTF-8 part is displayed correctly, but the Latin-1 (iso-8859-1) characters are not.
When you then convert to UTF-8 with utf8::encode, the Latin-1 umlauts become valid UTF-8, but the data that was already UTF-8 gets encoded a second time. So only the former Latin-1 umlauts are shown correctly.
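
To illustrate the effect, here is a minimal sketch using only core Perl. The sample text "Grüße" and the variable names are my own assumptions, not taken from your mail. The two byte strings stand for what a body looks like once the transfer encoding (base64 or quoted-printable) has been removed.

```perl
use strict;
use warnings;

# Assumed sample: the word "Grüße" as raw bytes in each charset.
my $from_utf8_part   = "Gr\xC3\xBC\xC3\x9Fe";   # bytes from a UTF-8 part
my $from_latin1_part = "Gr\xFC\xDFe";           # bytes from an iso-8859-1 part

# utf8::encode treats every byte as one Latin-1 character
# and rewrites the string in place as UTF-8.
my $utf8_twice = $from_utf8_part;
utf8::encode($utf8_twice);
my $latin1_once = $from_latin1_part;
utf8::encode($latin1_once);

# Hex dumps make the difference visible independent of the terminal.
print unpack("H*", $from_utf8_part), "\n";   # 4772c3bcc39f65 (valid UTF-8)
print unpack("H*", $latin1_once),    "\n";   # 4772c3bcc39f65 (now valid UTF-8)
print unpack("H*", $utf8_twice),     "\n";   # 4772c383c2bcc383c29f65 (encoded twice)
```

The Latin-1 bytes end up identical to the correct UTF-8 bytes, while the bytes that were already UTF-8 are mangled, which matches what you see on the console.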

There is no way to get this right without looking at the "charset" parameter in the Content-Type header of each part and decoding that part accordingly.
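
A rough sketch of that idea, using the core Encode module, could look like this. The helper name decode_part_body and the fallback to iso-8859-1 for a missing or unknown charset are my own assumptions (strictly speaking, the MIME default for text parts is us-ascii). In your loop, the bytes would come from $body->as_string, and the charset could probably be read with $part->head->mime_attr('content-type.charset'); I have not tested this against your mail.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: convert the transfer-decoded bytes of one part
# into Perl characters, based on the charset declared for that part.
sub decode_part_body {
    my ($bytes, $charset) = @_;
    # Assumption: fall back to iso-8859-1 if the charset is missing
    # or not known to Encode.
    $charset = 'iso-8859-1'
        unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# Stand-ins for the two text/plain parts (assumed sample text).
my $plain_body = join "\n",
    decode_part_body("Viele Gr\xC3\xBC\xC3\x9Fe", 'UTF-8'),
    decode_part_body("Viele Gr\xFC\xDFe",         'iso-8859-1');

# Encode exactly once, on output, for a UTF-8 console.
binmode STDOUT, ':encoding(UTF-8)';
print $plain_body, "\n";
```

The point of the design is that all parts are turned into characters first, each with its own charset, and the conversion to UTF-8 happens only once at the output layer instead of via utf8::encode on mixed data.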

Ingo
  • OK, thanks. I understand what the problem is. I'm now using a PHP script, which I'm much more familiar with. – rabudde May 16 '11 at 04:41