arghhh, it's not easy. I'm trying to parse some mails with perl. Let's take an example:
From: abc@def.de
Content-Type: multipart/mixed;
boundary="----_=_NextPart_001_01CBE273.65A0E7AA"
To: ghi@def.de
This is a multi-part message in MIME format.
------_=_NextPart_001_01CBE273.65A0E7AA
Content-Type: multipart/alternative;
boundary="----_=_NextPart_002_01CBE273.65A0E7AA"
------_=_NextPart_002_01CBE273.65A0E7AA
Content-Type: text/plain;
charset="UTF-8"
Content-Transfer-Encoding: base64
[base64-content]
------_=_NextPart_002_01CBE273.65A0E7AA
Content-Type: text/html;
charset="UTF-8"
Content-Transfer-Encoding: base64
[base64-content]
------_=_NextPart_002_01CBE273.65A0E7AA--
------_=_NextPart_001_01CBE273.65A0E7AA
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----_=_NextPart_003_01CBE272.13692C80"
From: bla@bla.de
To: xxx@xxx.de
This is a multi-part message in MIME format.
------_=_NextPart_003_01CBE272.13692C80
Content-Type: multipart/alternative;
boundary="----_=_NextPart_004_01CBE272.13692C80"
------_=_NextPart_004_01CBE272.13692C80
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
=20
Viele Gr=FC=DFe
------_=_NextPart_004_01CBE272.13692C80
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<html>...</html>
------_=_NextPart_004_01CBE272.13692C80--
------_=_NextPart_003_01CBE272.13692C80
Content-Type: application/x-zip-compressed;
name="abc.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="abc.zip"
[base64-content]
------_=_NextPart_003_01CBE272.13692C80--
------_=_NextPart_001_01CBE273.65A0E7AA--
This mail is sent from Outlook with another attached message. As you can see, this is a very complex mail with many different content types (text/plain, text/html, message/rfc_822, application/xyz)... And the rfc_822 part is the problem. I've written a script in Perl 5.8 (Debian Squeeze) to parse this message with MIME::Parser.
use MIME::Parser;
my $parser = MIME::Parser->new;
$parser->output_to_core(1);
my $top_entity = $parser->parse(\*STDIN);
my $plain_body = "";
my $html_body = "";
my $content_type;
foreach my $part ($top_entity->parts_DFS) {
$content_type = $part->effective_type;
$body = $part->bodyhandle;
if ($body) {
if ($content_type eq 'text/plain') {
$plain_body = $plain_body . "\n" if ($plain_body ne '');
$plain_body = $plain_body . $body->as_string;
} elsif ($content_type eq 'text/html') {
$html_body = $html_body . "\n" if ($html_body ne '');
$html_body = $html_body . $body->as_string;
}
}
}
# parsing of attachment comes later
print $plain_body;
The first message part (base64-content) contains german umlauts, which are shown correctly at STDOUT. The nested rfc_822 message is parsed by MIME::Parser automatically and is pooled with the top level body as one entity. This nested rfc_822 contains also german umlauts in quoted-printable as you can see. But these are not shown correctly at STDOUT. When doing a
utf8::encode($plain_body);
before print, the quoted-printable umlauts are shown correctly, but not the base64 encoded ones. I'm trying now for hours to extract the rfc_822 seperatly and doing some encoding, but nothing helps. Who else can help?
Regards