C++ decode e-mail's subject

Question

I've downloaded e-mails with Poco/Net/POP3ClientSession and want to convert the subject into human-readable form. I tried neagoegab's solution from https://stackoverflow.com/a/8104496/1350091, but unfortunately it doesn't work:

#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MailMessage.h>
#include <iostream>
#include <string>
using namespace std;
using namespace Poco::Net;


#include <iconv.h>
#include <cstring> // strlen

const size_t BUF_SIZE=1024;


class IConv {
    iconv_t ic_;
public:
    IConv(const char* to, const char* from)
        : ic_(iconv_open(to,from))    { }
    ~IConv() { iconv_close(ic_); }

     bool convert(char* input, char* output, size_t& out_size) {
        size_t inbufsize = strlen(input)+1;
        return iconv(ic_, &input, &inbufsize, &output, &out_size);
     }
};


int main()
{
    POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");

    POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);
    cout << "id: " << messages[0].id << " size: " << messages[0].size << endl;

    MailMessage message;
    session.retrieveMessage(messages[0].id, message);
    const string subject = message.getSubject();


    cout << "Original subject: " << subject << endl;

    IConv iconv_("UTF8","ISO-8859-2");


    char from[BUF_SIZE] = {};// "=?ISO-8859-2?Q?Re: M=F3j sen o JP II?=";
    subject.copy(from, sizeof(from) - 1); // copy() does not null-terminate, so keep the last zero byte
    char to[BUF_SIZE] = "bye";
    size_t outsize = BUF_SIZE;//you will need it

    iconv_.convert(from, to, outsize);
    cout << "converted: " << to << endl;
}

The output is:

id: 1 size: 2792
Original subject: =?ISO-8859-2?Q?Re: M=F3j sen o JP II?=
converted: =?ISO-8859-2?Q?Re: M=F3j sen o JP II?=

Interestingly, going the other way and encoding the subject with POCO does not give the original encoded form either:

cout << "Encoded with POCO: " << MailMessage::encodeWord("Re: Mój sen o JP II", "ISO-8859-2") << endl; // output: Encoded with POCO: =?ISO-8859-2?q?Re=3A_M=C3=B3j_sen_o_JP_II?=

But the subject I want to get is "Re: Mój sen o JP II". The only successful way I have found to decode the subject is Python's email.header.decode_header: https://docs.python.org/2/library/email.header.html#email.header.decode_header

So my question is: how can I decode an e-mail subject in C++ into an encoding such as UTF-8?

Find the relevant RFC, code it up. As I recall mail and NNTP messages use slightly different conventions. — Cheers and hth. - Alf, Jan 02 '17 at 07:42
@Alf before writing any code yourself, research whether someone already did the work for you. Especially with established RFCs, there are lots of existing implementations. — Roland Illig, Jan 05 '17 at 08:05
I just submitted https://github.com/pocoproject/poco/issues/1543. — Roland Illig, Jan 05 '17 at 23:54
Technically, those spaces do not meet the spec for encoded words, however, any real library should cope with them. — Max, Jan 09 '17 at 20:24
[The issue has been fixed.](https://github.com/pocoproject/poco/issues/1543) in November 2017. You should update to 1.9.0 and make your code simpler now. — Roland Illig, Apr 28 '19 at 13:55

score 4 · Answer 1 · edited Oct 07 '21 at 08:57

The RFC relevant to your situation is RFC 2047, which specifies how non-ASCII text is encoded in mail headers. Your subject uses its "Q" encoding (the ?Q? after the charset name). In that encoding, every byte that is not a printable ASCII character is escaped as an '=' character followed by two hexadecimal digits, and a space may be written as '_'. Since "ó" is represented by the byte 0xF3 in ISO-8859-2, and 0xF3 is not a printable ASCII character, it is encoded as "=F3". To recover the subject you need to decode all of these escapes and then convert the resulting bytes from the declared charset (here ISO-8859-2) to the encoding you want, such as UTF-8.
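As a rough illustration of the decoding step, here is a minimal sketch of a "Q" decoder that uses only the standard library. The names hexValue and decodeQ are made up for this example and are not part of POCO or any other library. The function returns raw bytes in the original charset (ISO-8859-2 in the question), so a charset conversion to UTF-8 is still needed afterwards.

```cpp
#include <cstddef>
#include <string>

// Returns the numeric value of a hex digit, or -1 if c is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decodes the encoded-text part of a "Q" encoded word into raw bytes.
// The bytes are still in the charset named by the encoded word.
std::string decodeQ(const std::string& text)
{
    std::string bytes;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (c == '_')
        {
            bytes += ' ';                      // "_" stands for a space
        }
        else if (c == '=' && i + 2 < text.size()
                 && hexValue(text[i + 1]) >= 0 && hexValue(text[i + 2]) >= 0)
        {
            // "=XX" stands for the byte with hexadecimal value XX
            bytes += static_cast<char>(hexValue(text[i + 1]) * 16 + hexValue(text[i + 2]));
            i += 2;
        }
        else
        {
            bytes += c;                        // everything else is copied as-is
        }
    }
    return bytes;
}
```

With the subject from the question, decodeQ("Re: M=F3j sen o JP II") yields the same text with the single byte 0xF3 in place of "=F3". Malformed escapes such as a trailing "=" are copied through unchanged in this sketch; a stricter decoder might reject them instead.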

baziorek · Accepted Answer · 2017-01-05T07:32:37.447

I found a way to solve the problem (I'm not sure it is a 100% correct solution): it looks like it is enough to use Poco::UTF8Encoding::convert to convert each character code to UTF-8:

#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MessageHeader.h>
#include <Poco/Net/MailMessage.h>
#include <Poco/UTF8Encoding.h>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>

using namespace std;
using namespace Poco::Net;

class EncoderLatin2
{
public:
    EncoderLatin2(const string& encodedSubject)
    {
        ///    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
        int charsetBeginPosition = strlen("=?");
        int charsetEndPosition = encodedSubject.find("?", charsetBeginPosition);
        charset = encodedSubject.substr(charsetBeginPosition, charsetEndPosition-charsetBeginPosition);

        int encodingPosition = charsetEndPosition + strlen("?");
        encoding = encodedSubject[encodingPosition];

        if ("ISO-8859-2" != charset)
            throw std::invalid_argument("Unsupported charset: " + charset);

        const int lengthOfEncodedText = encodedSubject.length() - encodingPosition-strlen("?=")-2;
        extractedEncodedSubjectToConvert = encodedSubject.substr(encodingPosition+2, lengthOfEncodedText);
    }

    string convert()
    {
        size_t positionOfAssignment = -1;

        while (true)
        {
            positionOfAssignment = extractedEncodedSubjectToConvert.find('=', positionOfAssignment+1);
            if (string::npos != positionOfAssignment)
            {
                const string& charHexCode = extractedEncodedSubjectToConvert.substr(positionOfAssignment + 1, 2);
                replaceAllSubstringsWithUnicode(extractedEncodedSubjectToConvert, charHexCode);
            }
            else
                break;
        }
        return extractedEncodedSubjectToConvert;
    }

    void replaceAllSubstringsWithUnicode(string& s, const string& charHexCode)
    {
        const int charCode = stoi(charHexCode, nullptr, 16);

        char buffer[10] = {};
        encodingConverter.convert(charCode, (unsigned char*)buffer, sizeof(buffer));
        replaceAll(s, '=' + charHexCode, buffer);
    }

    void replaceAll(string& s, const string& replaceFrom, const string& replaceTo)
    {
        size_t needlePosition = -1;
        while (true)
        {
            needlePosition = s.find(replaceFrom, needlePosition + 1);
            if (string::npos == needlePosition)
                break;

            s.replace(needlePosition, replaceFrom.length(), replaceTo);
        }
    }


private:
    string charset;
    char encoding;
    Poco::UTF8Encoding encodingConverter;

    string extractedEncodedSubjectToConvert;
};

int main()
{
    POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");


    POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);

    MessageHeader header;
    MailMessage message;

    auto currentMessage = messages[0];

    session.retrieveHeader(currentMessage.id, header);
    session.retrieveMessage(currentMessage.id, message);

    const string subject = message.getSubject();

    EncoderLatin2 encoder(subject);
    cout << "Original subject: " << subject << endl;
    cout << "Decoded: " << encoder.convert() << endl;
}
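
One caveat about the code above: it passes the ISO-8859-2 byte value to Poco::UTF8Encoding::convert as if it were a Unicode code point. That happens to be right for "ó" (0xF3 is U+00F3), but not for every Latin-2 character; for example, 0xB1 is "ą" (U+0105) in ISO-8859-2. A more general approach is probably to decode the "=XX" escapes to raw bytes first and then let iconv, which the question already uses, do the charset conversion. The sketch below assumes a POSIX iconv with the glibc-style char** signature; toUtf8 is a hypothetical helper name.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts raw bytes from the charset named by `from` to UTF-8 using iconv.
// Assumption: glibc-style iconv() taking char** for the input buffer.
std::string toUtf8(const std::string& bytes, const char* from)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Generous output buffer: far more than a single-byte charset can need.
    std::string out(bytes.size() * 4 + 4, '\0');

    char* in = const_cast<char*>(bytes.data()); // iconv does not modify the input
    std::size_t inLeft = bytes.size();
    char* outPtr = &out[0];
    std::size_t outLeft = out.size();

    const std::size_t rc = iconv(cd, &in, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - outLeft);           // keep only the bytes written
    return out;
}
```

This also explains the output in the question: the undecoded subject consists of ASCII characters only, and ASCII is identical in ISO-8859-2 and UTF-8, so iconv returns it unchanged. The conversion only has an effect once "=F3" has been turned into the byte 0xF3.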

baziorek · Answer 3 · 2017-01-09T06:14:46.497

I found another solution, better than the previous one. E-mail subjects come in different encodings; so far I have noticed:

Latin2, encoded like: =?ISO-8859-2?Q?...?=
UTF-8 Base64 like: =?utf-8?B?Wm9iYWN6Y2llIGNvIGRsYSBXYXMgcHJ6eWdvdG93YWxpxZtteSAvIHN0eWN6ZcWEIHcgTGFzZXJwYXJrdQ==?=
UTF-8 quoted printable like: =?utf-8?Q?...?=
No encoding (if only ASCII characters) like: ...
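
The encoded forms in this list share the RFC 2047 layout "=?" charset "?" encoding "?" encoded-text "?=", so they can be told apart by parsing out the charset and the encoding letter. Here is a minimal sketch with std::regex. EncodedWord and findEncodedWords are names invented for this illustration, and the pattern is a simplification that does not enforce every rule of the RFC (for example the length limit on encoded words).

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// One parsed encoded word: "=?" charset "?" encoding "?" text "?="
struct EncodedWord
{
    std::string charset;   // e.g. "ISO-8859-2" or "utf-8", as written in the header
    char encoding;         // 'B' (Base64) or 'Q', normalised to upper case
    std::string text;      // still encoded; to be decoded according to `encoding`
};

// Finds every encoded word in a header value. Plain ASCII text between or
// around the encoded words is ignored by this sketch.
std::vector<EncodedWord> findEncodedWords(const std::string& header)
{
    static const std::regex pattern(R"(=\?([^?]+)\?([BbQq])\?([^?]*)\?=)");

    std::vector<EncodedWord> words;
    for (std::sregex_iterator it(header.begin(), header.end(), pattern), end;
         it != end; ++it)
    {
        const std::smatch& m = *it;
        const char letter = static_cast<char>(
            std::toupper(static_cast<unsigned char>(m[2].str()[0])));
        words.push_back({m[1].str(), letter, m[3].str()});
    }
    return words;
}
```

A subject without any encoded word gives an empty vector, which corresponds to the "no encoding" case. Unlike a check that the whole subject starts with "=?" and ends with "?=", this also finds several encoded words in one header, which can occur with long subjects.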

So with POCO (Base64Decoder, Latin2Encoding, UTF8Encoding, QuotedPrintableDecoder) I managed to decode all of these cases:

#include <cctype>
#include <cstring>
#include <iostream>
#include <string>
#include <sstream>

#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MessageHeader.h>
#include <Poco/Net/MailMessage.h>
#include <Poco/Base64Decoder.h>
#include <Poco/Latin2Encoding.h>
#include <Poco/UTF8Encoding.h>
#include <Poco/Net/QuotedPrintableDecoder.h>

using namespace std;

class Encoder
{
public:
    Encoder(const string& encodedText)
    {
        isStringEncoded = isEncoded(encodedText);
        if (!isStringEncoded)
        {
            extractedEncodedSubjectToConvert = encodedText;
            return;
        }

        splitEncodedText(encodedText);
    }

    string convert()
    {
        if (isStringEncoded)
        {
            if (Poco::Latin2Encoding().isA(charset))
                return decodeFromLatin2();
            if (Poco::UTF8Encoding().isA(charset))
                return decodeFromUtf8();
        }

        return extractedEncodedSubjectToConvert;
    }

private:
    void splitEncodedText(const string& encodedText)
    {
        ///    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
        const int charsetBeginPosition = strlen(sequenceBeginEncodedText);
        const int charsetEndPosition = encodedText.find("?", charsetBeginPosition);
        charset = encodedText.substr(charsetBeginPosition, charsetEndPosition-charsetBeginPosition);

        const int encodingPosition = charsetEndPosition + strlen("?");
        encoding = encodedText[encodingPosition];

        const int lengthOfEncodedText = encodedText.length() - encodingPosition-strlen(sequenceBeginEncodedText)-strlen(sequenceEndEncodedText);
        extractedEncodedSubjectToConvert = encodedText.substr(encodingPosition+2, lengthOfEncodedText);
    }

    bool isEncoded(const string& encodedSubject)
    {
        if (encodedSubject.size() < 4)
            return false;

        if (0 != encodedSubject.find(sequenceBeginEncodedText))
            return false;

        const unsigned positionOfLastTwoCharacters = encodedSubject.size() - strlen(sequenceEndEncodedText);
        return positionOfLastTwoCharacters == encodedSubject.rfind(sequenceEndEncodedText);
    }

    string decodeFromLatin2()
    {
        size_t positionOfAssignment = -1;
        while (true)
        {
            positionOfAssignment = extractedEncodedSubjectToConvert.find('=', positionOfAssignment+1);
            if (string::npos != positionOfAssignment)
            {
                const string& charHexCode = extractedEncodedSubjectToConvert.substr(positionOfAssignment + 1, 2);
                replaceAllSubstringsWithUnicode(extractedEncodedSubjectToConvert, charHexCode);
            }
            else
                break;
        }
        return extractedEncodedSubjectToConvert;
    }

    void replaceAllSubstringsWithUnicode(string& s, const string& charHexCode)
    {
        static Poco::UTF8Encoding encodingConverter;
        const int charCode = stoi(charHexCode, nullptr, 16);

        char buffer[10] = {};
        encodingConverter.convert(charCode, (unsigned char*)buffer, sizeof(buffer));
        replaceAll(s, '=' + charHexCode, buffer);
    }

    void replaceAll(string& s, const string& replaceFrom, const string& replaceTo)
    {
        size_t needlePosition = -1;
        while (true)
        {
            needlePosition = s.find(replaceFrom, needlePosition + 1);
            if (string::npos == needlePosition)
                break;

            s.replace(needlePosition, replaceFrom.length(), replaceTo);
        }
    }

    string decodeFromUtf8()
    {
        if('B' == toupper(encoding))
        {
            return decodeFromBase64();
        }
        else // if Q:
        {
            return decodeFromQuotedPrintable();
        }
    }

    string decodeFromBase64()
    {
        istringstream is(extractedEncodedSubjectToConvert);
        Poco::Base64Decoder e64(is);

        extractedEncodedSubjectToConvert.clear();
        string buffer;
        while(getline(e64, buffer))
            extractedEncodedSubjectToConvert += buffer;
        return extractedEncodedSubjectToConvert;
    }

    string decodeFromQuotedPrintable()
    {
        replaceAll(extractedEncodedSubjectToConvert, "_", " ");


        istringstream is(extractedEncodedSubjectToConvert);
        Poco::Net::QuotedPrintableDecoder qp(is);

        extractedEncodedSubjectToConvert.clear();
        string buffer;
        while(getline(qp, buffer))
            extractedEncodedSubjectToConvert += buffer;
        return extractedEncodedSubjectToConvert;
    }


private:
    string charset;
    char encoding;

    string extractedEncodedSubjectToConvert;
    bool isStringEncoded;

    static constexpr const char* sequenceBeginEncodedText = "=?";
    static constexpr const char* sequenceEndEncodedText   = "?=";
};

int main()
{
    Poco::Net::POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");

    Poco::Net::POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);

    Poco::Net::MessageHeader header;
    Poco::Net::MailMessage message;

    auto currentMessage = messages[0];

    session.retrieveHeader(currentMessage.id, header);
    session.retrieveMessage(currentMessage.id, message);    

    const string subject = message.getSubject();

    Encoder encoder(subject);
    cout << "Original subject: " << subject << endl;
    cout << "Decoded: " << encoder.convert() << endl;
}
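
For the Base64 ("B") case, the decoding that Poco::Base64Decoder performs above can also be sketched with the standard library alone, which may help when reading or testing the logic without POCO. decodeBase64 is a hypothetical name. The sketch assumes the standard Base64 alphabet, stops at the first '=' padding character and silently skips any character outside the alphabet.

```cpp
#include <cstddef>
#include <string>

// Decodes Base64 text into raw bytes (standard alphabet, '=' padding).
std::string decodeBase64(const std::string& text)
{
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string bytes;
    unsigned int buffer = 0;   // collects 6-bit groups
    int bits = 0;              // number of not yet emitted bits in `buffer`

    for (char c : text)
    {
        if (c == '=')
            break;             // padding marks the end of the data

        const std::size_t value = alphabet.find(c);
        if (value == std::string::npos)
            continue;          // skip characters outside the alphabet

        buffer = (buffer << 6) | static_cast<unsigned int>(value);
        bits += 6;
        if (bits >= 8)
        {
            bits -= 8;         // a full byte is available: emit it
            bytes += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return bytes;
}
```

For instance, the first twelve characters of the UTF-8 Base64 subject shown earlier, "Wm9iYWN6Y2ll", decode to "Zobaczcie". Since the charset of that subject is already utf-8, the decoded bytes need no further conversion.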

Shouldn't this feature be builtin into the POCO library? Every email parser needs it, and it needs it in the same way. So there's no point in having every application write the same code again. — Roland Illig, Jan 05 '17 at 08:03
True, there should be something builtin easier to use. Everything I found is how to encode word of mail message: https://pocoproject.org/docs/Poco.Net.MailMessage.html#22506 , but not how to decode in portable way — baziorek, Jan 05 '17 at 09:10
