I'm trying to convert the HtmlBody of the e-mails I get from a mail server using MailKit to PDF, and it looks like iTextSharp doesn't really like the HTML I'm passing it.

My method works well with a "sample" HTML string, but with the real e-mail body I get a "The document has no pages" error, which looks like it's thrown when the input is no longer recognized as HTML.

public void GenerateHtmlFromBody(UniqueId uid)
{
    var email = imap.Inbox.GetMessage(uid);
    Byte[] bytes;

    using (var ms = new MemoryStream())
    {
        using (var doc = new Document())
        {
            using (var writer = PdfWriter.GetInstance(doc, ms))
            {
                doc.Open();

                //Sample HTML and CSS
                var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
                var example_css = @".headline{font-size:200%}";

                using (var srHtml = new StringReader(email.HtmlBody))
                {
                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
                }
                doc.Close();
            }
        }
        bytes = ms.ToArray();
    }
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "processedMailPdf.pdf");
    System.IO.File.WriteAllBytes(testFile, bytes);
}

I'm accessing MimeMessage.HtmlBody and, while debugging, it looks like it is in fact HTML.

Here is a link to pastebin for checking the HtmlBody of the MimeMessage because I hit the character limit here.

What am I missing? Thanks.

EDIT: I've tried using the HTMLWorker (which is deprecated) and it's not stable. It worked with one e-mail but not with others. Of course it wasn't a solution, but it finally generated a pdf from Mailkit, which was "something".

Gonzo345
  • Have you tried another email? Also the html looks horrible, my editor isn't even able to recognize collapsible tags (VS Code) – Sebastian L Mar 03 '17 at 11:30
  • Thanks for pointing that out. That was a forwarded mail from a newsletter. I've just tried with the typical Outlook "autotesting" mail, which Outlook sends to test whether the connectivity is good, and it doesn't work. The thing is that MailKit recognizes it as HTML, but looking into the HtmlBody it's just plain text :o – Gonzo345 Mar 03 '17 at 11:36
  • you should try something like this: https://raw.githubusercontent.com/leemunroe/responsive-html-email-template/master/email.html – Sebastian L Mar 03 '17 at 11:41
  • I've just tried to convert an e-mail from my main account directly, which looks like it respects all the tags, with no luck either. Could it be something related to MailKit? The "example_html" string I have there works fine, huh – Gonzo345 Mar 03 '17 at 12:06
  • The problem is probably that you are using an XHTML parser, which will only work with HTML that strictly conforms to XML standards (your sample on pastebin does not). You could try using HtmlAgilityPack to parse it instead, but I'm not sure if that would allow you to convert it to a PDF. – jstedfast Mar 03 '17 at 15:27
  • Try to make sure that every piece of text in your HTML is inside a HTML element. If in doubt use `"" + email.HtmlBody + ""` – mkl Mar 03 '17 at 17:15
  • @mkl - If `HtmlBody` is indeed sometimes plain text as noted in the second comment, `"" + email.HtmlBody + ""` throws `Unhandled Exception: System.IO.IOException: The document has no pages.`. Tested with iTextSharp and XML Worker versions 5.5.10. Maybe you found a bug.... – kuujinbo Mar 03 '17 at 22:18
  • Probably one should use a content oriented tag like `
    ` for that?
    – mkl Mar 04 '17 at 06:31

2 Answers

Looks like you're facing two issues with HtmlBody:

  1. It may be plain text.
  2. When [X]HTML, it is not well-formed.

Anytime there's a possibility you're dealing with a string that is not well-formed XML, your best bet is to use a parser like HtmlAgilityPack to clean up the mess. Here's a simple helper method using XPath to cover both issues above, and UPDATED based on comments to remove HtmlCommentNodes that break iText XML Worker:

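// Requires the HtmlAgilityPack package and: using HtmlAgilityPack;
string FixBrokenMarkup(string broken)
{
    // Ask HtmlAgilityPack to close unclosed tags, repair bad nesting,
    // and write empty elements as closed so the output is XML-friendly
    HtmlDocument h = new HtmlDocument()
    {
        OptionAutoCloseOnEnd = true,
        OptionFixNestedTags = true,
        OptionWriteEmptyNodes = true
    };
    h.LoadHtml(broken);

    // UPDATED to remove HtmlCommentNode (e.g. Word's <![endif]>),
    // which breaks iText XML Worker
    var comments = h.DocumentNode.SelectNodes("//comment()");
    if (comments != null)
    {
        foreach (var node in comments) { node.Remove(); }
    }

    // XPath child::* matches element children only, so a null result
    // means the input was plain text rather than markup/tags
    return h.DocumentNode.SelectNodes("child::*") != null
        ? h.DocumentNode.WriteTo()                    // markup: emit the repaired HTML
        : string.Format("<span>{0}</span>", broken);  // plain text: wrap in an element
}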

And for completeness, code to generate the PDF. Tested and working with the pastebin link you provided above:

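// Requires: using System.IO; using iTextSharp.text;
//           using iTextSharp.text.pdf; using iTextSharp.tool.xml;
// PASTEBIN is the HTML string from the pastebin link; OUTPUT is the PDF path
var fixedMarkup = FixBrokenMarkup(PASTEBIN);
// swap initialization to verify plain-text works too
// var fixedMarkup = FixBrokenMarkup("some text");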

using (var stream = new MemoryStream())
{
    using (var document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        using (var stringReader = new StringReader(fixedMarkup))
        {
            XMLWorkerHelper.GetInstance().ParseXHtml(
                writer, document, stringReader
            );
        }
    }
    File.WriteAllBytes(OUTPUT, stream.ToArray());
}
kuujinbo
  • Oh my! Thank you so much, this works like a charm! I basically get that HtmlAgilityPack is parsing "the HTML" string and analyzing the structure, adding the missing basic HTML tags in case they're needed, right? Amazing :o – Gonzo345 Mar 06 '17 at 06:40
  • @Gonzo345 - Yes, that's correct. HtmlAgilityPack adds missing end tags and also tries to clean up incorrectly nested tags, e.g. `td`, `li`, etc. – kuujinbo Mar 06 '17 at 16:28
  • Hi again @kuujinbo! I'm facing problems with a single e-mail which apparently is parsed correctly, but when XMLWorkerHelper is called it throws the "hellable" "The document has no pages". This is the [before Agility Pack](http://pastebin.com/uu9AezyG) and the [after Agility Pack](http://pastebin.com/Mt5xQHLB). I don't initially see anything wrong; in fact it has just added the quotes. Any ideas? Everything was just so perfect :( – Gonzo345 Mar 09 '17 at 12:07
  • @Gonzo345 - I'll take a look tonight or tomorrow. Strange description for the `Exception` - wonder how that got in the iText source code.... – kuujinbo Mar 09 '17 at 16:48
  • Hi! The "hellable" was just a joke lol! I meant that I thought the problem would be totally gone once I used the Html Agility Pack, but it wasn't! The frustration comes when I can't really debug why it is failing, and since it's failing with that e-mail, it could fail with another one. The Agility Pack itself looks like it's working well; the problem is in iTextSharp :( Thanks in advance!!!! – Gonzo345 Mar 09 '17 at 20:31
  • @Gonzo345 - oh the joys of having to deal with proprietary Microsoft Office madness. :( Anyway, I get a different `Exception` - "**_iTextSharp.tool.xml.exceptions.RuntimeWorkerException: Invalid nested tag p found, expected closing tag ![endif]._**". Check the updated `FixBrokenMarkup()` method in the answer. Works for me with the pastebin snippet you posted. – kuujinbo Mar 10 '17 at 01:57
  • I have just tried your updated code and... what the hell, it's not failing! I've tried rolling it back to the previous version and it fails again, so... should it fail as it fails to you? xD Thanks! – Gonzo345 Mar 10 '17 at 07:48
  • I meant the updated code **works**, while the old code fails with a _different_ `Exception` than you were getting. The updated code works, right? It does for me on **both** the pastebin in your question, and also the pastebin in your comment above.... – kuujinbo Mar 10 '17 at 16:38
  • Yes, I meant that it already works! I had understood the opposite :'D Thank you so much for your help! P.S.: did "the comments" really make it fail? Interesting – Gonzo345 Mar 10 '17 at 18:06
  • @Gonzo345 - Word inserts stuff like `<![endif]>` when converting a document to HTML, which causes the `Exception`. – kuujinbo Mar 10 '17 at 18:52

I found that iTextSharp has a problem with the <br> tag. Use <br/> instead.
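
A minimal sketch of that replacement, assuming the rest of the markup is already well-formed. The class `BrFixer` and the helper `CloseBreakTags` are hypothetical names used for illustration, not part of iTextSharp, and the pattern only handles bare tags such as <br> or <BR >, not ones carrying attributes:

```csharp
using System;
using System.Text.RegularExpressions;

static class BrFixer
{
    // Hypothetical helper: rewrites bare <br> tags (any letter case,
    // optional whitespace before '>') as self-closed <br/>.
    // Tags that are already closed, such as <br/>, are left untouched.
    public static string CloseBreakTags(string html)
    {
        return Regex.Replace(html, @"<br\s*>", "<br/>", RegexOptions.IgnoreCase);
    }
}
```

The result of `BrFixer.CloseBreakTags(email.HtmlBody)` could then be passed to the `StringReader` that feeds `ParseXHtml`. For markup with more problems than unclosed <br> tags, a real parser such as HtmlAgilityPack is the more robust route.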

  • That's an example of what other answers and comments here meant when mentioning xml - itext xmlworker expects xhtml, i.e. every opening element must be closed. A single `<br>`, therefore, is invalid and `<br/>` is valid. – mkl Jan 15 '21 at 09:07