I am working toward being able to input any email message and output an equivalent XML encoding.
I am starting small, with one of the email headers: the "From Header".
Here is an example of a From Header:
From: John Doe <john@doe.org>
I want it transformed into this XML:
<From>
<Mailbox>
<DisplayName>John Doe</DisplayName>
<Address>john@doe.org</Address>
</Mailbox>
</From>
I want to use the lexical analyzer "Alex" (http://www.haskell.org/alex/doc/html/) to break apart (tokenize) the From Header.
I want to use the parser "Happy" (http://www.haskell.org/happy/) to process the tokens and generate a parse tree.
Then I want to use a serializer to walk the parse tree and output XML.
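To make the last stage concrete, here is a minimal sketch of the kind of serializer I have in mind. The `From` and `Mailbox` types, and the function names, are my own placeholders for whatever parse tree the Happy grammar ends up building; nothing here comes from Alex or Happy. It emits one element per line with no indentation.

```haskell
-- Hypothetical parse-tree types; a Happy grammar could build these in its actions.
data Mailbox = Mailbox
  { displayName :: Maybe String  -- Nothing when the header has only an address
  , address     :: String
  } deriving (Show, Eq)

newtype From = From [Mailbox] deriving (Show, Eq)

-- Escape the characters that are special in XML text content.
escape :: String -> String
escape = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc c   = [c]

-- Wrap already-escaped text in an element with the given name.
element :: String -> String -> String
element name body = "<" ++ name ++ ">" ++ body ++ "</" ++ name ++ ">"

-- One mailbox becomes a <Mailbox> element; <DisplayName> is omitted when absent.
serializeMailbox :: Mailbox -> [String]
serializeMailbox (Mailbox name addr) =
  ["<Mailbox>"]
    ++ maybe [] (\n -> [element "DisplayName" (escape n)]) name
    ++ [element "Address" (escape addr)]
    ++ ["</Mailbox>"]

-- Walk the whole tree and produce the XML, one element per line.
serialize :: From -> String
serialize (From boxes) =
  unlines (["<From>"] ++ concatMap serializeMailbox boxes ++ ["</From>"])
```

For example, `putStr (serialize (From [Mailbox (Just "John Doe") "john@doe.org"]))` should print the XML shown above for the first From Header.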
The format of the From Header is specified by the Internet Message Format (IMF), RFC 5322 (https://www.rfc-editor.org/rfc/rfc5322).
Here are a few more examples of From Headers and the desired XML output:
From Header with no display name:
From: <john@doe.org>
Desired XML output:
<From>
<Mailbox>
<Address>john@doe.org</Address>
</Mailbox>
</From>
From Header with no display name and no angle brackets around the address:
From: john@doe.org
Desired XML output:
<From>
<Mailbox>
<Address>john@doe.org</Address>
</Mailbox>
</From>
From Header with multiple mailboxes, each separated by a comma:
From: <john@doe.org>, "Simon St. John" <simon@stjohn.org>, sally@smith.org
Desired XML output:
<From>
<Mailbox>
<Address>john@doe.org</Address>
</Mailbox>
<Mailbox>
<DisplayName>Simon St. John</DisplayName>
<Address>simon@stjohn.org</Address>
</Mailbox>
<Mailbox>
<Address>sally@smith.org</Address>
</Mailbox>
</From>
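The grammar I am after for the examples so far is roughly "a comma-separated list of mailboxes, where a mailbox is either a bare address or an optional display name followed by an address in angle brackets". Here is a hand-written sketch of that rule in plain Haskell, only to pin down the shape; I assume a Happy grammar would look similar. The `Token` type is hypothetical (I have not written the Alex lexer yet), and the error messages carry no position information.

```haskell
-- Hypothetical tokens that a lexer might produce for a From Header value.
data Token
  = TWord String   -- an atom or the contents of a quoted string
  | TAddr String   -- a complete address such as john@doe.org
  | TLAngle        -- "<"
  | TRAngle        -- ">"
  | TComma         -- ","
  deriving (Show, Eq)

data Mailbox = Mailbox
  { displayName :: Maybe String
  , address     :: String
  } deriving (Show, Eq)

-- mailbox-list ::= mailbox ("," mailbox)*
parseMailboxList :: [Token] -> Either String [Mailbox]
parseMailboxList ts = do
  (m, rest) <- parseMailbox ts
  case rest of
    []            -> Right [m]
    TComma : more -> (m :) <$> parseMailboxList more
    t : _         -> Left ("unexpected token " ++ show t)

-- mailbox ::= addr | word* "<" addr ">"
parseMailbox :: [Token] -> Either String (Mailbox, [Token])
parseMailbox (TAddr a : rest) = Right (Mailbox Nothing a, rest)
parseMailbox ts =
  case span isWord ts of
    (ws, TLAngle : TAddr a : TRAngle : rest) -> Right (Mailbox (name ws) a, rest)
    (_, t : _)                               -> Left ("unexpected token " ++ show t)
    (_, [])                                  -> Left "unexpected end of input"
  where
    isWord (TWord _) = True
    isWord _         = False
    -- Join the display-name words with single spaces; no words means no name.
    name [] = Nothing
    name ws = Just (unwords [w | TWord w <- ws])
```

With this sketch, the three-mailbox header above would parse to a list of three `Mailbox` values, the second one carrying the display name.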
RFC 5322 says that a comment is text enclosed in parentheses: ( … ). Here is a From Header containing a comment:
From: (this is a comment) "John Doe" <john@doe.org>
I want all comments removed during lexing.
The desired XML output is this:
<From>
<Mailbox>
<DisplayName>John Doe</DisplayName>
<Address>john@doe.org</Address>
</Mailbox>
</From>
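To be precise about what "removed" means, here is a plain-Haskell sketch of the behaviour I expect. It is not an Alex rule; I assume the real lexer would do this with Alex start codes or something similar. As far as I can tell from the RFC, comments may nest and may contain backslash-escaped characters, and parentheses inside a quoted string are not comments, so the sketch handles those cases. The function name is my own.

```haskell
-- Remove RFC 5322 style comments from a header value.
-- Tracks nesting depth, leaves quoted strings untouched,
-- and treats a backslash as escaping the next character.
stripComments :: String -> String
stripComments = outside
  where
    -- Not inside a comment or a quoted string.
    outside []         = []
    outside ('(' : cs) = comment (1 :: Int) cs
    outside ('"' : cs) = '"' : quoted cs
    outside (c : cs)   = c : outside cs

    -- Inside a quoted string: copy everything up to the closing quote.
    quoted []              = []
    quoted ('\\' : c : cs) = '\\' : c : quoted cs
    quoted ('"' : cs)      = '"' : outside cs
    quoted (c : cs)        = c : quoted cs

    -- Inside a comment at nesting depth d: drop everything.
    comment _ []              = []
    comment d ('\\' : _ : cs) = comment d cs
    comment d ('(' : cs)      = comment (d + 1) cs
    comment d (')' : cs)
      | d == 1                = outside cs
      | otherwise             = comment (d - 1) cs
    comment d (_ : cs)        = comment d cs
```

Applied to the value part of the header above, this leaves the quoted display name and the address intact and drops only the parenthesized text. An unterminated comment simply swallows the rest of the input here; a real lexer should probably report that as an error.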
The RFC says that "folding whitespace" can appear throughout the From Header: a line break may be inserted wherever whitespace is allowed, provided the continuation line begins with a space or tab. Here is a From Header with the From: token on the first line, the display name on the second line, and the address on the third line:
From:
 "John Doe"
 <john@doe.org>
The XML output should not be affected by the folding whitespace:
<From>
<Mailbox>
<DisplayName>John Doe</DisplayName>
<Address>john@doe.org</Address>
</Mailbox>
</From>
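One way to handle this might be to "unfold" the header before (or during) lexing. My reading of the RFC is that unfolding means deleting any line break that is immediately followed by a space or tab. A minimal sketch under that assumption follows; the function names are mine, and accepting a bare LF as well as CRLF is my own leniency, not something the RFC specifies.

```haskell
-- Space or horizontal tab, the two whitespace characters that can start
-- a continuation line.
isWsp :: Char -> Bool
isWsp c = c == ' ' || c == '\t'

-- Unfold a header: delete each line break that is immediately followed by
-- a space or tab, keeping the whitespace itself.
-- Accepts both CRLF and bare LF line endings (an assumption on my part).
unfold :: String -> String
unfold ('\r' : '\n' : c : cs) | isWsp c = unfold (c : cs)
unfold ('\n' : c : cs)        | isWsp c = unfold (c : cs)
unfold (c : cs)                         = c : unfold cs
unfold []                               = []
```

After unfolding, a three-line header whose continuation lines start with a space reads the same as the single-line form, so the rest of the pipeline would not need to know about folding. The drawback is that line numbers in error messages would then refer to the unfolded text.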
The RFC says that the part of the address after the @ character can be a string enclosed in square brackets (a "domain literal"), such as this:
From: "John Doe" <john@[website]>
I must admit that I have never seen an email address like that. Nonetheless, the RFC says it is allowed, so I certainly want my lexer and parser to handle such inputs. Here is the desired output:
<From>
<Mailbox>
<DisplayName>John Doe</DisplayName>
<Address>john@[website]</Address>
</Mailbox>
</From>
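To check my understanding of the two forms the domain can take, here is a small sketch that classifies the text after the @. The `Domain` type and the function names are hypothetical, and the character classes are my reading of the RFC's atext and dtext rules. It deliberately ignores whitespace and obsolete syntax inside the brackets.

```haskell
import Data.Char (isAlphaNum, isAscii)

-- The two forms of domain, as I understand the RFC.
data Domain
  = DotAtom String        -- e.g. doe.org
  | DomainLiteral String  -- e.g. [website], stored without the brackets
  deriving (Show, Eq)

-- atext: ASCII letters, digits, and a fixed set of punctuation characters.
isAtext :: Char -> Bool
isAtext c = isAscii c && isAlphaNum c || c `elem` "!#$%&'*+-/=?^_`{|}~"

-- dtext: printable ASCII except '[', ']' and backslash.
isDtext :: Char -> Bool
isDtext c = c >= '!' && c <= '~' && c `notElem` "[]\\"

-- A dot-atom is one or more non-empty runs of atext separated by single dots.
isDotAtom :: String -> Bool
isDotAtom s =
  case break (== '.') s of
    (piece, [])       -> validPiece piece
    (piece, _ : more) -> validPiece piece && isDotAtom more
  where
    validPiece p = not (null p) && all isAtext p

-- Classify the text that follows the @ in an address.
classifyDomain :: String -> Maybe Domain
classifyDomain ('[' : rest)
  | not (null rest), last rest == ']', all isDtext (init rest) =
      Just (DomainLiteral (init rest))
  | otherwise = Nothing
classifyDomain s
  | isDotAtom s = Just (DotAtom s)
  | otherwise   = Nothing
```

In the XML I want the brackets kept in the Address element, so the serializer would have to put them back for the `DomainLiteral` case.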
Error Handling
I want an error generated if the From Header is incorrect. Here are a couple of examples of erroneous From Headers and the desired output:
The display name is erroneously placed after the address:
From: <john@doe.org> "John Doe"
The output should specify the location where the error was discovered:
serialize: parse error at line 1 and column 22. Error occurred at "John Doe"
This From Header has an erroneous "23" before the display name:
From: 23 "John Doe" <john@doe.org>
Again, the output should specify the location where the error was discovered:
serialize: parse error at line 1 and column 10. Error occurred at "John Doe"
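To show how I am counting lines and columns, here is a small sketch that turns a character offset into a 1-based position and formats the message. If I understand the Alex documentation correctly, its "posn" wrapper already attaches a line and column to each token, so the real implementation may not need `posAt` at all. The `Pos` type and the function names are my own, and counting a tab as a single column is an assumption.

```haskell
-- Hypothetical position type: 1-based line and column.
data Pos = Pos { posLine :: Int, posCol :: Int } deriving (Show, Eq)

-- Advance a position past one character.
-- A newline starts a new line; every other character (tabs included)
-- counts as one column.
advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

-- Position of the character at a 0-based offset into the input.
posAt :: String -> Int -> Pos
posAt input offset = foldl advance (Pos 1 1) (take offset input)

-- Build the error message in the format I would like to see,
-- given the position and text of the offending token.
parseErrorMessage :: Pos -> String -> String
parseErrorMessage (Pos l c) lexeme =
  "serialize: parse error at line " ++ show l
    ++ " and column " ++ show c
    ++ ". Error occurred at " ++ lexeme
```

In the first erroneous header, the opening quote of "John Doe" is the 22nd character of the line (offset 21), which is where the column 22 in the desired message comes from.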
Would you please show how to implement the lexer, parser, and serializer?