
How do I use the Mail gem for Ruby to extract the original message HTML content/text content from a forwarded email?

So far, all the examples I have seen deal with extracting content from replies (not forwards), which is much easier because you can key on the "--reply above this line--" marker in the message.

In my case, people forward confirmation emails to me, similar to how TripIt parses flight itineraries from emails sent by many different airlines.

The problem is that the message has a complex hierarchy of "parts", with parts containing other parts. I am trying to come up with a reliable way to find the original HTML in the raw source of a forwarded email so that I can parse it and extract information.

require 'mail'

m = Mail.read('raw.txt')

m.parts
m.parts.first.parts
m.parts.last.parts.first.parts # never ending....
the Tin Man
Verty00
  • Please read "[mcve](https://stackoverflow.com/help/minimal-reproducible-example)". We need minimal code from you to test for the problem. That also means the minimal input data and the expected output. Without that you're asking us to devise a test suite for you, which may, or may not solve the problem you're having. In other words we'll be guessing and wasting our time, so we need you to help us help you by narrowing down the situation. – the Tin Man Oct 22 '19 at 23:49

1 Answer


Here's what I have done in the past: recursively collect every HTML body in the message and take the largest one. This will probably break with multi-level forwards, but in our case it only needs to handle one level of forwarding, and so far it works great.

It's unfortunate what state Stack Overflow is in these days, thanks to stupid votes to close on every single question, including ones that IMO are legitimate. Do people really expect you to dump 5000 lines of HTML into your question? It's quite obvious what you're asking.

require 'mail'

module EmailProcessor
  class Parser
    # InboundEmail is an application model that stores the raw Postmark payload.
    def initialize(email)
      @email = email
      raise 'must be initialized with type InboundEmail' unless @email.instance_of?(InboundEmail)
    end

    # Returns the largest HTML body found in the raw email, or nil if there is none.
    def execute
      mail = Mail.read_from_string(@email.postmark_raw['RawEmail'])
      find_original_html(mail)
    end

    private

    # Sorts the collected HTML bodies by size, largest first, and returns the first.
    def find_original_html(mail)
      bodies = recurse_parts(mail.parts)
      sorted = bodies.sort_by { |b| -b.size }
      puts "PARSED #{sorted.size} BODIES: #{sorted.map(&:size)}"
      sorted.first
    end

    # Walks the MIME tree depth-first and collects every decoded text/html body.
    def recurse_parts(parts)
      bodies = []
      parts.each do |part|
        if part.multipart?
          bodies += recurse_parts(part.parts)
        elsif part.content_type =~ /text\/html/
          bodies << part.body.decoded
        end
      end
      bodies
    end
  end
end
Tallboy