
I am trying to get the contents of an email that was originally sent to a user and then forwarded to my account. I am using the Gmail API to do so.

But I am unable to filter out the extra text that comes with it. This is my code to fetch one email at a time by its message ID.

import base64

import lxml.html
from bs4 import BeautifulSoup

myid = "18147a05aba83e45"
# `service` is the authenticated Gmail API client; the code that builds it is omitted here.
txt = service.users().messages().get(userId='me', id=myid).execute()

try:
    if 'INBOX' in txt['labelIds']:
        payload = txt['payload']
        parts = payload.get('parts')[0]
        data = parts['body']['data']
        # The body is base64url-encoded; map it back to the standard alphabet.
        data = data.replace("-", "+").replace("_", "/")
        decoded_data = base64.b64decode(data)
        soup = BeautifulSoup(decoded_data, "lxml")
        body = soup.body()
        body = str(body)
        body = body[1:-1]
        my_msg = lxml.html.fromstring(body).text_content()
        print(my_msg)
except Exception as e:
    print('exception', e)

This is the email sent to the user.

Hshhsjshshhsbsbhs d dddd
1.5.5.5
2. D d. Sbsh

This is the output from my code:

-----Forwarded message-----
From: NiKHiL
Date: Mon, Jul 11, 2022, 12:42 PM
Subject: Oeiginal sub
To: nikhilchauhanxd@gmail.com

Hshhsjshshhsbsbhs d dddd
1.5.5.5
2. D d. Sbsh
,
Date: Mon, Jul 11, 2022, 12:42 PM
Subject: Oeiginal sub
To: nikhilchauhanxd@gmail.com

Hshhsjshshhsbsbhs d dddd
1.5.5.5
2. D d. Sbsh
,

Hshhsjshshhsbsbhs d dddd
1.5.5.5
2. D d. Sbsh

As you can see in the output, the original message is repeated several times and comes with extra text.

Any help will be appreciated. Thanks in advance.

NiKHiL

1 Answer


Unfortunately, it doesn't look like there's a guaranteed way to do this. The Gmail API Message object doesn't include a way to discern between the original message and forwarded copies. As far as it's concerned it's just a single block of text.

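You could try to separate the original message manually. For example, in your output each forwarded copy is separated by a comma on its own line. Gmail doesn't add this, so I'm guessing it comes from BeautifulSoup: soup.body() is shorthand for soup.body.find_all(), which returns a list of every tag inside the body, and converting that list to a string joins the items with commas. You can run something like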

split = your_output.split("\n,\n\n")
original_msg = split[-1] 

Based on your sample, this would return only the original message. Of course, there could be issues if BeautifulSoup doesn't always separate the messages with commas, or if the message itself contains a comma on its own line.
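A slightly more tolerant version of the same idea is sketched below. It assumes only that the copies are separated by a line holding nothing but a comma, and it does not depend on the exact number of newlines around it. The function name `last_forwarded_chunk` and the sample string are made up for illustration.

```python
import re


def last_forwarded_chunk(text: str) -> str:
    """Return whatever follows the last line that contains only a comma."""
    # (?m) makes ^ and $ match at every line boundary, so the separator is
    # any line made of a single comma plus optional spaces or tabs.
    chunks = re.split(r"(?m)^[ \t]*,[ \t]*$", text)
    return chunks[-1].strip()


# Hypothetical output shaped like the one in the question.
sample = (
    "-----Forwarded message-----\n"
    "Subject: test\n\n"
    "Hello\n1.5.5.5\n,\n"
    "Subject: test\n\n"
    "Hello\n1.5.5.5\n,\n\n"
    "Hello\n1.5.5.5\n"
)
print(last_forwarded_chunk(sample))
```

The same caveat applies: a message whose own text has a line with just a comma would be cut at that line.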

Another possibility is to get the second part of the payload instead of the first: parts = payload.get('parts')[1]. This returns the email in HTML format, and a forwarded message looks roughly like this:

<div dir="ltr">
    This is some added text<br />
    <br />
    <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">
            ---------- Forwarded message ---------<br />
            From: <strong class="gmail_sendername" dir="auto">Some user</strong> <span dir="auto">&lt;<a href="mailto:someemail@example.com">someemail@example.com</a>&gt;</span><br />
            Date: Mon, Jul 25, 2022 at 11:42 AM<br />
            Subject: Fwd: test<br />
            To: Some user &lt;<a href="mailto:someemail@example.com">someemail@example.com</a>&gt;<br />
        </div>
        <br />
        <br />
        <div dir="ltr">
            <br />
            <br />
            <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">
                    ---------- Forwarded message ---------<br />
                    From: <strong class="gmail_sendername" dir="auto">Some user</strong> <span dir="auto">&lt;<a href="mailto:someemail@example.com" target="_blank">someemail@example.com</a>&gt;</span>
                    <br />
                    Date: Mon, Jul 25, 2022 at 11:42 AM<br />
                    Subject: test<br />
                    To: Some user &lt;<a href="mailto:someemail@example.com" target="_blank">someemail@example.com</a>&gt;<br />
                </div>
                <br />
                <br />
                <div dir="ltr">this is a message that will be forwarded</div>
            </div>
        </div>
    </div>
</div>

As you can see, the original message is enclosed in the deepest <div dir="ltr"> tag, so you can extract the text from that tag to get the original message.
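One way to pick out that deepest tag is sketched below. To stay dependency-free it uses Python's standard `html.parser` rather than BeautifulSoup, so treat it as an illustration of the idea and not a drop-in replacement. It assumes the structure shown above: the "Forwarded message" headers are also `dir="ltr"` but carry the `gmail_attr` class, so they are skipped. The class name `DeepestLtrDiv` and the function `original_message` are made up for this example.

```python
from html.parser import HTMLParser


class DeepestLtrDiv(HTMLParser):
    """Collect the text of the most deeply nested <div dir="ltr">."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # one [is_candidate, text_chunks] entry per open <div>
        self.best_depth = 0
        self.best_text = ""

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: header blocks are marked with class "gmail_attr".
        is_candidate = attrs.get("dir") == "ltr" and "gmail_attr" not in classes
        self.open_divs.append([is_candidate, []])

    def handle_data(self, data):
        # Text counts towards every <div> that is currently open.
        for _, chunks in self.open_divs:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.open_divs:
            return
        depth = len(self.open_divs)
        is_candidate, chunks = self.open_divs.pop()
        if is_candidate and depth > self.best_depth:
            self.best_depth = depth
            # Collapse runs of whitespace left over from the HTML layout.
            self.best_text = " ".join("".join(chunks).split())


def original_message(html: str) -> str:
    parser = DeepestLtrDiv()
    parser.feed(html)
    parser.close()
    return parser.best_text


# Trimmed-down version of the forwarded-message HTML shown above.
sample_html = (
    '<div dir="ltr">This is some added text<br />'
    '<div class="gmail_quote">'
    '<div dir="ltr" class="gmail_attr">---------- Forwarded message ---------</div>'
    '<div dir="ltr"><div class="gmail_quote">'
    '<div dir="ltr" class="gmail_attr">---------- Forwarded message ---------</div>'
    '<div dir="ltr">this is a message that will be forwarded</div>'
    '</div></div></div></div>'
)
print(original_message(sample_html))
```

If two candidate tags sit at the same depth, this sketch keeps the first one it finishes reading, which may or may not be what you want for a given message.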

However, there's yet another issue. This works if the message was forwarded without any changes, but if a user added text when forwarding, that text will appear at the beginning of that user's section. I don't know if this matters to you, but you may have to take it into account as well. Other factors may also change the structure of the message so that your usual parsing still fails. All in all, depending on how complex your messages get, this may not be easy or even possible, but maybe this gives you a general idea.
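Since the structure can vary, relying on the HTML part always being at index 1 may be fragile too. A possible alternative, sketched below, is to search the payload for the part whose `mimeType` is `text/html` and decode its base64url `body.data`. The helper names `find_part` and `decode_body` and the sample payload are hypothetical; the padding step is only a precaution in case the data arrives without trailing `=` characters.

```python
import base64


def find_part(payload: dict, mime_type: str):
    """Depth-first search for the first part with the given MIME type."""
    if payload.get("mimeType") == mime_type:
        return payload
    for part in payload.get("parts", []):
        found = find_part(part, mime_type)
        if found is not None:
            return found
    return None


def decode_body(part: dict) -> str:
    """Decode the base64url-encoded body data of a message part."""
    data = part["body"]["data"]
    # Restore any missing padding before decoding.
    data += "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(data).decode("utf-8", errors="replace")


# Hypothetical payload shaped like the one in a messages.get response.
html = '<div dir="ltr">hello</div>'
payload = {
    "mimeType": "multipart/alternative",
    "parts": [
        {"mimeType": "text/plain",
         "body": {"data": base64.urlsafe_b64encode(b"hello").decode()}},
        {"mimeType": "text/html",
         "body": {"data": base64.urlsafe_b64encode(html.encode()).decode().rstrip("=")}},
    ],
}

html_part = find_part(payload, "text/html")
if html_part is not None:
    print(decode_body(html_part))
```

With the real API you would pass `txt['payload']` from the question's code instead of the sample dictionary.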

Daniel
  • It's a single comma surrounded by `\n` newlines, which makes it rarer, but yes, that's more or less my point. The Gmail API doesn't track these forwards so it comes down to pretty much looking for patterns in the email body to try to figure out the original one. – Daniel Jul 27 '22 at 18:31