
I'm trying to import an HTML table from an email into Python.

I tried the following script:

from imapclient import IMAPClient
from bs4 import BeautifulSoup
import pandas as pd

HOST = 'imap.gmail.com'
USERNAME = username
PASSWORD = password
ssl = True

server = IMAPClient(HOST, use_uid=True, ssl=ssl)
server.login(USERNAME, PASSWORD)

select_info = server.select_folder('INBOX')
messages = server.search(['FROM', sender_address])

if len(messages) > 0:
    for mail_id, data in server.fetch(messages, ['ENVELOPE', 'BODY[TEXT]']).items():
        envelope = data[b'ENVELOPE']
        body = data[b'BODY[TEXT]']

soup = BeautifulSoup(body, 'html.parser')
table = soup.find_all('table')
df = pd.read_html(str(table))[0]

The script works fine, but some random "=" characters and a "<= /td>" fragment get inserted within the table. This is a sample of the dataframe output, with the errors highlighted in yellow: [screenshot]

This is a sample of the original email table: [screenshot]

I think the error lies in the IMAPClient commands (and not in the BeautifulSoup parsing or pandas), because when I inspect the HTML in the `body` variable the errors are already there.

What am I doing wrong? Thanks

  • You are not MIME-decoding the message. Emails are encoded for transport. These extra = are part of the quoted-printable encoding that's applied to fit the standard email limits of 7-bit ASCII and 78-character lines. See the email/MIME modules in Python, or maybe IMAPClient has helper methods to do the parsing and decoding for you. – Max Dec 01 '21 at 17:35
  • Try my https://github.com/ikvk/imap_tools, let me know results – Vladimir Dec 02 '21 at 05:08
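As Max's comment explains, the stray `=` characters are quoted-printable soft line breaks. A minimal sketch of what decoding does, using the stdlib `quopri` module (the `raw` bytes here are a made-up stand-in for what `BODY[TEXT]` returns):

```python
import quopri

# Quoted-printable transport encoding inserts "=" + CRLF soft line
# breaks to keep lines under 78 characters, which is where the stray
# "=" and "<= /td>" artifacts come from.
raw = b'<table><tr><td>long cell conte=\r\nnt</td></tr></table>'

# quopri.decodestring removes the soft line breaks and decodes any
# "=XX" hex escapes back to their original bytes.
decoded = quopri.decodestring(raw)
# decoded == b'<table><tr><td>long cell content</td></tr></table>'
```

In practice you would let the `email` module pick the right decoder from the `Content-Transfer-Encoding` header rather than calling `quopri` directly, which is what the accepted answer below does via `get_payload(decode=True)`.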

1 Answer


In case anyone needs it, the following code worked fine:

import imaplib
import email

from bs4 import BeautifulSoup
import pandas as pd

imap = imaplib.IMAP4_SSL('imap.gmail.com')
imap.login(username, password)
imap.select("inbox")

resp, items = imap.search(None, '(FROM "xxxx@xxxxxxxxx.com")')

for n, num in enumerate(items[0].split(), 1):
    resp, data = imap.fetch(num, '(RFC822)')

    body = data[0][1]
    msg = email.message_from_bytes(body)
    # decode=True undoes the quoted-printable transport encoding
    content = msg.get_payload(decode=True)

    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find_all('table')
    df = pd.read_html(str(table))[0]
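One caveat the answer doesn't cover: for multipart messages, `get_payload(decode=True)` on the top-level message returns `None`, so the HTML part has to be located first. A hedged sketch of a helper (the name `html_payload` is my own) that handles both cases:

```python
import email


def html_payload(msg):
    """Return the decoded HTML body of a message, or None if absent.

    For multipart messages, get_payload(decode=True) on the container
    returns None, so we walk the parts looking for text/html instead.
    """
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/html':
                return part.get_payload(decode=True)
        return None
    return msg.get_payload(decode=True)
```

With this in place, `content = html_payload(msg)` is a drop-in replacement for the `msg.get_payload(decode=True)` line above and also works for `multipart/alternative` emails that carry a plain-text part alongside the HTML.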