I'm trying to import an HTML table from an email into Python.
I tried the following script:
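from imapclient import IMAPClient
from bs4 import BeautifulSoup
import pandas as pd

HOST = 'imap.gmail.com'
USERNAME = username  # placeholder: account name
PASSWORD = password  # placeholder: account password
ssl = True

server = IMAPClient(HOST, use_uid=True, ssl=ssl)
server.login(USERNAME, PASSWORD)
select_info = server.select_folder('INBOX')

# sender_address is a placeholder for the address to filter on
messages = server.search(['FROM', sender_address])
if len(messages) > 0:
    for mail_id, data in server.fetch(messages, ['ENVELOPE', 'BODY[TEXT]']).items():
        envelope = data[b'ENVELOPE']
        body = data[b'BODY[TEXT]']
        # Parse the raw body and load the first table into a dataframe
        soup = BeautifulSoup(body)
        table = soup.find_all('table')
        df = pd.read_html(str(table))[0]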
The script runs without errors, but stray "=" characters and a broken "<= /td>" tag appear inside the table.
This is a sample of the dataframe output, with the errors highlighted in yellow:
This is a sample of the original email table:
I think the problem lies in the IMAPClient fetch (and not in the BeautifulSoup parsing or in pandas), because the errors are already present when I inspect the HTML in the "body" variable.
What am I doing wrong? Thanks
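One possible explanation, offered as an assumption rather than a confirmed diagnosis: the symptoms look like quoted-printable transfer encoding, where long lines are wrapped with a trailing "=" and a literal "=" is escaped as "=3D". The sketch below uses only the standard library. The `raw_body` and `raw_message` samples and the `html_from_message` helper are hypothetical and stand in for the bytes the server would return.

```python
import quopri
from email import message_from_bytes, policy

# Hypothetical raw body, as it might arrive in BODY[TEXT] when the part is
# quoted-printable: soft line breaks end in "=" and "=" is escaped as "=3D".
raw_body = (
    b'<table border=3D"1"><tr><td>alpha</td><td>be=\r\n'
    b'ta</td><td>gamma<=\r\n'
    b'/td></tr></table>'
)

# Option 1: decode the body directly (assumes quoted-printable and UTF-8).
decoded_html = quopri.decodestring(raw_body).decode("utf-8")

# Hypothetical full message, headers included, so the encoding and charset
# can be read from the headers instead of being assumed.
raw_message = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n" + raw_body
)


def html_from_message(raw: bytes) -> str:
    """Return the decoded HTML part of a raw RFC 822 message."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        raise ValueError("message has no HTML part")
    # get_content() undoes the transfer encoding and decodes the charset.
    return part.get_content()


print(decoded_html)
print(html_from_message(raw_message))
```

If this assumption holds, the second option is probably the more robust one, since it needs the whole message (headers included) rather than only the text section, and then picks the encoding from the headers. The decoded string could then be passed to BeautifulSoup in place of the raw "body".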