2

I'm trying to extract a single piece of data from HTML files using Beautiful Soup 4, but I run into a problem with MHTML files: the HTML parser does not work on them. So I need to either convert the MHTML files to plain HTML or load them some other way, because the main goal is just to extract one value.

Can anyone help me do this in Python? (I already know I can convert the files with MS Word, but I want to do it automatically from a Python program.)

Danny Lim
  • What have you tried so far? Can you post some code? – Cartucho Jan 09 '19 at 19:38
  • Could you just give me a hint for this? I'm a beginner in Python so far. – Danny Lim Jan 09 '19 at 22:24
  • Well, if you just want a hint, then based on a quick search an MHTML file is formatted as a MIME HTML email, so I would imagine that you would first parse it as such, extract the HTML portion of it (Wikipedia says it's normally the second part after the header), and then parse the HTML portion with bs4. – Reid Ballard Jan 11 '19 at 14:00

1 Answer

1

There's a repo on GitHub named MHTifier that is worth a look. The code is written in Python 2, and it's readable and well commented. It's still a work in progress, but it can be a good starting point.
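
If you only need the HTML portion, the MIME approach suggested in the comments can probably be done with the standard library alone. Here is a minimal Python 3 sketch, not taken from MHTifier. The function names `extract_html` and `extract_html_from_file` are made up for illustration, and it assumes the file is a well-formed MIME message whose first `text/html` part is the page you want.

```python
import email
from email import policy


def extract_html(mhtml_bytes):
    """Return the first text/html part of an MHTML document as a str.

    Returns None if no text/html part is found.
    """
    # An MHTML file is a MIME message, so the email parser can read it.
    msg = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding
            # (e.g. quoted-printable) and decodes using the declared charset.
            return part.get_content()
    return None


def extract_html_from_file(path):
    """Hypothetical convenience wrapper: read a file and extract its HTML."""
    with open(path, "rb") as f:
        return extract_html(f.read())
```

The returned string can then be passed to Beautiful Soup as usual, for example `BeautifulSoup(html, "html.parser")`. Files saved by some browsers may contain several HTML parts (frames, for instance), in which case you would need to pick the right one, for example by checking the `Content-Location` header.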

ResilientBit