How to Download webpage as .mhtml

Question

I am able to successfully open a URL and save the resultant page as a .html file. However, I am unable to determine how to download and save a .mhtml (Web Page, Single File).

My code is:

import time
import urllib.parse
import urllib.request

url = 'https://www.example.com'

# Percent-encode the whole URL so it can be passed as a query parameter
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)

base_url = "https://translate.google.co.uk/translate?sl=auto&tl=en&u="
translation_url = base_url + encoded_url
print(translation_url)

req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)

response = urllib.request.urlopen(req)
time.sleep(15)
print(response)

webContent = response.read()
print(webContent)

# The with block closes the file (the original f.close lacked parentheses)
with open('GoogleTranslated.html', 'wb') as f:
    f.write(webContent)
    print(f)

I have tried to use wget, following the details in this question: How to download a webpage (mhtml format) using wget in python, but the details are incomplete (or I am simply unable to understand them).

Any suggestions would be helpful at this stage.

I could not work out how to apply the command-line options from the wget question I referenced to the wget module as it is used in Python. I was able to download an HTML file with wget using: import wget; wget.download("http://www.example.com", "test.html") — Ghulam, Feb 22 '20 at 12:42
The linked question's only answer shows how to download a page tree, but doesn't show how to save it as `.mhtml`. I don't think there's a way to do that with `wget` but it should not be hard to do with Python once you understand the format. Basically, create an `email.message.EmailMessage` and `attach` each downloaded page element. — tripleee, Feb 22 '20 at 12:45
@tripleee - I should point out that I have used the browser-based "Save As" option, and the only option that gives me a truly 'offline' version of the page is "Web Page, Complete". It would seem that the .mhtml option is also not appropriate. Finally, all this is related to me trying to save the output of a Google Translate request. Will the `email.message.EmailMessage` option you mentioned work in my case? Thanks. — Ghulam, Feb 22 '20 at 13:13
It's the format used as the MHTML container, what you save and how it's useful is up to you. If you want a translation, why do you care about anything else on the page? — tripleee, Feb 22 '20 at 13:39
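
The approach tripleee describes can be sketched roughly as follows. This is only an illustration of the container format, assuming the page and its assets have already been downloaded (for example with `urllib.request`, as in the question). The function name `build_mhtml` and the shape of the `resources` tuples are made up for this sketch, and whether a given browser opens the resulting file has not been verified here.

```python
from email.message import EmailMessage, MIMEPart


def build_mhtml(page_url, html_text, resources=()):
    """Pack a page and its assets into one multipart/related message.

    resources is an iterable of (url, maintype, subtype, data_bytes)
    tuples; that shape is an assumption made for this sketch.
    """
    root = EmailMessage()
    root["MIME-Version"] = "1.0"
    root["Subject"] = page_url
    root.make_related()                  # Content-Type: multipart/related
    root.set_param("type", "text/html")  # media type of the first part

    # First part: the HTML document itself.
    page = MIMEPart()
    page.set_content(html_text, subtype="html")
    page["Content-Location"] = page_url
    root.attach(page)

    # Remaining parts: images, stylesheets, scripts and so on.
    # Each one records the URL it was downloaded from.
    for url, maintype, subtype, data in resources:
        part = MIMEPart()
        part.set_content(data, maintype=maintype, subtype=subtype)
        part["Content-Location"] = url
        root.attach(part)

    return root.as_bytes()
```

The returned bytes could then be written to a file opened in `'wb'` mode with a `.mhtml` extension. The `Content-Location` header on each part is what lets a reader match the references in the HTML to the embedded copies.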

Yabin CHENG · Answer 1 · 2023-03-15T20:25:54.907

Unlike the previous answers, my solution does not rely on simulated mouse or keyboard input. The downloaded .mhtml files can also be stored in any location you choose. I learnt this method from a Chinese blog. The key idea is to use a Chrome DevTools Protocol command.

The code is shown below as an example.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')

# Execute Chrome dev tool command to obtain the mhtml file
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})

# Write the file locally
with open('./store/qq.mhtml', 'w', newline='') as f:   
    f.write(res['data'])

driver.quit()

Hope this helps! You can read more about the Chrome DevTools Protocol here.
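
The same idea can be wrapped in a small helper, sketched below under a few assumptions. The name `capture_mhtml` is made up for this example, the driver is expected to be an already-started Selenium Chrome driver, and UTF-8 is chosen as the file encoding so that the write does not depend on the platform default. The helper also creates the output folder first, since `open()` fails if `./store` does not exist.

```python
from pathlib import Path


def capture_mhtml(driver, url, out_path):
    """Load url in an already-started Chrome driver and save it as MHTML."""
    driver.get(url)
    # Same DevTools command as in the answer; the snapshot text is
    # returned under the 'data' key.
    snapshot = driver.execute_cdp_cmd('Page.captureSnapshot', {})

    out_file = Path(out_path)
    # Create the target folder if it is missing.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    # newline='' writes the line endings exactly as they were returned.
    with open(out_file, 'w', newline='', encoding='utf-8') as f:
        f.write(snapshot['data'])
    return out_file


# Usage (needs selenium and Chrome installed):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   capture_mhtml(driver, 'https://www.example.com', './store/example.mhtml')
#   driver.quit()
```

For the case in the question, the `translation_url` built there could be passed in place of the example URL.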

I think the Chinese blog link is https://www.cnblogs.com/superhin/p/12600358.html. That's a very smart method. — ZIQIANG ZHAO, Aug 14 '23 at 09:40

score 4 · Accepted Answer · answered Feb 22 '20 at 13:33

Did you try using Selenium with a Chrome WebDriver to save the page?

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)


# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')

Worked perfectly. Thank you!! I had to specify the location of chromedriver in the Python script even though I had added it to my PATH. — Ghulam, Feb 22 '20 at 17:01
How do I select the "Save as type": Webpage, Single File (*.mhtml)? — Victor Wang, Jan 26 '21 at 20:14

score 1 · Answer 3 · answered Jun 17 '21 at 11:41

To save the page as MHTML, add the '--save-page-as-mhtml' argument to the Chrome options:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)

helloc

川田侑彌 · Answer 4 · 2023-01-03T13:26:16.440

I am posting this as I wrote it; sorry if anything is wrong.
I wrapped it in a class so you can reuse it. The example is the last three lines of the code below.
You can adjust the sleep durations as you like.
Non-English keyboards such as Japanese and Korean (Hangul) keyboards are also supported, because the filename is pasted from the clipboard rather than typed.

import time  # needed for the time.sleep calls below
import uuid

import chromedriver_binary
from selenium import webdriver
import pyautogui
import pyperclip


class DonwloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    
    def save_page(self, url, filename=None):
        self.get(url)


        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)

        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
            
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        
        
        if self._first_save:
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
            
        pyautogui.hotkey('enter')
        time.sleep(1)


# example
dm = DonwloadMhtml()


dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')         # create file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')                                                                 # file named randomly based on uuid4

python3.8.10
selenium==4.4.3
