7

I am able to successfully open a URL and save the resultant page as a .html file. However, I am unable to determine how to download and save a .mhtml (Web Page, Single File).

My code is:

import time
import urllib.parse
import urllib.request

url = 'https://www.example.com'

# Percent-encode the whole URL so it can be passed as a query parameter
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)

base_url = 'https://translate.google.co.uk/translate?sl=auto&tl=en&u='
translation_url = base_url + encoded_url
print(translation_url)

req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)

response = urllib.request.urlopen(req)
time.sleep(15)
print(response)

webContent = response.read()
print(webContent)

# Write the raw bytes; the with block closes the file when done
with open('GoogleTranslated.html', 'wb') as f:
    f.write(webContent)

I have tried to use wget, following the details in this question: "How to download a webpage (mhtml format) using wget in python", but the details are incomplete (or I am simply unable to understand them).

Any suggestions would be helpful at this stage.

Ghulam
  • What error did you get when using `wget`? – Jongware Feb 22 '20 at 12:18
  • I was unable to determine how to take the syntax (options) provided in the wget case I referenced with wget as it is used in Python. I was able to successfully download a html file using wget using the syntax: import wget wget.download("http://www.example.com", "test.html") – Ghulam Feb 22 '20 at 12:42
  • The linked question's only answer shows how to download a page tree, but doesn't show how to save it as `.mhtml`. I don't think there's a way to do that with `wget` but it should not be hard to do with Python once you understand the format. Basically, create an `email.message.EmailMessage` and `attach` each downloaded page element. – tripleee Feb 22 '20 at 12:45
  • @tripleee - I should point out that I have used the browser based "Save As" option and the only options which provides me with a truly 'offline' version of the page is "Web Page, Complete". It would seem that .mhtml option is also not appropriate. Finally, all this is related to me trying to save the output of a google translate request. Will the `email.message.EmailMessage` option you mentioned work in my case? Thanks. – Ghulam Feb 22 '20 at 13:13
  • It's the format used as the MHTML container, what you save and how it's useful is up to you. If you want a translation, why do you care about anything else on the page? – tripleee Feb 22 '20 at 13:39
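
The container approach tripleee describes in the comments could be sketched roughly as follows. This is only a minimal illustration using the standard library `email` package: the function name `build_mhtml`, the `resources` mapping and the `title` parameter are assumptions made up for this sketch, and it does not download anything itself. Whether a particular browser opens the result may depend on details (extra headers, for instance) that are not covered here.

```python
from email.message import EmailMessage, MIMEPart


def build_mhtml(page_url, html, resources, title="Saved page"):
    """Pack a page and its already-downloaded assets into one MIME message.

    resources maps each asset URL to a (mime_type, data_bytes) tuple.
    All names here are hypothetical; this is not an existing library API.
    """
    root = EmailMessage()
    root["Subject"] = title
    # The HTML document becomes the first part of the container.
    root.set_content(html, subtype="html")
    root["Content-Location"] = page_url
    # Convert to multipart/related; the Content-* headers set above
    # move into the new first part together with the HTML payload.
    root.make_related()

    for asset_url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEPart()
        part.set_content(data, maintype=maintype, subtype=subtype)
        # Content-Location ties the part to the URL the HTML refers to.
        part["Content-Location"] = asset_url
        root.attach(part)

    return root.as_bytes()
```

The returned bytes can be written to a file opened in 'wb' mode, in the same way as the .html file in the question.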

4 Answers

8

Unlike the previous answers, my solution does not involve any simulated mouse or keyboard operations. The downloaded MHTML files can also be stored in any location you choose. I learnt this method from a Chinese blog. The key idea is to use a Chrome DevTools command.

The code is shown below as an example.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')

# Execute Chrome dev tool command to obtain the mhtml file
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})

# Write the file locally
with open('./store/qq.mhtml', 'w', newline='') as f:   
    f.write(res['data'])

driver.quit()

Hope this helps! You can read more about the Chrome DevTools Protocol here.
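
If the snapshot step is needed in more than one place, it could be wrapped in a small helper along these lines. This is a sketch under assumptions: `save_mhtml` is a hypothetical name, and `driver` is taken to be any object that exposes `execute_cdp_cmd` the way Selenium's Chrome driver does. It also creates the target folder first, since `open()` fails when the folder does not exist.

```python
from pathlib import Path


def save_mhtml(driver, path):
    """Capture the current page as MHTML and write it to path.

    driver is assumed to expose execute_cdp_cmd, as Selenium's Chrome
    driver does; save_mhtml itself is a hypothetical helper name.
    """
    target = Path(path)
    # Create missing parent folders so open() does not fail.
    target.parent.mkdir(parents=True, exist_ok=True)
    snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # newline="" keeps the line endings exactly as Chrome returned them.
    with open(target, "w", newline="", encoding="utf-8") as f:
        f.write(snapshot["data"])
    return target
```

With a real driver this would be called after `driver.get(...)`, for example `save_mhtml(driver, './store/qq.mhtml')`.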

Yabin CHENG
4

Did you try using Selenium with a Chrome WebDriver to save the page?

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)


# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')

Thaer A
1

To save as MHTML, add the '--save-page-as-mhtml' argument to the Chrome options:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)

helloc
0

I'm posting the code just as I wrote it; sorry if anything is wrong.
I wrapped it in a class so you can reuse it. A usage example is at the end of the code.
You can also change the sleep durations as you like.
Incidentally, because the file name is pasted from the clipboard rather than typed, non-English keyboards such as Japanese and Hangul keyboards are also supported.

import time
import uuid

import chromedriver_binary  # importing this adds the bundled chromedriver to PATH
import pyautogui
import pyperclip
from selenium import webdriver


class DonwloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    
    def save_page(self, url, filename=None):
        self.get(url)


        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)

        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
            
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        
        
        if self._first_save:
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
            
        pyautogui.hotkey('enter')
        time.sleep(1)


# example
dm = DonwloadMhtml()


dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')         # create file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')                                                                 # file named randomly based on uuid4

python3.8.10
selenium==4.4.3
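
Since the keyboard-driven approaches rely on fixed sleeps, one possible refinement is to poll for the saved file instead of guessing a delay. The helper below is only a sketch: `wait_for_file` is a hypothetical name, and the path to watch is an assumption, because where the browser writes the file depends on its download settings.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once path exists with a non-zero size that is unchanged
    between two consecutive polls; return False if timeout expires first."""
    target = Path(path)
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if target.exists():
            size = target.stat().st_size
            # Same non-zero size twice in a row: assume writing has finished.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

It could be called right after the final 'enter' key press, with the expected output path, in place of a long fixed sleep.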