
I have tens of thousands of URLs whose webpages I want to save to my computer.

I'm trying to open and save these webpages using Chrome automated by pywinauto. I'm able to open the webpages using the following code:

from pywinauto.application import Application
import pyautogui

chrome_dir = r'C:\Program Files\Google\Chrome\Application\chrome.exe'

start_args = ' --force-renderer-accessibility --start-maximized https://pythonexamples.org/'
app = Application(backend="uia").start(chrome_dir + start_args)

I want to further send a shortcut to the webpage to save it as MHTML. Ctrl+Shift+Y is the shortcut of a Chrome extension (called SingleFile) that saves a webpage as MHTML. Then I want to close the tab with Ctrl+F4 before opening the next URL and repeating the same process.

However, the keys are not successfully sent to Chrome:

# Send the SingleFile shortcut (Ctrl+Shift+Y)
pyautogui.press(['ctrl', 'shift', 'y'])

# Close the current tab (Ctrl+F4)
pyautogui.press(['ctrl', 'f4'])
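One variant I'm considering is to focus the Chrome window explicitly with pywinauto and send the chords through type_keys(), which holds the modifiers down instead of tapping the keys one at a time (continuing from the snippet above; the title pattern and delay are guesses):

import time

# In type_keys syntax, ^ = Ctrl and + = Shift
dlg = app.window(title_re='.*Chrome.*')
dlg.set_focus()

dlg.type_keys('^+y')    # Ctrl+Shift+Y -> SingleFile save
time.sleep(5)           # guessed delay for SingleFile to write the file
dlg.type_keys('^{F4}')  # Ctrl+F4 -> close the tab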

I'm stuck at this step. What's the right way to do this? I also tried alternatives like Selenium, but it was blocked by the remote server. Thank you!

  • You can do it with pywinauto, but it's not the best way. You should have a look at https://pyppeteer.github.io/pyppeteer/ and https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use – David Pratmarty Feb 02 '21 at 08:28
  • I tried Selenium. The challenge is that the remote server quickly detected that I was using a scraping tool and blocked me. Not sure if puppeteer has the same issue. – Victor Wang Feb 02 '21 at 15:43
  • SingleFile can be run from the CLI (https://github.com/gildas-lormeau/SingleFile/tree/master/cli) and crawl websites; a sketch of driving it from Python follows these comments. – check_ca Feb 04 '21 at 16:07
  • This is great. Thank you very much. – Victor Wang Feb 04 '21 at 16:13
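A minimal sketch of driving the SingleFile CLI from Python, assuming single-file-cli is installed and on the PATH (e.g. via npm install -g single-file-cli); the basic single-file <url> <output> usage follows the linked README, but exact flags can vary by version:

import subprocess

# Placeholder list; in practice this would be the tens of thousands of URLs
urls = ['https://pythonexamples.org/']

for i, url in enumerate(urls):
    # Basic usage per the SingleFile CLI README: single-file <url> <output-file>
    # It writes one self-contained HTML file per page
    subprocess.run(['single-file', url, f'page_{i}.html'], check=True)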

1 Answer


Why are you using Chrome to get the website data? Generally, using an external application directly (i.e., emulating a user) is a horrible and inefficient way to do anything. If your objective is to quickly get and store the data from a website, you should be talking directly to the website, using something like the requests module, which lets you quickly and easily send an HTTP request and get all of the website data. To get MHTML data, you can try something like the sketch below.
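Here is a minimal sketch of that idea, assuming the page is reachable with a plain GET. MHTML is just a MIME multipart/related file, so the standard library's email module can produce one; note that this captures only the main HTML document, not images or stylesheets (the URL and filename are placeholders):

import requests
from email.message import EmailMessage

url = 'https://pythonexamples.org/'  # placeholder URL

# Fetch the page directly; a browser-like User-Agent sometimes helps with
# servers that reject obvious scripted clients
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
resp.raise_for_status()

# MHTML is a MIME multipart/related document, so wrap the HTML in one
msg = EmailMessage()
msg['Snapshot-Content-Location'] = url  # header Chrome uses in its own snapshots
msg['Subject'] = url
msg.add_related(resp.text, subtype='html')

with open('page.mhtml', 'wb') as f:
    f.write(msg.as_bytes())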

  • The website has an anti-scraping mechanism. Common scraping tools did not work. I have to use a browser to make the remote server believe it's a human behind the screen. – Victor Wang Feb 02 '21 at 19:45
  • Ah, I see. Take a look at [this guide to circumvent anti-scraping measures](https://blog.datahut.co/web-scraping-how-to-bypass-anti-scraping-tools-on-websites/). This seems like it would help. – Awesomepotato29 Feb 02 '21 at 20:25
  • Thank you! I'll take a look. – Victor Wang Feb 03 '21 at 17:01