
I have tens of thousands of URLs whose webpages I want to save to my computer.

I'm trying to open and save these webpages using Chrome automated by pywinauto. I'm able to open the webpages using the following code:

from pywinauto.application import Application
import pyautogui

chrome_dir = r'C:\Program Files\Google\Chrome\Application\chrome.exe'

start_args = ' --force-renderer-accessibility --start-maximized https://pythonexamples.org/'
app = Application(backend="uia").start(chrome_dir + start_args)

I want to further send a shortcut to the webpage to save it as MHTML. Ctrl+Shift+Y is the shortcut of a Chrome extension (called SingleFile) that saves a webpage as MHTML. Then I want to close the tab with Ctrl+F4 before opening the next URL and repeating the same process.

However, the keys are not successfully sent to Chrome:

# Send the SingleFile shortcut (Ctrl+Shift+Y)
pyautogui.press(['ctrl', 'shift', 'y'])

# Close the current tab (Ctrl+F4)
pyautogui.press(['ctrl', 'f4'])
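One variant I'm considering is to focus the Chrome window explicitly with pywinauto and send the chords through type_keys(), which holds the modifiers down instead of tapping the keys one at a time (continuing from the snippet above; the title pattern and delay are guesses):

import time

# In type_keys syntax, ^ = Ctrl and + = Shift
dlg = app.window(title_re='.*Chrome.*')
dlg.set_focus()

dlg.type_keys('^+y')    # Ctrl+Shift+Y -> SingleFile save
time.sleep(5)           # guessed delay for SingleFile to write the file
dlg.type_keys('^{F4}')  # Ctrl+F4 -> close the tab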

I'm stuck at this step. What's the right way to do this? I also tried alternatives like Selenium, but it was blocked by the remote server. Thank you!

  • You can do it with pywinauto, but it's not the best way. You should have a look at https://pyppeteer.github.io/pyppeteer/ and https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use – David Pratmarty Feb 02 '21 at 08:28
  • I tried Selenium. The challenge is that the remote server quickly detected that I was using a scraping tool and blocked me. Not sure if puppeteer has the same issue. – Victor Wang Feb 02 '21 at 15:43
  • SingleFile can be run from the CLI (https://github.com/gildas-lormeau/SingleFile/tree/master/cli) and crawl websites; a sketch of driving it from Python follows these comments. – check_ca Feb 04 '21 at 16:07
  • This is great. Thank you very much. – Victor Wang Feb 04 '21 at 16:13
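A minimal sketch of driving the SingleFile CLI from Python, assuming single-file-cli is installed and on the PATH (e.g. via npm install -g single-file-cli); the basic single-file <url> <output> usage follows the linked README, but exact flags can vary by version:

import subprocess

# Placeholder list; in practice this would be the tens of thousands of URLs
urls = ['https://pythonexamples.org/']

for i, url in enumerate(urls):
    # Basic usage per the SingleFile CLI README: single-file <url> <output-file>
    # It writes one self-contained HTML file per page
    subprocess.run(['single-file', url, f'page_{i}.html'], check=True)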

1 Answer


Why are you using Chrome to get the website data? Generally, using an external application directly (i.e., emulating a user) is a horrible and inefficient way to do anything. If your objective is to quickly get and store the data from a website, you should be talking directly to the website, using something like the requests module, which lets you quickly and easily send an HTTP request and get all of the website data. To get MHTML data, you can try something like the sketch below.
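Here is a minimal sketch of that idea, assuming the page is reachable with a plain GET. MHTML is just a MIME multipart/related file, so the standard library's email module can produce one; note that this captures only the main HTML document, not images or stylesheets (the URL and filename are placeholders):

import requests
from email.message import EmailMessage

url = 'https://pythonexamples.org/'  # placeholder URL

# Fetch the page directly; a browser-like User-Agent sometimes helps with
# servers that reject obvious scripted clients
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
resp.raise_for_status()

# MHTML is a MIME multipart/related document, so wrap the HTML in one
msg = EmailMessage()
msg['Snapshot-Content-Location'] = url  # header Chrome uses in its own snapshots
msg['Subject'] = url
msg.add_related(resp.text, subtype='html')

with open('page.mhtml', 'wb') as f:
    f.write(msg.as_bytes())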

  • The website has an anti-scraping mechanism. Common scraping tools did not work. I have to use a browser to make the remote server believe it's a human behind the screen. – Victor Wang Feb 02 '21 at 19:45
  • Ah, I see. Take a look at [this guide to circumvent anti-scraping measures](https://blog.datahut.co/web-scraping-how-to-bypass-anti-scraping-tools-on-websites/). This seems like it would help. – Awesomepotato29 Feb 02 '21 at 20:25
  • Thank you! I'll take a look. – Victor Wang Feb 03 '21 at 17:01