
Apologies if the answer to my question is obvious; I'm a complete newbie to multiprocessing.

I'm trying to write a multiprocess web-scraping script with pathos. I chose pathos because, as far as I understand, unlike Python's standard multiprocessing module, it avoids most pickling issues (it serializes with dill) and doesn't require every function or class to be defined at the top level of a module.

In outline, pseudocode for what I'm trying to do looks like this:

from functools import partial
from pathos.multiprocessing import ProcessPool as Pool
from selenium import webdriver


def get_urls(main_page):
    """Extracts URLs from a website's main page; returns list."""
    return urls


def extract_text(url, web_driver):
    """Takes a URL and a selenium.webdriver instance as arguments;
       extracts text from the URL; returns a string."""
    return text


if __name__ == '__main__':

    MAINPAGE = "http://some/link/for/scraping"
    driver = webdriver.Chrome("path/to/chrome/binary")
    myLinks = get_urls(MAINPAGE)
    pool = Pool(nodes=4)
    part_text = partial(extract_text, web_driver=driver)
    results = pool.map(part_text, myLinks)

    print(results)

However, even though dill is properly installed and _multiprocessing imports without any problems, I always get the following error when I run my code:

Traceback (most recent call last):
   File "D:\Anaconda\lib\site-packages\dill\_dill.py", line 688, in _create_filehandle
   f = open(name, mode)
   OSError: [WinError 6] The handle is invalid

And also:

_pickle.UnpicklingError: [WinError 6] The handle is invalid

Could this be a Windows-specific issue? Unfortunately, although I personally prefer Linux, this script needs to run on a 64-bit Windows 10 machine. I tried both Python 3.6 (Anaconda, 64-bit) and Python 3.7 (32-bit) on two Windows 10 machines and got the same error.

Thanks in advance for any ideas, help, and suggestions.

ntonk
    Well, as far as I understand, it failed because I was trying to share a single selenium.webdriver instance across parallel tasks. It works now that I initialize a separate webdriver instance in each process. – ntonk Nov 28 '18 at 12:10
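
For anyone landing here with the same traceback, the fix described in the comment can be sketched roughly as follows. This is not the asker's actual code, only a minimal illustration under some assumptions: `make_driver`, `scrape_all`, and the `driver_factory` parameter are hypothetical names, chromedriver is assumed to be discoverable by selenium, and for simplicity the sketch starts a driver per task inside the worker rather than caching one per process. The likely cause of the error is that a live driver holds OS-level handles that cannot be recreated in another process, so the driver is created where it is used and only URL strings are sent to the workers.

```python
def make_driver():
    """Start a new Chrome session. Meant to be called inside a worker process."""
    # Imported here so each worker imports selenium itself; assumes selenium
    # and a matching chromedriver are installed on the machine.
    from selenium import webdriver
    return webdriver.Chrome()


def extract_text(url, driver_factory=make_driver):
    """Open a private driver, load `url`, and return the page's visible text."""
    driver = driver_factory()
    try:
        driver.get(url)
        # "tag name" is the locator string behind selenium's By.TAG_NAME.
        return driver.find_element("tag name", "body").text
    finally:
        driver.quit()  # close the browser even if loading or parsing fails


def scrape_all(urls, nodes=4):
    """Map extract_text over `urls` using `nodes` pathos worker processes."""
    from pathos.multiprocessing import ProcessPool
    pool = ProcessPool(nodes=nodes)
    # Only URL strings and a plain function cross the process boundary;
    # no webdriver instance is ever pickled.
    return pool.map(extract_text, urls)
```

As in the question, `scrape_all(myLinks)` would be called under the `if __name__ == '__main__':` guard. Starting a browser for every URL is slow; keeping one driver alive per worker process should be faster, at the cost of a little more bookkeeping to shut the drivers down.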

0 Answers