I apologize if the answer to my question is obvious; I'm a complete newbie to multiprocessing.
I'm trying to write a multiprocess web-scraping script with pathos. I chose pathos because, as far as I understand, unlike Python's standard multiprocessing module, it serializes with dill, which makes it far less prone to pickling issues and doesn't require every function or class to be defined at the top level of a module.
In general, pseudocode for what I'm trying to do looks like this:
from functools import partial

from pathos.multiprocessing import ProcessPool as Pool
from selenium import webdriver


def get_urls(main_page):
    """Extracts URLs from a website's main page; returns list."""
    # extraction logic omitted
    return urls


def extract_text(url, web_driver):
    """Gets a selenium.webdriver instance and a URL as arguments;
    extracts text from this URL; returns string."""
    # extraction logic omitted
    return text


if __name__ == '__main__':
    MAINPAGE = "http://some/link/for/scraping"
    driver = webdriver.Chrome("path/to/chrome/binary")
    myLinks = get_urls(MAINPAGE)
    pool = Pool(nodes=4)
    # Bind the single driver instance so that map() only has to supply the URL
    part_text = partial(extract_text, web_driver=driver)
    results = pool.map(part_text, myLinks)
    print(results)
Nonetheless, even though dill is properly installed and _multiprocessing imports without any problems, I always get the following error when I run my code:
Traceback (most recent call last):
  File "D:\Anaconda\lib\site-packages\dill\_dill.py", line 688, in _create_filehandle
    f = open(name, mode)
OSError: [WinError 6] The handle is invalid
And also:
_pickle.UnpicklingError: [WinError 6] The handle is invalid
Could this be a Windows-specific issue? Unfortunately, although I personally prefer Linux, this script needs to run on a 64-bit Windows 10 machine. I tried both Python 3.6 (Anaconda, 64-bit) and Python 3.7 (32-bit) on two Windows 10 machines and got the same error.
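One possible cause, and this is only an assumption on my part, is that the driver object itself cannot be serialized and sent to the worker processes because it holds open OS-level handles. Below is a minimal sketch of that idea. It uses a hypothetical FakeDriver class (not the real selenium object) that simply keeps a file open, and the standard pickle module rather than dill, so it illustrates the general problem and does not reproduce the exact WinError above:

```python
import os
import pickle
import tempfile


class FakeDriver:
    """Hypothetical stand-in for a webdriver: it only holds an open file handle."""

    def __init__(self, log_path):
        self.log = open(log_path, "w")

    def close(self):
        self.log.close()


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# Create a temporary file for the fake driver to keep open.
fd, path = tempfile.mkstemp()
os.close(fd)

driver = FakeDriver(path)
url_ok = can_pickle("http://some/link/for/scraping")  # plain strings serialize fine
driver_ok = can_pickle(driver)                        # the open file handle does not

driver.close()
os.remove(path)
print(url_ok, driver_ok)
```

If the real driver fails in a similar way, then binding it with partial and handing it to pool.map would be the step that breaks, since the arguments have to be serialized for each worker.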
Thanks in advance for any ideas, help, and suggestions.