This code crawls a particular site, finds company listings, and scrapes their e-mail addresses.
The steps are as follows:

  1. Create an Excel workbook and fill in row 1 (the header row)
  2. Set up Selenium
  3. In a "for" loop, click each item found by XPath, then return to the original page
  4. Find the element with the class name "email" and add its value to the Excel sheet
  5. Skip over errors with try/except
  6. Resize the Excel columns
  7. Use "os.path.join" to build the Excel file path in the same location
  8. Save and quit
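
For step 7, note that a frozen exe may start with a different working directory than the script does. A minimal sketch of one way to resolve the output folder, assuming the PyInstaller convention that `sys.frozen` is set in a bundled app and `sys.executable` points at the exe; the helper names `output_dir` and `output_path` are hypothetical and not part of my script:

```python
import os
import sys


def output_dir():
    """Return the folder where the Excel file should be written."""
    # Assumption: PyInstaller sets sys.frozen in a bundled app, and
    # sys.executable is then the path of the exe itself.
    if getattr(sys, "frozen", False):
        return os.path.dirname(os.path.abspath(sys.executable))
    # When run as a plain script, keep the behaviour used in the post.
    return os.getcwd()


def output_path(file_name):
    """Join the output folder and the file name."""
    return os.path.join(output_dir(), file_name)


print(output_path("company_email.xlsx"))
```

When run as a normal script this gives the same path as `os.path.join(os.getcwd(), file_name)`.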

The code works well as a script. However, when I convert it to an exe with pyinstaller or auto-py-to-exe, I get errors.

The Python version is 3.11.4, and all other packages are recent releases.

  1. While converting to an exe file, the following warning is returned:

2570 WARNING: lib not found: pywintypes311.dll dependency of C:\Users\jih19\AppData\Roaming\Python\Python311\site-packages\win32\win32pdh.pyd

So I located win32pdh.pyd and added the file as a hidden import.
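
As far as I understand, PyInstaller's `--hidden-import` option expects an importable module name rather than the path of a .pyd file. A rough sketch of how the build arguments could be assembled and passed to `PyInstaller.__main__.run`; the script name `scraper.py` and the helper names are hypothetical, and I have not confirmed that these hidden imports remove the warning:

```python
def build_pyinstaller_args(script, hidden_imports=(), onefile=True):
    """Build the argument list for a PyInstaller run."""
    args = [script]
    if onefile:
        args.append("--onefile")
    for module_name in hidden_imports:
        # --hidden-import takes a module name such as "win32pdh",
        # not a file path such as ...\win32\win32pdh.pyd
        args += ["--hidden-import", module_name]
    return args


def run_build(args):
    """Run PyInstaller from Python (requires PyInstaller to be installed)."""
    import PyInstaller.__main__
    PyInstaller.__main__.run(args)


# "scraper.py" is a placeholder for the real script name.
args = build_pyinstaller_args("scraper.py", ["win32pdh", "pywintypes"])
print(args)
```

Calling `run_build(args)` would start the actual build; it is left out here so the sketch only shows the arguments.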

  2. When running the .exe file, the following error occurs:
Traceback (most recent call last):
  File "PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 109, in <module>
  File "PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 19, in _pyi_rthook
ModuleNotFoundError: No module named 'multiprocessing.spawn'; 'multiprocessing' is not a package
[34052] Failed to execute script 'pyi_rth_multiprocessing' due to unhandled exception!

I also located the multiprocessing folder; it is in C:\Users\jih19\AppData\Local\Programs\Python\Python311\Lib, and nothing looks unusual.
I added that path as a hidden import, but I still get the same error.
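
The message "'multiprocessing' is not a package" suggests that whatever gets imported under that name is a plain module rather than the standard-library package. One possible cause, which I have not confirmed, is a file named multiprocessing.py somewhere on the import path. A small diagnostic sketch using `importlib.util.find_spec` (the helper name `describe_module` is hypothetical):

```python
import importlib.util


def describe_module(name):
    """Report where a module would be loaded from and whether it is a package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False, "origin": None, "is_package": False}
    return {
        "name": name,
        "found": True,
        "origin": spec.origin,
        # Packages have submodule search locations; plain modules do not.
        "is_package": spec.submodule_search_locations is not None,
    }


info = describe_module("multiprocessing")
print(info)
```

If `origin` points somewhere other than the Python311\Lib\multiprocessing folder, or `is_package` is False, a different file is being picked up under that name.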

  3. I also tried a virtual environment and looked for other problems, but I really don't know what else I can do.

  4. Lastly, I tried to find which part of the code causes the error. The multiprocessing "no module named" error appears as soon as these imports are present:

from openpyxl import Workbook
from openpyxl.utils import get_column_letter

If I remove those lines, the exe works normally.
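
To narrow this down further, a small probe script could be built into an exe on its own to see which import fails. This is only a sketch; the helper name `probe_imports` is hypothetical, and the module list is just the modules mentioned in this post:

```python
import importlib


def probe_imports(module_names):
    """Try each import and record the error text, or None on success."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


report = probe_imports(["multiprocessing.spawn", "openpyxl", "openpyxl.utils"])
for name, error in report.items():
    print(name, "OK" if error is None else error)
```

Run as a script and then as an exe, the two outputs can be compared line by line.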

Code:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import InvalidSessionIdException, StaleElementReferenceException
from selenium.common import exceptions
from user_agent import generate_user_agent, generate_navigator
import time
import random
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

current_directory = os.getcwd()
wb = Workbook()
ws = wb.active
ws.title = "Seoul"
ws.sheet_properties.tabColor = "d9c2f0"
ws["A1"] = "Company_name"
ws["B1"] = "E-mail"
row_num = 2
column_num = 1
service = Service()
options = Options()
user_agent = generate_user_agent()
user_data = "C:/Users/jih19/Desktop/user_u"
options.add_argument(f"user-agent={user_agent}")
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
url = 'http://www.findjob.co.kr/job/category/areaJob.asp?HidArea=11'
driver.get(url)
#elements2 = driver.find_element(By.XPATH, '/html/body/div[7]/div/button/img')
#elements2.click()
#element1 = driver.find_element(By.XPATH, '//*[@id="gnb_wrap"]/div[1]/ul/li[2]')
#driver.execute_script("arguments[0].click();", element1)
elements = driver.find_elements(By.XPATH, '//*[@id="goods_speed"]/ul/li')

for element1 in elements:
    try:
        random_time = random.random()*1+0.5
        time.sleep(random_time)
        company=element1.find_element(By.XPATH,'./dl/dt/a')
        ws[f"A{row_num}"] = company.text
        company.click()
        old_window_handle = driver.window_handles[0]
        new_window_handle = driver.window_handles[-1]
        random_time = random.random()*1+0.6
        time.sleep(random_time)
        driver.switch_to.window(new_window_handle)
        email = driver.find_element(By.CLASS_NAME, 'email')
        email = email.get_attribute("title")
        ws[f"B{row_num}"] = email
        driver.close()
        random_time = random.random()*1+0.7
        time.sleep(random_time)
        driver.switch_to.window(old_window_handle)
        row_num += 1
        print(company.text, email)
    except InvalidSessionIdException:
        pass
    except NoSuchElementException:
        pass
    except StaleElementReferenceException:
        driver.close()
        random_time = random.random()*1+0.7
        time.sleep(random_time)
        driver.switch_to.window(old_window_handle)
        row_num += 1
        print(company.text)
ws.column_dimensions['A'].width = 30
ws.column_dimensions['B'].width = 30
file_name = "company_email.xlsx"
file_path = os.path.join(current_directory, file_name)
wb.save(file_path)
driver.quit()
moken
mino