This code crawls a particular site, finds the relevant elements, and scrapes them.
The steps are as follows:
- Create an Excel file and fill in row 1
- Set up Selenium
- Click each XPath target in a "for" loop and go back to the original page
- Find the element with the class name "email" and add its value to Excel
- Skip errors with try/except
- Resize the Excel columns
- Use "os.path.join" to build the Excel file path in the same location
- Save and quit
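For the "same location" step, note that os.getcwd() returns the current working directory, which is not necessarily the folder that holds the script or the exe. A minimal sketch of an alternative follows, assuming the goal is to save the workbook next to the running program; the helper names app_directory and output_path are hypothetical and not part of the original script. It relies on the sys.frozen attribute that PyInstaller sets in a bundled app.

```python
import os
import sys


def app_directory() -> str:
    """Return the folder holding the running script or the frozen .exe."""
    if getattr(sys, "frozen", False):
        # PyInstaller sets sys.frozen; sys.executable is then the .exe itself.
        return os.path.dirname(os.path.abspath(sys.executable))
    # Normal interpreter run: use this file's folder, or fall back to the
    # working directory when __file__ is unavailable (e.g. interactive use).
    source = globals().get("__file__")
    return os.path.dirname(os.path.abspath(source)) if source else os.getcwd()


def output_path(file_name: str) -> str:
    """Build an absolute path for file_name inside the app directory."""
    return os.path.join(app_directory(), file_name)


print(output_path("company_email.xlsx"))
```

With this, wb.save(output_path("company_email.xlsx")) would write to the same folder regardless of where the program was launched from.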
The code works well as a script, but when I convert it to an exe with pyinstaller or auto-py-to-exe, I get errors.
The Python version is 3.11.4 and all other packages are recent releases.
- While converting to an exe file, the following warning is returned:
2570 WARNING: lib not found: pywintypes311.dll dependency of C:\Users\jih19\AppData\Roaming\Python\Python311\site-packages\win32\win32pdh.pyd
so I located win32pdh.pyd and added the file as a hidden import.
- Also, when running the .exe file, the following error occurs:
Traceback (most recent call last):
File "PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 109, in <module>
File "PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 19, in _pyi_rthook
ModuleNotFoundError: No module named 'multiprocessing.spawn'; 'multiprocessing' is not a package
[34052] Failed to execute script 'pyi_rth_multiprocessing' due to unhandled exception!
I also found the multiprocessing folder, which is located in C:\Users\jih19\AppData\Local\Programs\Python\Python311\Lib, and nothing about it looks strange.
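The message "'multiprocessing' is not a package" means the name resolved to a plain module rather than to a package with submodules. One possible cause, which is an assumption and not confirmed by anything in this post, is a file named multiprocessing.py somewhere on the import path that shadows the standard library. A minimal diagnostic sketch, run with the same interpreter that is used for the build (describe_module is a hypothetical helper name):

```python
import importlib.util


def describe_module(name: str) -> dict:
    """Report where `name` would be imported from and whether it is a package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False, "origin": None, "is_package": False}
    return {
        "name": name,
        "found": True,
        "origin": spec.origin,
        # Only packages carry submodule search locations (a __path__).
        "is_package": spec.submodule_search_locations is not None,
    }


info = describe_module("multiprocessing")
print(info)
```

If "origin" points anywhere other than the Lib\multiprocessing folder of the Python installation, or "is_package" is False, the import is being resolved to something else.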
I added that path as a hidden import but still get the same error.
I also tried a virtual environment and looked for other causes, but I really don't know what else I can do.
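On the hidden-import attempts: PyInstaller's --hidden-import option expects importable module names, not file or folder paths, and pywintypes311.dll is a DLL rather than a Python module. Whether that warning is related to the runtime error is not established here. As a first step, the sketch below only locates the DLL inside the interpreter's site-packages folders; find_files and site_package_roots are hypothetical helper names. PyInstaller's --add-binary option exists for bundling non-Python binaries, but whether it is needed in this case is an open question.

```python
from pathlib import Path
import site


def find_files(pattern: str, roots: list[str]) -> list[Path]:
    """Recursively collect files matching `pattern` under the existing roots."""
    matches: list[Path] = []
    for root in roots:
        root_path = Path(root)
        if root_path.is_dir():
            matches.extend(sorted(root_path.rglob(pattern)))
    return matches


def site_package_roots() -> list[str]:
    """Global plus per-user site-packages folders of this interpreter."""
    roots: list[str] = []
    if hasattr(site, "getsitepackages"):
        roots.extend(site.getsitepackages())
    roots.append(site.getusersitepackages())
    return roots


for dll in find_files("pywintypes*.dll", site_package_roots()):
    print(dll)
```

An empty result would suggest that pywin32 is not fully installed for the interpreter doing the build.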
Lastly, I tried to find which part of the code causes the error. After searching for things like "Multiprocessing no module named Error", I found that the errors start with these imports:
from openpyxl import Workbook
from openpyxl.utils import get_column_letter
If I remove those lines, it works normally.
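One way to narrow this down further is to build a tiny exe that does nothing but attempt each import separately and report the result. The following is a minimal sketch under that assumption; check_imports is a hypothetical helper, and the module list is only an example based on the names mentioned in this post.

```python
import importlib
import traceback


def check_imports(names):
    """Try each import separately and return {name: None or error text}."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception:
            # Keep only the innermost frame and the exception line.
            results[name] = traceback.format_exc(limit=1)
    return results


report = check_imports(["multiprocessing.spawn", "openpyxl", "openpyxl.utils"])
for name, error in report.items():
    print(name, "OK" if error is None else "FAILED\n" + error)
```

Comparing the output of the plain script with the output of the frozen exe shows which import behaves differently after bundling.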
Code:
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import InvalidSessionIdException, StaleElementReferenceException
from selenium.common import exceptions
from user_agent import generate_user_agent, generate_navigator
import time
import random
from openpyxl import Workbook
from openpyxl.utils import get_column_letter
current_directory = os.getcwd()
wb = Workbook()
ws = wb.active
ws.title = "Seoul"
ws.sheet_properties.tabColor = "d9c2f0"
ws["A1"] = "Company_name"
ws["B1"] = "E-mail"
row_num = 2
column_num = 1
service = Service()
options = Options()
user_agent = generate_user_agent()
user_data = "C:/Users/jih19/Desktop/user_u"
options.add_argument(f"user-agent={user_agent}")
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
url = 'http://www.findjob.co.kr/job/category/areaJob.asp?HidArea=11'
driver.get(url)
#elements2 = driver.find_element(By.XPATH, '/html/body/div[7]/div/button/img')
#elements2.click()
#element1 = driver.find_element(By.XPATH, '//*[@id="gnb_wrap"]/div[1]/ul/li[2]')
#driver.execute_script("arguments[0].click();", element1)
elements = driver.find_elements(By.XPATH, '//*[@id="goods_speed"]/ul/li')
for element1 in elements:
    try:
        random_time = random.random()*1+0.5
        time.sleep(random_time)
        company = element1.find_element(By.XPATH, './dl/dt/a')
        ws[f"A{row_num}"] = company.text
        # Clicking the company link opens its detail page in a new window.
        company.click()
        old_window_handle = driver.window_handles[0]
        new_window_handle = driver.window_handles[-1]
        random_time = random.random()*1+0.6
        time.sleep(random_time)
        driver.switch_to.window(new_window_handle)
        # The e-mail address is stored in the element's "title" attribute.
        email = driver.find_element(By.CLASS_NAME, 'email')
        email = email.get_attribute("title")
        ws[f"B{row_num}"] = email
        driver.close()
        random_time = random.random()*1+0.7
        time.sleep(random_time)
        # Return to the listing page before handling the next item.
        driver.switch_to.window(old_window_handle)
        row_num += 1
        print(company.text, email)
    except InvalidSessionIdException:
        pass
    except NoSuchElementException:
        pass
    except StaleElementReferenceException:
        driver.close()
        random_time = random.random()*1+0.7
        time.sleep(random_time)
        driver.switch_to.window(old_window_handle)
        row_num += 1
        print(company.text)
ws.column_dimensions['A'].width = 30
ws.column_dimensions['B'].width = 30
file_name = "company_email.xlsx"
file_path = os.path.join(current_directory, file_name)
wb.save(file_path)
driver.quit()
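The loop picks the detail window with driver.window_handles[-1], which assumes the newest window is always last in the list; that ordering may not be guaranteed. A minimal sketch of a more explicit approach compares the handles before and after the click. pick_new_window is a hypothetical helper, not a Selenium function.

```python
def pick_new_window(before, after):
    """Return the handle present in `after` but not in `before`, or None."""
    known = set(before)
    for handle in after:
        if handle not in known:
            return handle
    return None


# Intended use inside the loop (driver and company come from the script):
#   before = driver.window_handles
#   company.click()
#   new_window_handle = pick_new_window(before, driver.window_handles)
print(pick_new_window(["main"], ["main", "detail"]))
```

A None result would mean that no new window opened, so the script could skip that item instead of closing the listing window.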