Apologies, I am a beginning at Python and webscraping.
I am web scraping wugniu.com to extract readings for characters that I input. I made a list of 10273 characters to format into the URL and bring up the page with readings, then I used the Requests module to return the source code, then Beautiful Soup to return all the audio ids (as their strings contain the readings for the input characters - I couldn't use the text that comes up in the table as they are svgs). Then I tried to output the characters and their readings to out.txt.
# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
characters = [
#characters go here
]
output = open("out.txt", "a", encoding="utf-8")
tic = time.perf_counter()
for char in characters:
# Characters from the list are formatted into the url
url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
page = requests.get(url, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')
for audio_tag in soup.find_all('audio'):
audio_id = audio_tag.get('id').replace("0-","")
#output.write(char)
#output.write(" ")
#output.write(audio_id)
#output.write("\n")
print(i)
time.sleep(60)
output.close()
toc = time.perf_counter()
duration = int(toc) - int(tic)
print("Took %d seconds" % duration)
out.txt
is the output file I tried to output the results to. I measured the time the process took to measure performance.
However, after 50 or so loops, I get this in the cmd:
Traceback (most recent call last):
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
conn = connection.create_connection(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
raise err
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File"C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen httplib_response = self._make_request(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
conn = self._new_conn()
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
raise NewConnectionError(urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module> page = requests.get(url, verify=False)
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs) File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs) File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs) File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs) File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
I tried to fix this by adding time.sleep(60)
but the errors still happened. When I made this script yesterday, I was able to run it with a list of up to 1500 characters with no errors. Could someone please help me with this? Thanks.