The basic purpose of my script is to filter through a range of numbers (say, 5000); the numbers that are valid are saved to a list called hit_list. The real range I'm looping through is much bigger than 5000, so I need concurrency to make the time manageable.
I don't know the proportion of valid numbers in any given range, so when my (threaded) script returned 9 numbers to hit_list I didn't question it. However, as a final check I ran the script without threads, just like a normal script. It returned 214 numbers to hit_list!
EDIT: To be clear, the problem is that numbers are not being found correctly, rather than not being stored correctly.
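To check the storage side in isolation, here is a minimal sketch that appends to one shared list from 20 threads without making any network calls. The names append_many and threaded_append are made up for this test and are not part of my script, and it assumes CPython, where list.append called from several threads should not lose items.

```python
import concurrent.futures as cf


def append_many(shared, start, count):
    # Each worker appends its own block of consecutive numbers to the shared list.
    for n in range(start, start + count):
        shared.append(n)


def threaded_append(workers=20, per_worker=1000):
    # Hypothetical test harness: every worker writes to the same list object.
    shared = []
    with cf.ThreadPoolExecutor(max_workers=workers) as e:
        futures = [
            e.submit(append_many, shared, w * per_worker, per_worker)
            for w in range(workers)
        ]
        for f in futures:
            f.result()  # re-raises any exception from the worker
    return shared


result = threaded_append()
# Total number of items stored, and number of distinct items stored.
print(len(result), len(set(result)))
```

If both counts equal 20 * 1000, nothing was dropped or duplicated while storing, which points back at how the numbers are being found.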
I've been very generously helped with the construction of this programme, both on SO and on Reddit.
Below is the script with threads. I suspect the problem is something to do with locking (though I was under the impression that concurrent.futures solved this problem automatically) or maybe with the number of workers/chunks. But as I'm sure you can tell by now, I'm a beginner, so it could be anything!
import concurrent.futures as cf
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime
import xlwt
hit_list = []
print('-List Created')
startrange = 100000000
end_range = 100005000
startTime = datetime.now()
print(datetime.now())
url = 'https://ndber.seai.ie/pass/ber/search.aspx'
#print('Working...')
def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'For more information on this data collection please contact #########'
        })
        r = s.get(url)
        time.sleep(.5)
        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate = soup.find('input', {'name': '__VIEWSTATE'}).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation = soup.find('input', {'name': '__EVENTVALIDATION'}).get('value')
        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': viewstategen,
                '__EVENTVALIDATION': validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                #print('Invalid ID Number')
                pass
            else:
                #print('Valid ID Number')
                hit_list.append(ber)
if __name__ == '__main__': #not 100% clear on what exactly this does, but that's a lesson for another day.
    #Using threads to call the function
    workers = 20
    with cf.ThreadPoolExecutor(max_workers=workers) as e:
        IDs = range(startrange, end_range)
        cs = 20
        ranges = [IDs[i+1 :i+cs] for i in range(-1, len(IDs), cs)]
        results = e.map(id_filter, ranges)
#below is code for saving the data to an excel file, I've left it out for parsimony.
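The chunking expression can be checked without touching the server. This is only a sketch: chunk_original copies the slice from the script, while chunk_contiguous is a hypothetical alternative added purely for comparison.

```python
def chunk_original(ids, cs):
    # Same slicing expression as in the script.
    return [ids[i + 1:i + cs] for i in range(-1, len(ids), cs)]


def chunk_contiguous(ids, cs):
    # Hypothetical alternative: back-to-back slices with no gaps.
    return [ids[i:i + cs] for i in range(0, len(ids), cs)]


ids = range(100000000, 100005000)

# Count how many IDs each chunking scheme actually hands to the workers.
covered_original = sum(len(c) for c in chunk_original(ids, 20))
covered_contiguous = sum(len(c) for c in chunk_contiguous(ids, 20))

print(covered_original, covered_contiguous, len(ids))
```

With cs = 20 the original slice yields 19 IDs per chunk plus one empty chunk at the end, so 250 of the 5000 IDs are never requested. That alone would not explain a drop from 214 to 9, but, assuming the unthreaded run loops over the full range, the two runs are not searching exactly the same set.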
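Another thing that can be tested offline is what happens when a worker raises. As far as I understand the concurrent.futures documentation, an exception raised inside a task started through map() is only re-raised when its result is retrieved from the returned iterator, and my script never reads results. The sketch below uses a hypothetical flaky_worker in place of id_filter; whether the real workers actually raise (for example, if a page comes back without the expected hidden fields) is an assumption I have not verified.

```python
import concurrent.futures as cf


def flaky_worker(n):
    # Hypothetical stand-in for id_filter: fails on odd inputs.
    if n % 2:
        raise ValueError(f"worker failed on {n}")
    return n


def run_without_reading(items):
    # Mirrors the script: the iterator returned by map() is never consumed,
    # so exceptions raised inside the workers are never re-raised here.
    with cf.ThreadPoolExecutor(max_workers=4) as e:
        e.map(flaky_worker, items)
    return "finished silently"


def run_and_collect(items):
    # submit() gives one Future per item; result() re-raises the worker's
    # exception, so failures can be recorded instead of disappearing.
    ok, failed = [], []
    with cf.ThreadPoolExecutor(max_workers=4) as e:
        futures = {e.submit(flaky_worker, n): n for n in items}
        for fut in cf.as_completed(futures):
            try:
                ok.append(fut.result())
            except ValueError:
                failed.append(futures[fut])
    return sorted(ok), sorted(failed)


print(run_without_reading(range(6)))
print(run_and_collect(range(6)))
```

The first call completes without any error even though half of the tasks raised. The second returns the successful inputs and the failed inputs separately, which is the kind of bookkeeping that would show whether chunks are dying partway through.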