The basic purpose of my script is to filter through a range of numbers (say, 5000); the numbers that are valid are saved to a list called hit_list. The real range I'm looping through is much bigger than 5000, so I need concurrency to keep the run time manageable.

I don't know the proportion of valid numbers in any given range, so when my (threaded) script returned 9 numbers to hit_list, I didn't question it. However, as a final check, I ran the script without threads, just like a normal script. It returned 214 numbers to hit_list!

EDIT: To be clear, the problem is that numbers are not being found correctly, rather than not being stored correctly.

I've been very generously helped with the construction of this programme, both on SO (here) and on Reddit (here).

Below is the script with threads. I suspect the problem is something to do with locking (though I was under the impression that concurrent.futures handled this automatically), or maybe with the number of workers/chunks. But as I'm sure you can tell by now, I'm a beginner, so it could be anything!

import concurrent.futures as cf
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime
import xlwt

hit_list = []
print('-List Created')
startrange = 100000000
end_range = 100005000
startTime = datetime.now()
print(datetime.now())
url = 'https://ndber.seai.ie/pass/ber/search.aspx'

#print('Working...')

def id_filter(_range):

    with requests.session() as s:
        s.headers.update({
            'user-agent': 'For more information on this data collection please contact #########'
        })

        r = s.get(url)
        time.sleep(.5)


        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate    = soup.find('input', {'name': '__VIEWSTATE'          }).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR' }).get('value')
        validation   = soup.find('input', {'name': '__EVENTVALIDATION'    }).get('value')

        for ber in _range:            


            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE'                                         : viewstate,
                '__VIEWSTATEGENERATOR'                                : viewstategen,
                '__EVENTVALIDATION'                                   : validation,
            }

            y = s.post(url, data=data)

            if 'No results found' in y.text:
                #print('Invalid ID Number')
                pass
            else:
                #print('Valid ID Number')
                hit_list.append(ber)




if __name__ == '__main__': # not 100% clear on what exactly this does, but that's a lesson for another day.

    # Using threads to call the function
    workers = 20
    with cf.ThreadPoolExecutor(max_workers=workers) as e:

        IDs = range(startrange, end_range)
        cs = 20
        ranges = [IDs[i+1 :i+cs] for i in range(-1, len(IDs), cs)]
        results = e.map(id_filter, ranges)
#below is code for saving the data to an excel file, I've left it out for parsimony.
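
EDIT 2: Following a suggestion in the comments to check the slices, the chunking comprehension can be run in isolation, with no HTTP at all. The sketch below only examines slice coverage, not the threading itself: starting range() at -1 and slicing [i+1 : i+cs] produces 19-element chunks, so every 20th ID is never searched. That seems to account for some of the missing hits, though clearly not all of them. Note also that results from e.map is never iterated, so any exception raised inside id_filter (a failed request, say) stays captured in its future and is silently discarded along with that chunk's hits.

IDs = range(100000000, 100005000)
cs = 20

# chunking as written above: range() starts at -1, slices are [i+1 : i+cs]
original = [IDs[i+1: i+cs] for i in range(-1, len(IDs), cs)]
# full-coverage alternative: range() starts at 0, slices are [i : i+cs]
covering = [IDs[i: i+cs] for i in range(0, len(IDs), cs)]

print(sum(len(c) for c in original))   # 4750 -- 250 IDs are never searched
print(sum(len(c) for c in covering))   # 5000 -- the whole range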
  • Are the hits not found correctly or not stored correctly? Can you reproduce the issue without using any HTTP requests but dummies instead? (Hint: You then don't have a minimal example!) BTW: The code isn't indented properly. – Ulrich Eckhardt May 25 '16 at 17:28
  • To expand on @UlrichEckhardt's comment: first verify, by hand if necessary, what the correct answer is. Then verify that your single-threaded version gives that answer. Then run your multi-threaded version. My guess is that any differences seen will be caused by poor handling of multi-writer access to a global. – Peter Rowell May 25 '16 at 18:04
  • @UlrichEckhardt, thanks for your suggestion. I've edited my question to clarify the issue: it is indeed a problem with finding, rather than storing, hits. I checked and confirmed this by following Peter Rowell's advice to manually verify the single-threaded version's results (I did this 200+ times...it took a while). When you say dummies, are you referring to multiprocessing.dummy? I'll look into that now. I've also fixed the indentation issue. – SeánMcK May 26 '16 at 08:18
  • @PeterRowell, I followed your suggestion and manually checked 200+ iterations. The problem is finding numbers: the multi-threaded version doesn't find them for some reason. I think this suggests your theory is wrong, since, if I understand correctly, you thought it might be a problem with several threads accessing `hit_list` at the same time? – SeánMcK May 26 '16 at 08:21
  • @sean_raven: I haven't spent the time to do a deep analysis of your code. I do know that if version A produces correct results, and version B doesn't, then the problem lies somewhere in the differences between A and B. If with 1 thread `id_filter` works OK (but slowly) on the whole file, but using more than 1 thread gives a wrong result, then either the slices you are feeding `id_filter` are not what you think they are, or there is some other aspect of the threading environment causing problems. Does it work better or worse with 2 workers, than 1? With 3 rather than 2? A pattern will emerge. – Peter Rowell May 26 '16 at 21:24
  • Hi Peter, I appreciate you coming back to me on this. I spent all day yesterday rebuilding the code from the ground up. I think I was able to get a working version. I'm going to spend a few hours testing today, hopefully it'll do the job though! Thanks again (I'll obviously post the correct code here, once I've confirmed) – SeánMcK May 27 '16 at 07:26
  • What's supposed to communicate `hit_list` back to the main program? `map` normally operates on functions that return something. 9/214≈1/20 hits with 20 workers is ballpark for one of the workers running in the main context. Incidentally, that's why the `if __name__=='__main__'` is necessary; the other workers import the module on win32 (on posix they fork). – Yann Vernier May 27 '16 at 09:12
  • @YannVernier funnily enough, I read about the need to `return` something while threading just yesterday. But my answer below seems to work fine without returning anything. Thoughts? – SeánMcK May 27 '16 at 09:19
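
(An aside illustrating Yann Vernier's point, combined with Ulrich Eckhardt's dummy-test idea: a minimal sketch of the return-value pattern. The ber % 23 test is an arbitrary stand-in for the HTTP check; nothing here is from the original script. Because each call returns its hits, no shared list or locking is needed.)

import concurrent.futures as cf

def id_filter(_range):
    # dummy validity test standing in for the POST + 'No results found' check
    return [ber for ber in _range if ber % 23 == 0]

if __name__ == '__main__':
    IDs = range(100000000, 100005000)
    cs = 20
    ranges = [IDs[i: i+cs] for i in range(0, len(IDs), cs)]
    with cf.ThreadPoolExecutor(max_workers=20) as e:
        # each call returns its own hits; flatten them in the main thread
        hit_list = [ber for chunk in e.map(id_filter, ranges) for ber in chunk]
    print('hits:', len(hit_list))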

1 Answer


This solution is a re-written, rather than debugged, version of my original code. Rebuilding from the ground up was both easier and more educational for me, but I can't tell you what went wrong with the original code.

The script seems to work well: it picks up the magic number of 214 hits in a loop of 5000.

Although it does answer my question, the code is far from perfect and I'm happy to accept answers that build/improve upon it.

Since I'm not sure what the SO etiquette for asking questions within answers is, I'm framing the following as optional discussion points (rather than questions!):

  1. Thoughts on the requests add-on: it seems to be a much more straightforward version of what I've tried to do?
  2. Best way to optimise the efficiency of threads? Are there any rules of thumb regarding the ratio of workers to chunks, or is it just trial and error? (A timing sketch follows the solution below.)
  3. Best way to raise exceptions while using requests & threading? (A sketch follows this list.)
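
On point 3, a sketch (with a dummy worker, not the BER code): an exception raised inside a worker is re-raised in the main thread when the corresponding result is pulled from executor.map, so iterating the results inside try/except surfaces it. Within the worker itself, requests' Response.raise_for_status() turns HTTP error statuses into exceptions.

import concurrent.futures as cf

def check(n):
    # stand-in worker; imagine a requests call raising here instead
    if n == 3:
        raise ValueError('simulated request failure for ID {}'.format(n))
    return n

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=2) as e:
        try:
            for result in e.map(check, range(5)):
                print('processed', result)
        except ValueError as exc:
            print('a worker raised:', exc)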

Working (but imperfect) Solution

import concurrent.futures as cf
import requests
from bs4 import BeautifulSoup
from datetime import datetime



hit_list = []
processed_list = []
startrange = 100000000
end_range = 100005000

loop_size = range(startrange, end_range)
workers = 5
chunks = 20
startTime = datetime.now()



def id_filter(_range):
    url = 'https://ndber.seai.ie/pass/ber/search.aspx'
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'For more information on this data collection please contact ######'
        })    
        r = s.get(url)   

        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate    = soup.find('input', {'name': '__VIEWSTATE'          }).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR' }).get('value')
        validation   = soup.find('input', {'name': '__EVENTVALIDATION'    }).get('value')

        for ber in _range:            
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE'                                         : viewstate,
                '__VIEWSTATEGENERATOR'                                : viewstategen,
                '__EVENTVALIDATION'                                   : validation,
            }        
            y = s.post(url, data=data)            
            if 'No results found' in y.text:
                processed_list.append(ber)
                #print('Invalid ID', ber)
            else:
                hit_list.append(ber)
                processed_list.append(ber)
                print('Valid ID',ber)

if __name__ == '__main__':   
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs= [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]        
        boss.map(id_filter, jobs)

#note: below is a simplified version of what I'm doing in real life. In reality, I print out details similar to these and then save them in a .txt file. I left all that out for parsimony's sake.
print('Details')
run_time = datetime.now() - startTime
print('Start Range',startrange)
print('End Range', end_range)
print('Run Time', run_time) 
print('Number of Hits', len(hit_list))
print('Number of processed IDs', len(processed_list)) 
print('Number of Workers', workers)
print('Number of Jobs:', len(jobs))
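
On point 2 above, I haven't found a rule of thumb beyond trial and error. Peter Rowell's suggestion in the comments amounts to timing a sweep of worker counts; below is a hedged sketch of such a harness, where time.sleep stands in for network latency and fake_request with its arbitrary 23-divisibility test is illustrative, not part of the real script. One pattern it does reveal: the number of chunks caps the useful number of workers.

import concurrent.futures as cf
import time

def fake_request(ber):
    time.sleep(0.01)                 # stand-in for network latency
    return ber % 23 == 0

def timed_run(workers, chunk, ids):
    jobs = [ids[x: x + chunk] for x in range(0, len(ids), chunk)]
    start = time.perf_counter()
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        list(boss.map(lambda job: [b for b in job if fake_request(b)], jobs))
    return time.perf_counter() - start

if __name__ == '__main__':
    ids = range(100000000, 100000200)   # 200 dummy IDs -> 10 chunks of 20
    for workers in (1, 2, 5, 10, 20):
        print(workers, 'workers:', round(timed_run(workers, 20, ids), 2), 's')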
  • I expect this works because you use ThreadPoolExecutor rather than ProcessPoolExecutor (which is more like what multiprocessing does). id_filter still operates by side effect, filling in hit_list and processed_list, instead of returning values that would be collected by map. ThreadPoolExecutor.map even collects and communicates exceptions as part of the returned sequence. – Yann Vernier May 27 '16 at 09:21
  • @YannVernier Not sure I follow. 1) My understanding was that I needed threading, rather than multiprocessing, because this is an I/O-bound programme, not a CPU-bound one. 2) Are you saying this just happens to work because `ThreadPoolExecutor.map` collects exceptions, which in this case would be `hit_list` & `processed_list`? If so, is there anything fundamentally wrong with taking advantage of this flexibility? – SeánMcK May 27 '16 at 09:32
  • 1) Threading works because it shares the Python namespace, so each of your id_filter calls access the same hit_list. With multiprocessing, each worker will have its own. You're also I/O bound so threading doesn't hurt despite Python's global interpreter lock (which specifically exists so the threads can update shared things without breaking). 2) No, the collection of exceptions is just a bonus for debugging, where you can iterate through a pool.map and receive the same exceptions you would with map. – Yann Vernier May 27 '16 at 09:48
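
(To make that last point concrete, a sketch with a dummy validity test, not the real HTTP check: swapping ThreadPoolExecutor for ProcessPoolExecutor silently breaks the side-effect pattern, because each worker process appends to its own copy of the list.)

import concurrent.futures as cf

hits = []

def check(n):
    if n % 23 == 0:          # dummy validity test
        hits.append(n)

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=4) as e:
        list(e.map(check, range(1000)))
    print('threads:', len(hits))     # 44 -- all threads share one list

    hits.clear()
    with cf.ProcessPoolExecutor(max_workers=4) as e:
        list(e.map(check, range(1000)))
    print('processes:', len(hits))   # 0 -- each worker had its own copy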