
I am attempting to create Pool() objects so that I can break down large arrays. However, on every pass through the code below after the first, the map never runs. Only the first pass seems to enter the worker function, even though the arguments are the same size on every pass (and even when running it with the EXACT same arguments). Only the first

job.map(...)

appears to run. Below is the source of my pain (not all the code in the file):

def iterCount():
    # m is not in shared memory, as intended; each process keeps its own copy.
    global m
    m = m + 1
    return m

def thread_search(pair):
    divisor_lower = pair[0]
    divisor_upper = pair[1]
    for i in range(divisor_lower, divisor_upper, window_size):
        current_section = np.array(x[i: i + window_size])
        for row in current_section:
            if row[2].startswith('NP') and checkPep(row[0]):  # checkPep is a simple unique-in-array check.
                # shared_list is a multiprocessing.Manager list.
                shared_list.append(row[[0, 1, 2]])
                m = iterCount()
                if not m % 1000000:
                    print(f'Encountered m = {m}', flush=True)


def poolMap(pairs, group):
    job = Pool(3)
    print('Pool Created')
    print(len(pairs))
    job.map(thread_search, pairs)
    print('Pool Closed')
    job.close()

if __name__ == '__main__':
    for group in [1, 2, 3]:  # Example: groups to be run...
        x = None
        lower_bound = int((group - 1) * group_step)
        upper_bound = int(group * group_step)
        x = list(csv.reader(open(pa_table_name, "rt", encoding="utf-8"), delimiter="\t"))[lower_bound:upper_bound]
        print(len(x))
        divisor_pairs = [[int(lower_bound + (i - 1) * chunk_size), int(lower_bound + i * chunk_size)] for i in range(1, 6143)]
        poolMap(divisor_pairs, group)

The output of this function is:

Program started: 03/09/19, 12:41:25 (Machine Time)
11008256 - Length of the file read in (in the group)
Pool Created
6142 - len(pairs)
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 3000000
Encountered m = 3000000
Encountered m = 3000000 (Total size is ~ 9 million per set)
Pool Closed
11008256 (this is the number of lines read, correct value)
Pool Created
6142 (Number of pairs is correct, though map appears to never run...) 
Pool Closed
11008256
Pool Created
6142
Pool Closed

At this point, the shared_list is saved, and only the first pass's results appear to be present.

I'm really at a loss as to what is happening here, and I've searched for known bugs or similar reports of any of this without luck.

Ubuntu 18.04, Python 3.6
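For reference, here is a stripped-down sketch of the pattern I am using (the `square` worker is a stand-in for `thread_search`, not my real code) — on its own, creating a fresh Pool and calling map once per loop iteration is the intended structure:

```python
from multiprocessing import Pool

def square(n):
    # Trivial stand-in worker; the real code does the array scanning here.
    return n * n

if __name__ == '__main__':
    for group in [1, 2, 3]:
        pool = Pool(3)
        results = pool.map(square, range(5))
        print(results)  # expected on every pass, not just the first
        pool.close()
        pool.join()
```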

  • Are you doing multiprocessing or multithreading? Global variables can't be shared using multiprocessing because each process runs in a separate memory-space — so there's a separate `global m` in each one. There are ways described in the documentation to work around this limitation. – martineau Mar 09 '19 at 13:31
  • I am using the multiprocessing module. I understand that `global m` is unique per process, which is intended. All variables that need to be shared are created using the multiprocessing.Manager submodule, such as `shared_list`. Having shared memory-space variables is not the problem, I believe; rather, the map fails to run a second time. – Alexander Hasson Mar 09 '19 at 22:11
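To illustrate the distinction discussed in the comments above: a plain module-level global diverges per worker process, while a `multiprocessing.Value` passed to each worker through the Pool's `initializer` is genuinely shared. This is a minimal sketch with illustrative names (`init`, `work`), not code from the question:

```python
from multiprocessing import Pool, Value

counter = None  # module-level; each worker process gets its own binding

def init(shared_counter):
    # Runs once in every worker: point the worker's global at the shared Value.
    global counter
    counter = shared_counter

def work(i):
    # Increment the cross-process counter under its built-in lock.
    with counter.get_lock():
        counter.value += 1
    return i

if __name__ == '__main__':
    shared_counter = Value('i', 0)
    with Pool(3, initializer=init, initargs=(shared_counter,)) as pool:
        pool.map(work, range(100))
    print(shared_counter.value)  # 100: all workers incremented the same Value
```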
