I have a program that uses a lot of memory (150 GB of vectors indexed with nmslib) and I am having trouble parallelizing part of it. My machine has 40 cores and my attempts at parallelization have not been successful so far. The program first loads the vectors and then prepares some data based on them (this part is fine and performance is good, because most of the workload is done by nmslib, which is multithreaded itself). The trouble starts when I postprocess the data that nmslib has produced in RAM. I iterate over a list with 500 entries, each representing the data of one file. The routine I use to process this data, and that I am trying to execute in parallel, is the following:
def tib_result_turn_to_file(data):
    fileindex = data[0]
    main_bucket = data[1]
    result_string = ""
    print("Now processing: " + fileindex[0])
    print(abs(fileindex[1] - fileindex[2]))
    # print(len(results))
    c = fileindex[1]   # current position within the main bucket
    c1 = 0             # index into the per-position result lists in data[2]
    while c < fileindex[2]:
        # default word list is the one of the main bucket
        if main_bucket == "tengyur1":
            tibwords = tibwords_tengyur1
        if main_bucket == "tengyur2":
            tibwords = tibwords_tengyur2
        if main_bucket == "kangyur":
            tibwords = tibwords_kangyur
        result_string += "\n" + main_bucket + "#" + fileindex[0] + " " + str(c)
        for result in data[2][c1]:
            bucket = result[2]
            # switch to the word list of the bucket this result comes from
            if bucket == "tengyur1":
                tibwords = tibwords_tengyur1
            if bucket == "tengyur2":
                tibwords = tibwords_tengyur2
            if bucket == "kangyur":
                tibwords = tibwords_kangyur
            result_position = result[1]
            result_score = result[0]
            # we don't want results that score too low or that are identical with the source:
            if result_score < 0.03 and (result_position < c - 20 or result_position > c + 20):
                result_string += "\t" + bucket + "#" + tibwords[result_position][0] + "#" + str(result_position)
        c += 1
        c1 += 1
    with open("/mnt/output_parallel/" + fileindex[0][:-4] + "_parallel_index.org", "w") as text_file:
        text_file.write(result_string)
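For clarity, one entry of that list looks roughly like this (the concrete values are invented, but the structure is what the routine above indexes into):

data = (
    ("D1234.txt", 1000, 1500),        # fileindex: (filename, start position, end position)
    "kangyur",                        # main_bucket
    [                                 # data[2]: one result list per position between start and end
        [(0.01, 4711, "tengyur1"),    # a single result: (score, position, bucket)
         (0.02, 98765, "kangyur")],
        # ... one such list for each of the remaining positions ...
    ],
)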
The lists whose names start with tibwords are huge, each about 50 million entries. They are defined in the parent routine and are not altered inside this routine, so I assume they will not be copied. Each batch of data fed into this routine is not small either; pickled, it would come to roughly 500 MB on average. Since this routine's sole purpose is to produce a side effect by writing a file at the end of its execution, and since it doesn't modify any data that might be shared with other threads, I assumed it should be pretty straightforward to parallelize. However, so far nothing has worked. What I tried:
Parallel(n_jobs=40,backend="threading")(delayed(tib_result_turn_to_file)(i,bucket) for i in files)
This seems to create a lot of threads, but they don't seem to do much; I guess the GIL is getting in the way, and at best one core is used.
Parallel(n_jobs=40)(delayed(tib_result_turn_to_file)(i,bucket) for i in files)
This breaks because it complains about memory use. If I add the option require='sharedmem' it runs, but then it is as slow as the previous attempt. The third solution:
pool = multiprocessing.Pool(processes=4)
pool.map(tib_result_turn_to_file,files,bucket)
pool.close()
fails with an OOM. I do not understand why, though. The data that is accessed inside the routine is all read-only, and even if I reduce the routine to
def tib_result_turn_to_file(data):
    print("Hello world")
Pool still fails with an OOM. Is this because I load the huge indices in the previous section of the program (which at this point are still in memory, but no longer used)? If that is the reason, is there any way to get around this problem? Is my approach right after all? I wonder whether I should split this program into two, do the vector operations with nmslib first and the postprocessing in a second step, but in my opinion that would add a lot of unwanted complexity.
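For reference, the overall structure of the program is roughly this (heavily simplified sketch; the index parameters, the path, and the part that builds files are placeholders, not my real code):

import nmslib
from joblib import Parallel, delayed

# part 1: load the ~150 GB of vectors into nmslib indices (multithreaded, works fine)
index = nmslib.init(method="hnsw", space="cosinesimil")
index.loadIndex("/mnt/indices/kangyur.bin")   # path invented for this sketch

# part 2: query the indices and build `files`, a list of ~500 entries,
# one per input file (also fine, nmslib does most of the work here)
files = []         # in the real program this holds the query results
bucket = "kangyur"

# part 3: the postprocessing step I am trying to parallelize (see the attempts above)
Parallel(n_jobs=40)(delayed(tib_result_turn_to_file)(i) for i in files)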