
I have a Python script that would take ~93 days to complete on 1 CPU, or about 1.5 days on 64.

I have a large file (FOO.sdf) and would like to extract the "entries" from it that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is roughly 600K such blocks. The script I have now is shown below. Is there a way to use multiprocessing or threading to divvy up this task across many cores/CPUs/threads? I have access to a server with 64 cores.

name_list = []
c=0

#Titles of the text blocks I want to extract (of the form [..., '25163208', ...])
with open('Names.txt','r') as names:
    for name in names:
        name_list.append(name.strip())

#Writing the text blocks to this file
with open("subset.sdf",'w') as subset:

    #Opening the large file with many textblocks I don't want
    with open("FOO.sdf",'r') as f:

        #Loop through each line in the file
        for line in f:

            #Skip blank lines so the indexing below doesn't choke on extraneous lines
            if line.split() == []:
                continue

            #Check whether the line's first token matches any name in name_list.
            #The membership test is expensive, so the two cheap conditions come first to short-circuit it.
            if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                c=1 #c=1 marks that we are inside a block that should be written

            #Write this line to output file
            if c==1:
                subset.write(line)

            #Stop writing once the "$$$$" terminator is seen (it was already written above)
            if c==1 and line.split()[0] == "$$$$":
                c=0
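
For illustration, here is a minimal sketch of one way the work could be divided across cores: read FOO.sdf one "$$$$"-terminated entry at a time, test each entry against the name set inside a multiprocessing.Pool, and write matching entries from the parent process. The function names (load_names, read_entries, entry_matches, init_worker) are made up for this sketch, it keeps the same per-line matching rule as the script above, and whether it actually helps depends on how much of the time is spent on disk I/O rather than on matching.

from multiprocessing import Pool

def load_names(path):
    #Read the wanted titles into a set for O(1) membership tests
    with open(path, 'r') as names:
        return {name.strip() for name in names}

def read_entries(path):
    #Yield one "$$$$"-terminated block of lines at a time
    block = []
    with open(path, 'r') as f:
        for line in f:
            block.append(line)
            if line.strip() == "$$$$":
                yield block
                block = []

def init_worker(names):
    #Give every worker process its own copy of the name set
    global name_set
    name_set = names

def entry_matches(block):
    #Return the block if any line's first token is a wanted name, else None
    for line in block:
        tokens = line.split()
        if not tokens:
            continue
        first = tokens[0]
        if ("-" not in first) and (len(first) >= 5) and (first in name_set):
            return block
    return None

if __name__ == "__main__":
    names = load_names('Names.txt')
    with Pool(processes=64, initializer=init_worker, initargs=(names,)) as pool:
        with open("subset.sdf", 'w') as subset:
            #imap keeps memory bounded; chunksize trades IPC overhead for responsiveness
            for block in pool.imap(entry_matches, read_entries("FOO.sdf"), chunksize=100):
                if block is not None:
                    subset.writelines(block)
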
  • Is there a way for processes to talk to each other? I think I could solve this if there were a way for one process to flick a "start writing" switch for the other processes, and then similarly a way for a process to flick a switch saying "Okay, stop". – Jacob Anderson Apr 07 '20 at 17:29
  • For example, if process 15 sees a match, it says "START" to all processes after it. And then once process 165 sees "$$$$", it says "STOP" to all processes. This way only the lines between the match and "$$$$" would be written (?) – Jacob Anderson Apr 07 '20 at 17:34
  • Maybe you should consider making `name_list` a `set` instead of a `list` since finding an element in a `set` is `O(1)` (big Oh) whereas finding an element in a `list` is `O(n)` where `n` is the number of elements in the list. This might give you a sufficient speed-up without using multiprocessing. It might be problematic to use multiprocessing when accessing disk as well, as reading/writing easily becomes a bottleneck, which won't be helped by multiple processes. – JohanL Apr 07 '20 at 17:43
  • That is incredible. About 460X faster. – Jacob Anderson Apr 07 '20 at 17:56
  • If you're concerned about performance, why not split the line once, instead of up to 5 times each iteration? Can you share some example data? It might be possible to use regex or something. – AMC Apr 07 '20 at 19:53
  • I'd love to go a more regex-based route to solve this problem, although just using a set and removing found elements got it down to an hour on 1 CPU. https://docs.google.com/document/d/1erwppFpi-65-dwQgl3dn7HYZ-HpcbcfAk2nAREiTNCs/edit?usp=sharing – Jacob Anderson Apr 08 '20 at 00:35
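
Putting the comment thread together, here is a condensed sketch of the single-process version described above: `name_list` becomes a `set` for O(1) lookups, each line is split only once, and a name is discarded from the set once its block has been found (which assumes each title appears at most once in FOO.sdf). Variable names are illustrative.

name_set = set()

#Titles of the text blocks to extract
with open('Names.txt', 'r') as names:
    for name in names:
        name_set.add(name.strip())

writing = False
with open("subset.sdf", 'w') as subset:
    with open("FOO.sdf", 'r') as f:
        for line in f:
            tokens = line.split()

            #Skip blank lines
            if not tokens:
                continue
            first = tokens[0]

            #Start writing when a wanted title is seen; drop it from the set
            if (not writing) and ("-" not in first) and (len(first) >= 5) and (first in name_set):
                writing = True
                name_set.discard(first)

            #Copy lines up to and including the "$$$$" terminator
            if writing:
                subset.write(line)
                if first == "$$$$":
                    writing = False
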
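
Regarding the earlier comments about processes flicking a "start writing"/"stop" switch for one another: the standard library's multiprocessing.Event is one such shared switch; any process can set() or clear() it and every other process sees the change. This is only a toy illustration of the primitive (the matcher/writer functions and the sleep are made up, and it is not the file-filtering logic itself):

from multiprocessing import Process, Event
import time

def matcher(start_writing):
    #Pretend this process just found a matching title
    time.sleep(1)
    start_writing.set()        #"START": visible to every other process

def writer(start_writing):
    start_writing.wait()       #Block until some process says START
    print("writing now")
    #Calling start_writing.clear() elsewhere would be the "STOP" signal

if __name__ == "__main__":
    start_writing = Event()
    procs = [Process(target=matcher, args=(start_writing,)),
             Process(target=writer, args=(start_writing,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
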

0 Answers