I have a Python script that would take ~93 days to complete on 1 CPU, or about 1.5 days if the work were spread across 64.

I have a large file (FOO.sdf) and would like to extract the "entries" from it that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K such blocks. The script I have now is shown below. Is there a way to use multiprocessing or threading to divvy this task up across many cores/CPUs/threads? I have access to a server with 64 cores.
name_list = []
c = 0

#Titles of the text blocks I want to extract (form: [...,'25163208',...])
with open('Names.txt', 'r') as names:
    for name in names:
        name_list.append(name.strip())

#Writing the matching text blocks to this file
with open("subset.sdf", 'w') as subset:
    #Opening the large file with many text blocks I don't want
    with open("FOO.sdf", 'r') as f:
        #Loop through each line in the file
        for line in f:
            fields = line.split()
            #Skip blank lines to avoid writing extraneous lines or choking
            if not fields:
                continue
            #This checks whether the line's first field matches any name in "name_list".
            #Since I expect the membership test to be expensive, I only perform it
            #after the line passes the two cheap conditions before it.
            if ("-" not in fields[0]) and (len(fields[0]) >= 5) and (fields[0] in name_list):
                c = 1  #when c == 1, lines should be written
            #Write this line to the output file
            if c == 1:
                subset.write(line)
            #Stop writing once we see "$$$$" (that line itself is already
            #written by the block above)
            if c == 1 and fields[0] == "$$$$":
                c = 0
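
For concreteness, here is the kind of split I was imagining: a rough, untested sketch using multiprocessing.Pool. The names read_entries, init_worker, match_entry, and NAME_SET are my own placeholders, and it assumes the matching title line appears somewhere inside each "$$$$"-delimited entry.

import multiprocessing as mp

def read_entries(path):
    #Yield one "$$$$"-delimited entry (a list of lines) at a time
    entry = []
    with open(path, 'r') as f:
        for line in f:
            entry.append(line)
            if line.strip() == "$$$$":
                yield entry
                entry = []
    if entry:  #trailing entry with no final delimiter
        yield entry

def init_worker(names):
    #Give each worker process its own copy of the name set
    global NAME_SET
    NAME_SET = names

def match_entry(entry):
    #Return the entry's text if any line's first field is a wanted name
    for line in entry:
        fields = line.split()
        if fields and "-" not in fields[0] and len(fields[0]) >= 5 \
                and fields[0] in NAME_SET:
            return "".join(entry)
    return None

if __name__ == "__main__":
    #A set makes each membership test O(1) instead of scanning a list
    with open('Names.txt', 'r') as names:
        name_set = {name.strip() for name in names}

    with mp.Pool(processes=64, initializer=init_worker,
                 initargs=(name_set,)) as pool, \
         open("subset.sdf", 'w') as subset:
        #imap streams entries to workers in order; chunksize amortizes IPC cost
        for block in pool.imap(match_entry, read_entries("FOO.sdf"),
                               chunksize=100):
            if block is not None:
                subset.write(block)

One thing I am unsure about: pickling every ~150-line entry over to a worker just to do a set lookup may cost more than it saves, so it is possible the serial script with the set change alone would already be fast enough.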