I have code that opens two files, saves their contents to sets (set1 and set2), and writes the results of a pairwise comparison between these sets to an output file. Both files are really big (more than 100K lines each), and the code takes a long time to finish (more than 10 hours).

Is there a way to optimize its performance?

def matches2smiles():
    with open('file1.txt') as f:
        set1 = {a.rstrip('\n') for a in f}

    with open('file2.txt') as g:
        set2 = {b.replace('\n', '') for b in g}

    with open('output.txt', 'w') as h: 
        r = [                                                                    
            h.write(b + '\n')
            for a in set1
            for b in set2
            if a in b
            ]

matches2smiles()
marc_s
Marcos Santana

1 Answer

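If the goal is to find the lines that occur in both files, your code is wrong in the first place: the nested loop is unnecessary, and a membership test against set2 is enough. It should be: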

    r = [                                                                    
        h.write(a + '\n')
        for a in set1
        if a in set2
        ]

Better still, use set1.intersection(set2) (or the equivalent set1 & set2). It will likely be faster, and the code is clearer.
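As a rough sketch of the intersection approach, something like the following should work. It assumes you want exact line matches; the original `a in b` is a substring test between strings, which a set intersection does not reproduce. The helper names `read_lines` and `write_common_lines` are made up for this example, and the file names are the ones from the question.

```python
def read_lines(path):
    """Return the set of lines in a text file, without trailing newlines."""
    with open(path) as f:
        return {line.rstrip('\n') for line in f}


def write_common_lines(path1, path2, out_path):
    """Write every line that occurs in both input files to out_path.

    Assumption: only exact line matches count, not substring matches.
    """
    # Set intersection uses hash lookups instead of comparing every pair.
    common = read_lines(path1) & read_lines(path2)
    with open(out_path, 'w') as out:
        # Sets are unordered; sorting only makes the output reproducible.
        out.writelines(line + '\n' for line in sorted(common))
    return common
```

You would call it as `write_common_lines('file1.txt', 'file2.txt', 'output.txt')`. The intersection does work roughly proportional to the size of the smaller set, rather than len(set1) * len(set2) comparisons as in the nested loop.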

Zulan