It looks like you need a memory-efficient data structure to store the keywords and later check whether they exist in file2.txt. I suggest using a Trie so that you don't run out of memory.
Also, avoid reading the whole file at once. Read the keyword file one line at a time and insert each line into the trie.
A quick refactor of your code might look like this:
from collections import defaultdict


class Node:
    def __init__(self):
        # Child nodes are created on first access.
        self.children = defaultdict(Node)
        self.end = False  # True if a keyword ends at this node


class Trie:
    def __init__(self):
        self.root = Node()

    def insert(self, word: str) -> None:
        cur = self.root
        for c in word:
            cur = cur.children[c]
        cur.end = True

    def contains(self, word: str) -> bool:
        cur = self.root
        for c in word:
            # Membership test first, so a lookup never creates nodes.
            if c not in cur.children:
                return False
            cur = cur.children[c]
        return cur.end


# changes to your code below:
keywords = Trie()
with open('file2.txt', 'r') as k:
    for keyword in k:
        # Strip the line ending so the last line (which may lack one) still matches.
        keyword = keyword.rstrip('\n')
        if keyword:  # skip blank lines
            keywords.insert(keyword)

#2
with open('file1.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        if keywords.contains(line.rstrip('\n')):
            o.write(line)
Note: A possible disadvantage of using a Trie here is that if the keywords share few common prefixes, there is very little overlap between them and the trie may still use a lot of memory. Even so, it should need significantly less than 4GB.
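To get a feel for how much the overlap matters, here is a minimal sketch that counts the nodes a trie needs for two small keyword lists. The helper `count_trie_nodes` and both word lists are made up for illustration, and it uses plain dicts instead of the `Node` class above to stay short. It only counts nodes, so it says nothing about actual bytes used on your data.

```python
def count_trie_nodes(words):
    """Build a dict-based trie and return how many nodes it needs (root excluded)."""
    root = {}
    count = 0
    for word in words:
        cur = root
        for c in word:
            if c not in cur:
                cur[c] = {}  # new node only when the prefix is not stored yet
                count += 1
            cur = cur[c]
    return count


# Hypothetical keyword sets, for illustration only.
shared = ["interest", "interested", "interesting", "internal"]
distinct = ["apple", "bread", "cloud", "delta"]

# Nodes needed vs. total characters in the input.
print(count_trie_nodes(shared), sum(map(len, shared)))      # 16 37
print(count_trie_nodes(distinct), sum(map(len, distinct)))  # 20 20
```

With shared prefixes the trie stores 16 nodes for 37 characters. With no shared prefixes it needs one node per character, so you save nothing and still pay the per-node overhead.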