I have around 5000 files and I need to find which words from a list of 10000 English words appear in each of them. My current code uses a (very) long regex to do it, but it's very slow.
import re

wordlist = [...list of around 10000 english words...]
filelist = [...list of around 5000 filenames...]

# One giant alternation over the whole word list
wordlistre = re.compile('|'.join(wordlist), re.IGNORECASE)

discovered = []
for x in filelist:
    with open(x, 'r') as f:
        found = wordlistre.findall(f.read())
    if found:
        discovered.append((x, found))
This processes around 5 files per second, which is a lot faster than doing it manually, but it's still very slow. Is there a better way to do this?
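For reference, here is a rough sketch of the kind of alternative I have in mind: tokenize each file and intersect the tokens with a set of the words. This assumes whole-token matching is acceptable (unlike the regex above, it won't match words appearing as substrings, and it doesn't count repeated occurrences), so it may not be exactly equivalent:

import re

# Hypothetical set-based variant: look up whole tokens in a set
# instead of running one huge alternation regex per file.
wordset = {w.lower() for w in wordlist}
token_re = re.compile(r"[A-Za-z']+")

discovered = []
for x in filelist:
    with open(x, 'r') as f:
        tokens = token_re.findall(f.read().lower())
    found = wordset.intersection(tokens)
    if found:
        discovered.append((x, sorted(found)))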