
I have around 5000 files and I need to find words in each of them from a list of 10000 words. My current code uses a (very) long regex to do it, but it's very slow.

import re

wordlist = [...list of around 10000 english words...]
filelist = [...list of around 5000 filenames...]
wordlistre = re.compile('|'.join(wordlist), re.IGNORECASE)
discovered = []

for x in filelist:
    with open(x, 'r') as f:
        found = wordlistre.findall(f.read())
    if found:
        discovered.append((x, found))

This checks files at a rate of around 5 files per second, which is a lot faster than doing it manually, but it's still very slow. Is there a better way to do this?

Daffy

3 Answers


If you have access to grep on a command line, you can try the following:

grep -i -f wordlist.txt -r DIRECTORY_OF_FILES

You'll need to create a file wordlist.txt of all the words (one word per line).

Any lines in any of your files that match any of your words will be printed to STDOUT in the following format:

<path/to/file>:<matching line>
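
If you want the results back in Python, you can drive grep from a script. A minimal sketch (assuming grep is on your PATH; wordlist.txt and DIRECTORY_OF_FILES are the same illustrative names as above):

import subprocess
from collections import defaultdict

# -i: ignore case; -f: read patterns from wordlist.txt; -r: recurse.
proc = subprocess.Popen(
    ['grep', '-i', '-f', 'wordlist.txt', '-r', 'DIRECTORY_OF_FILES'],
    stdout=subprocess.PIPE)
out, _ = proc.communicate()

# Each output line looks like "path/to/file:matching line".
# (A filename containing ':' would confuse this simple split.)
matches = defaultdict(list)
for line in out.decode().splitlines():
    path, _, text = line.partition(':')
    matches[path].append(text)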
Sam Choukri
  • p.s. `grep` is unix-based only. Use `findstr` in Windows-based system: https://technet.microsoft.com/en-us/library/cc732459.aspx – Raptor Apr 15 '15 at 06:36

Without more info on your data, a couple of thoughts: use dictionaries (or sets) instead of lists, and reduce the amount of data that has to be searched. Also consider using re.split if your delimiters are not as clean as in the example below:

wordlist = 'this|is|it|what|is|it'.split('|')
d_wordlist = {}

# Bucket the words by first letter so each lookup scans only a small set.
for word in wordlist:
    first_letter = word[0]
    d_wordlist.setdefault(first_letter, set()).add(word)

filelist = [...list of around 5000 filenames...]
discovered = {}

for x in filelist:
    with open(x, 'r') as f:
        # f.read() returns one string; split it into words first,
        # lowercasing to approximate the original IGNORECASE behaviour.
        for word in f.read().lower().split():
            first_letter = word[0]
            if word in d_wordlist.get(first_letter, set()):
                discovered.setdefault(x, set()).add(word)
  • Better to make a set out of wordlist and let python do the search optimisation. Btw. [file.read()](https://docs.python.org/2/library/stdtypes.html#file.read) does not return a list of words. – swenzel Apr 15 '15 at 08:25
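
Following that comment's suggestion, a minimal sketch that keeps the whole word list in one flat set and lets Python's hash lookup do the work (untested; the .lower() call assumes the word list is lowercase):

words = set(wordlist)
discovered = {}

for x in filelist:
    with open(x, 'r') as f:
        # Intersect the file's tokens with the word set in one pass.
        found = set(f.read().lower().split()) & words
    if found:
        discovered[x] = found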

The Aho-Corasick algorithm was devised for precisely this usage, and implemented as fgrep in Unix. With POSIX, the command grep -F is defined to perform this function.

It differs from regular grep in that it only uses fixed strings (not regular expressions) and is optimized for searching for a large number of strings.

To run it on a large number of files, specify the precise files on the command line, or pass them through xargs:

xargs -a filelist.txt grep -F -f wordlist.txt

The function of xargs is to fill up the command line with as many files as possible, and to run grep as many times as necessary:

grep -F -f wordlist.txt (files 1 through 2,500 maybe)
grep -F -f wordlist.txt (files 2,501 through 5,000)

The precise number of files per invocation depends on the length of the individual file names, and the size of the ARG_MAX constant on your system.

tripleee
  • It's not hard to find `grep` ports for Windows; but they may differ in quality. I would suggest looking for GNU `grep` for Windows. The first Google hit for me is http://gnuwin32.sourceforge.net/packages/grep.htm – tripleee Apr 15 '15 at 08:35
  • There are Python modules which implement this algorithm too, but I wouldn't know which one to recommend. https://pypi.python.org/pypi/ahocorasick/0.9 is the top Google hit for me but it has a lower version number and looks less polished than https://pypi.python.org/pypi/pyahocorasick/1.0.0 – tripleee Apr 15 '15 at 08:42
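
For completeness, a minimal sketch with the pyahocorasick package (pip install pyahocorasick; it reuses wordlist and filelist from the question, and the API details are an assumption based on the package's documentation):

import ahocorasick

# Build the automaton once from the whole word list.
automaton = ahocorasick.Automaton()
for word in wordlist:
    automaton.add_word(word.lower(), word)
automaton.make_automaton()

discovered = {}
for x in filelist:
    with open(x, 'r') as f:
        text = f.read().lower()
    # iter() yields (end_index, stored_value) for every substring match;
    # like the original regex, this has no word-boundary handling.
    found = set(value for _, value in automaton.iter(text))
    if found:
        discovered[x] = found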