
Consider a text file of 1.5 million lines with around 50-100 words per line.

To find the lines that contain a word, using os.popen('grep -w word infile') seems to be faster than

for line in infile:
    if word in line:
        print(line)

How else could one search for a word in a text file in Python? And what is the fastest way to search through such a large, unindexed text file?
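
For reference, here is roughly how the grep call above is wired up (the word and file name below are placeholders):

import os

word = "some"          # placeholder search term
path = "infile"        # placeholder file name

# grep does the scanning in C; Python only iterates over the matching lines
for line in os.popen('grep -w %s %s' % (word, path)):
    print(line.rstrip())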

alvas
  • I think a regex could be very fast. But since your file is very big, it cannot be loaded into RAM in one piece and analysed with a regex all at once. It is possible, however, to read the file in large chunks and run the regex on each chunk in turn. Done naively, the searched string may straddle two chunks and go undetected, so the chunks have to be analysed in a particular way (a sketch of this chunked approach follows these comments). I wrote such code already and posted it here on stackoverflow.com; let me search for it. – eyquem Jul 08 '13 at 07:22
  • I found my earlier post (http://stackoverflow.com/questions/16583591/read-a-very-big-single-line-txt-file-and-split-it), in which the code detects the string ROW_DEL in a big file and replaces it with a shorter string. Your problem is just to detect a pattern, which is simpler. Take a look at that post to see how I analysed the text chunk after chunk, and adapt the principle to your more limited need. – eyquem Jul 08 '13 at 07:53
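
A minimal sketch of that chunk-by-chunk idea (the word, path and chunk size below are placeholders): rather than keeping a fixed-size overlap, it carries the trailing partial line over into the next chunk, so a word can never be split, or falsely matched, at a chunk boundary.

import re

word = "some"                      # placeholder search term
path = "/path/to/some/file.txt"    # placeholder file name
chunk_size = 1 << 20               # read roughly 1 MiB at a time

pattern = re.compile(r'\b%s\b' % re.escape(word))
count = 0

with open(path) as f:
    leftover = ''                  # partial last line carried into the next chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # search only up to the last complete line; keep the rest for the next round
        buf, _, leftover = (leftover + chunk).rpartition('\n')
        count += len(pattern.findall(buf))
    count += len(pattern.findall(leftover))   # the final partial line, if any

print("Found entries:", count)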

2 Answers


There are several fast string-search algorithms (see Wikipedia). They require you to preprocess the searched word into some auxiliary structure first. GNU grep, for instance, uses a Boyer-Moore-style algorithm to skip ahead in the text.

I haven't looked at the source of Python's in operator, but either

  1. the word is preprocessed anew for every line, which takes time (I doubt in compiles anything, though it obviously could compile it once, cache the result, etc.), or
  2. the search itself is inefficient. Consider searching for "word" in "worword": you first compare against "worw" and fail, then you restart at "o", then at "r", and fail again. But there is no reason to re-check "o" or "r" if you are smart about it. The Knuth-Morris-Pratt algorithm, for example, builds a table from the searched word that tells it how many characters can be skipped when a mismatch occurs (see the sketch after this list).
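
To make the skipping idea concrete, here is a minimal Knuth-Morris-Pratt sketch in Python (the word and text below are placeholders); failure[i] records the length of the longest proper prefix of word[:i+1] that is also a suffix of it, which is exactly how far the search can fall back without re-reading characters it has already matched.

def kmp_find_all(word, text):
    # Build the failure table: failure[i] is the length of the longest proper
    # prefix of word[:i+1] that is also a suffix of word[:i+1].
    failure = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k and word[i] != word[k]:
            k = failure[k - 1]
        if word[i] == word[k]:
            k += 1
        failure[i] = k

    # Scan the text; on a mismatch, fall back via the table instead of
    # re-checking characters that are already known to match.
    k = 0
    matches = []
    for i, ch in enumerate(text):
        while k and ch != word[k]:
            k = failure[k - 1]
        if ch == word[k]:
            k += 1
        if k == len(word):
            matches.append(i - k + 1)
            k = failure[k - 1]
    return matches

print(kmp_find_all("word", "worword"))   # [3]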
Jirka

I recommend installing and using the_silver_searcher.

In my test it searched a ~1 GB text file with ~29 million lines and found hundreds of entries of the searched word in just 00h 00m 00.73s, i.e. less than a second!

Here is Python 3 code which uses it to search for a word and count the number of matching lines:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

# -w: match whole words only; -c: print only the count of matching lines
command = ["/usr/local/bin/ag", "-wc", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE).stdout.read()
print("Found entries:", output.rstrip().decode('ascii'))

This version searches for the word and prints the line numbers plus the actual text where the word was found:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

# -w: match whole words only
command = ["/usr/local/bin/ag", "-w", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE)

for line in output.stdout.readlines():
    print(line.rstrip().decode('ascii'))
Denis Rasulev