
I was asked in an interview how to count the occurrences of the string "And" in a 10GB file on a machine with 1GB of RAM, and how to do it efficiently. I answered that we could read the file into memory in chunks of 100MB each, count the occurrences of "And" in each chunk, and keep a cumulative total. The interviewer was not satisfied with my answer. He asked me how the grep command works in Unix and told me to write similar code in Python, but I did not know the answer. I would appreciate an answer to this question.

vkaul11

2 Answers


Iterating over a file object yields its lines one at a time. This case is easy because the search string contains no end-of-line characters, so we don't need to worry about a match crossing line boundaries.

with open("file.txt") as fin:
    print(sum(line.count('And') for line in fin))

This uses str.count on each line:

>>> help(str.count)
Help on method_descriptor:

count(...)
    S.count(sub[, start[, end]]) -> int

    Return the number of non-overlapping occurrences of substring sub in
    string S[start:end].  Optional arguments start and end are interpreted
    as in slice notation.
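The line-by-line approach above sidesteps the chunk-boundary problem, but the fixed-size-chunk idea from the question also works if you carry a small overlap between chunks so a match straddling two chunks isn't lost. A minimal sketch (the function name and parameters are my own, not from the answer):

```python
def count_in_chunks(path, needle, chunk_size=1024 * 1024):
    """Count non-overlapping occurrences of needle, reading the file
    in fixed-size chunks so memory use stays bounded."""
    total = 0
    overlap = ""
    with open(path) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buf = overlap + chunk
            total += buf.count(needle)
            # Keep the last len(needle) - 1 characters: a match that
            # straddles two chunks is then found in the next iteration,
            # and the overlap is too short to hold a full match twice.
            overlap = buf[-(len(needle) - 1):] if len(needle) > 1 else ""
    return total
```

With a search string of length 3, only two characters are carried between reads, so peak memory stays close to `chunk_size` regardless of file size.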
John La Rooy

If you use generators, you can process a big file lazily without loading it all into memory.

Here is a simple grep command:

def command(f):
    """Decorator: turn a per-line generator into a command that
    reads files, filters their lines, and prints the results."""
    def g(filenames, **kwa):
        lines = readfiles(filenames)
        lines = (outline for line in lines for outline in f(line, **kwa))
        printlines(lines)
    return g

def readfiles(filenames):
    # Lazily yield lines from each file in turn
    for name in filenames:
        with open(name) as fh:
            for line in fh:
                yield line


def printlines(lines):
    for line in lines:
        print(line.strip("\n"))

@command
def grep(line, pattern):
    if pattern in line:
        yield line


if __name__ == '__main__':
    import sys
    pattern = sys.argv[1]
    filenames = sys.argv[2:]
    grep(filenames, pattern=pattern)
John Prawyn