I was asked how to count the occurrences of the string "And" in a large 10 GB file when only 1 GB of RAM is available. How would I do it efficiently? I answered that we need to read the file in chunks of 100 MB each, count the occurrences of "And" in each chunk, and keep a cumulative total. The interviewer was not satisfied with my answer. He asked me how the grep command works in Unix and told me to write similar code in Python, but I did not know the answer. I would appreciate an answer to this question.
[This](http://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho) might be helpful. – Sukrit Kalra Jul 23 '13 at 05:24
Don't forget to check the boundaries if you are not reading by lines – John La Rooy Jul 23 '13 at 05:33
2 Answers
5
Iterating over a file object yields its lines one at a time, so the whole file never has to be in memory at once. In this case it's easy because the search string doesn't contain end-of-line characters, so we don't need to worry about matches crossing line boundaries.

    with open("file.txt") as fin:
        print(sum(line.count('And') for line in fin))

This uses str.count on each line:

    >>> help(str.count)
    Help on method_descriptor:

    count(...)
        S.count(sub[, start[, end]]) -> int

        Return the number of non-overlapping occurrences of substring sub in
        string S[start:end].  Optional arguments start and end are
        interpreted as in slice notation.
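
Line iteration only bounds memory if the lines themselves are short. If the file might contain very long lines, or no newlines at all, a fixed-size chunked read is an alternative, provided matches that straddle a chunk boundary are not lost. The following is a minimal sketch and not part of the original answer: count_in_chunks and its parameters are hypothetical names, it reads bytes rather than text, and it assumes the pattern cannot overlap itself (true for "And").

```python
def count_in_chunks(path, pattern, chunk_size=100 * 1024 * 1024):
    """Count occurrences of a bytes pattern by reading fixed-size chunks.

    Assumption: pattern does not overlap itself (e.g. b"And"); for a
    pattern such as b"aa" the result may differ from bytes.count on the
    whole file.
    """
    overlap = len(pattern) - 1   # bytes to carry across a chunk boundary
    total = 0
    tail = b""                   # last `overlap` bytes of the previous buffer
    with open(path, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            # Prepend the carried-over tail so a match split across two
            # chunks is seen whole. The tail is shorter than the pattern,
            # so no match is counted twice.
            buf = tail + chunk
            total += buf.count(pattern)
            tail = buf[-overlap:] if overlap else b""
    return total
```

With the numbers from the question, a call might look like count_in_chunks("file.txt", b"And"). Note that tail + chunk briefly creates a second copy of the chunk, so peak memory is roughly twice chunk_size.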

John La Rooy
4
If you use generators, you can process a big file lazily, one line at a time, without loading it all into memory.

A simple grep command:

    def command(f):
        # Turn a per-line generator function f into a command that reads
        # the given files, applies f to every line, and prints the results.
        def g(filenames, **kwa):
            lines = readfiles(filenames)
            lines = (outline for line in lines for outline in f(line, **kwa))
            # lines = (line for line in lines if line is not None)
            printlines(lines)
        return g

    def readfiles(filenames):
        # Lazily yield every line of every file, closing each file when done.
        for f in filenames:
            with open(f) as fin:
                for line in fin:
                    yield line

    def printlines(lines):
        for line in lines:
            print(line.strip("\n"))

    @command
    def grep(line, pattern):
        # Yield the line only if it contains the pattern.
        if pattern in line:
            yield line

    if __name__ == '__main__':
        import sys
        pattern = sys.argv[1]
        filenames = sys.argv[2:]
        grep(filenames, pattern=pattern)
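
The command above prints matching lines, as grep does, while the question asks for a count. The same generator style can feed sum instead of print. This is a rough sketch under that assumption; iter_lines and count_occurrences are hypothetical names introduced here, not part of the answer's code.

```python
def iter_lines(filenames):
    # Lazily yield every line of every file; only one line is held at a time.
    for name in filenames:
        with open(name) as fin:
            for line in fin:
                yield line


def count_occurrences(filenames, pattern):
    # Sum the non-overlapping occurrences of pattern on each line.
    # Assumes pattern contains no newline, so no match spans two lines.
    return sum(line.count(pattern) for line in iter_lines(filenames))
```

For example, count_occurrences(["file.txt"], "And") would return the total for a single file, and passing several names adds their counts together.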

John Prawyn