Python - How to search a string in a large file

Question

I have a large file that can have strings like file_+0.txt, file_[]1.txt, file_~8.txt etc.

I want to find which file_*.txt numbers are missing, up to a certain number.

For example, if I give it the file below and the number 5, it should tell me that the missing ones are 1 and 4.

asdffile_[0.txtsadfe
asqwffile_~2.txtsafwe
awedffile_[]2.txtsdfwe
qwefile_*0.txtsade
zsffile_+3.txtsadwe

I wrote a Python script to which I can give the file path and a number and it will give me all file names that are missing until that number.

My program works for small files. But when I give it a large file (12MB) that can have file numbers up to 10000, it just hangs.

Here is my current Python code

#!/usr/bin/env python
import mmap
import re

def main():
    filePath = input("Enter file path: ")
    endFileNum = input("Enter end file number: ")
    print(filePath)
    print(endFileNum)
    filesMissing = []
    filesPresent = []
    f = open(filePath, 'rb', 0)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for x in range(int(endFileNum)):
        myRegex = r'(.*)file(.*)' + re.escape(str(x)) + r'\.txt'
        myRegex = bytes(myRegex, 'utf-8')
        if re.search(myRegex, s):
            filesPresent.append(x)
        else:
            filesMissing.append(x)
    #print(filesPresent)
    print(filesMissing)

if __name__ == "__main__":
    main()

Output hangs when I give a 12MB file which can have files from 0 to 9999

$python findFileNumbers.py
Enter file path: abc.log
Enter end file number: 10000

Output for a small file (same as the above example)

$python findFileNumbers.py
Enter file path: sample.log
Enter end file number: 5
[0, 2, 3]
[1, 4]

How can I make this work for big files?
Is there a better way I can get these results instead of a Python script?

Thanks in advance!

Big in terms of what? The number of files to search through, the size of the data in the file, the length of its name? — bgfvdu3w, Sep 28 '17 at 19:27
I gave a 12MB file as input and the number of files it can search through is 10,000 — SyncMaster, Sep 28 '17 at 19:34
There is no need to map the files to memory if you just need to get their names. — bgfvdu3w, Sep 28 '17 at 19:37
I am not getting the names of the files. I have one input file and I am getting matching strings within that file. Can you please check my example above — SyncMaster, Sep 28 '17 at 19:39

score 2 · Accepted Answer · answered Sep 28 '17 at 19:43

First collect the existing numbers in a set, and then look for the missing ones.

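import re

# Reluctant .*? so the whole number is captured, not just its last digit
my_regex = re.compile(r'file.*?(\d+)\.txt')
present_ones = set()
# filepath is the path to the input file
with open(filepath) as f:
    for line in f:
        # search() finds the pattern anywhere in the line
        match = my_regex.search(line)
        if match:
            present_ones.add(int(match.group(1)))
for num in range(...):  # range(...) is a placeholder for the end file number
    if num not in present_ones:
        print("Missing", num)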

Yours hangs because it goes through the entire file once for each number, i.e. 12MB * 10000 = 120GB. The script has to scan up to 120GB in total, so it appears to hang even though the file is in an mmap.
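To make the single-pass idea concrete, here is a minimal sketch run on the sample lines from the question. The helper name find_missing and the in-memory list of lines are assumptions for illustration only; on a real log you would pass the open file object instead of a list.

```python
import re

# Hypothetical helper for illustration: one pass over the lines,
# collecting every file number, then one pass over range(end).
FILE_NUM = re.compile(r'file_.*?(\d+)\.txt')

def find_missing(lines, end):
    """Return the sorted numbers in range(end) that never appear in lines."""
    present = set()
    for line in lines:
        for m in FILE_NUM.finditer(line):
            present.add(int(m.group(1)))
    return [n for n in range(end) if n not in present]

# Sample data copied from the question
sample = [
    "asdffile_[0.txtsadfe",
    "asqwffile_~2.txtsafwe",
    "awedffile_[]2.txtsdfwe",
    "qwefile_*0.txtsade",
    "zsffile_+3.txtsadwe",
]
print(find_missing(sample, 5))  # [1, 4]
```

Each line is read once no matter how large the end number is, so the work grows with the file size plus the end number rather than with their product.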

You need to use `.*?` in your regex if OP's implication that multiple numbers can occur on one line is correct. — Mad Physicist, Sep 28 '17 at 19:57

score 1 · Answer 2 · answered Sep 28 '17 at 19:46

I would suggest that you simply read through the input file line by line and parse each line for its file number. Then use that file number as an index into a boolean array that is initially all False.

You don't do any processing that requires the file to be in memory. This approach will work for very large files.

import re
import numpy as np

def main():
    #~ filePath = input("Enter file path: ")
    filePath = 'filenames.txt'
    #~ endFileNum = int(input("Enter end file number: "))
    endFileNum = 5
    print(filePath)
    print(endFileNum)
    # found[n] becomes True once file number n has been seen
    found = np.zeros(1+endFileNum, dtype=bool)
    patt = re.compile(r'file_\D*(\d+)\.txt')
    with open(filePath) as f:
        for line in f:
            m = patt.search(line)
            # skip lines without a file number, or with one beyond the end
            if m and int(m.group(1)) <= endFileNum:
                found[int(m.group(1))] = True
    print(found)

if __name__ == "__main__":
    main()

This produces the following result from which your filesPresent and filesMissing are easily recovered.

filenames.txt
5
[ True False  True  True False False]
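
As a rough sketch of that recovery step: the indices of the True entries are the files present and the indices of the False entries are the files missing. A plain list stands in for the NumPy array here (an assumption made to keep the example dependency-free); the same comprehensions work on the boolean array.

```python
# Flags as printed above; a plain list stands in for the NumPy array.
found = [True, False, True, True, False, False]

files_present = [n for n, seen in enumerate(found) if seen]
files_missing = [n for n, seen in enumerate(found) if not seen]
print(files_present)  # [0, 2, 3]
print(files_missing)  # [1, 4, 5]
```

Note that 5 shows up as missing because the array has 1+endFileNum entries. Slicing with found[:5] before the comprehension would reproduce the [1, 4] from the question.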

score 1 · Answer 3 · edited Jun 20 '20 at 09:12

Let's take a look at what you are actually doing here:

1. Memory map the file.
2. For each number:
   a. Compile a regular expression for that number.
   b. Search for the regular expression in the entire file.

This is very inefficient for large numbers. While memory mapping gives you a string-like interface to the file, it is not magic. You still have to load chunks of the file to move around within it. At the same time, you are making a pass, potentially over the entire file, for each regex. And regex matching is expensive as well.

The solution here would be to make a single pass through the file, line by line. You should pre-compile the regular expression instead of compiling it once per number if you have a large number to search for. To get all the numbers in a single pass, you could make a set of all the numbers up to the one you want, called "missing", and an empty set called "found". Whenever you encounter a line with a number, you would move the number from "missing" to "found".

Here is a sample implementation:

import re

filePath = input("Enter file path: ")
endFileNum = int(input("Enter end file number: "))
missing = set(range(endFileNum))
found = set()
# Compile once, outside the loop
regex = re.compile(r'file_.*?(\d+)\.txt')
with open(filePath) as file:
    for line in file:
        # finditer yields every match on the line, not just the first
        for match in regex.finditer(line):
            num = int(match.group(1))
            if num < endFileNum:
                found.add(num)
missing -= found
print(sorted(missing))

Notice that the regular expression uses the reluctant quantifier .*? after file_. This will match as few characters as possible before looking for a digit. If you used the default greedy quantifier .*, a line with multiple numbers would produce a single match, and the group would capture only the last digit of the last number.
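
A small sketch of the difference, using a made-up line with two file names on it (the line content is an assumption for illustration):

```python
import re

# Hypothetical input line containing two file names
line = "file_10.txt file_23.txt"

# Greedy: .* swallows as much as it can, leaving a single digit for (\d+)
greedy = re.findall(r'file_.*(\d+)\.txt', line)
# Reluctant: .*? stops at the first point where (\d+)\.txt can match
lazy = re.findall(r'file_.*?(\d+)\.txt', line)

print(greedy)  # ['3']
print(lazy)    # ['10', '23']
```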
