
I am developing a script to read through the logs in a log folder and check, for each .txt log file:

1) whether the log contains the string 'FileData' and
2) whether the log does not contain the string 'Error in FileData'

If both conditions are met, the script needs to read the file and collect the content of line 2. After some research on the topic, I found a solution, and the script below works. The issue is that reading through 3000 files takes ~20 min, and with the number of files expected to grow very fast, this solution is infeasible.

import os
import mmap
from itertools import islice

results = {}

for log in sorted(os.listdir(log_folder)):
    with open(os.path.join(log_folder, log), 'r') as f:
        # Memory-map the file so both substring searches scan it
        # without reading it line by line
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        if s.find(b'FileData') != -1 and s.find(b'Error in FileData') == -1:
            lines = list(islice(f, 2))  # first two lines of the file
            results[log] = lines[1]     # line 2

If I run this with only the first find ('FileData'), it is very fast, but the moment I added the second find ('Error in FileData'), the time increased non-linearly. Is there another way to do the same action, but faster? I tried re.findall() and readlines(), but the results were too similar to this one. For reference, a read-once sketch of that kind of attempt is below.
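A minimal read-once sketch (assuming each log is small enough to read fully into memory; both substring checks then reuse the same buffer):

import os

results = {}

for log in sorted(os.listdir(log_folder)):
    with open(os.path.join(log_folder, log), 'rb') as f:
        data = f.read()  # read each file exactly once
    if b'FileData' in data and b'Error in FileData' not in data:
        lines = data.splitlines()
        if len(lines) >= 2:
            results[log] = lines[1].decode()  # line 2 (index 1)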

Thanks!

  • Assuming the files are only written to by the program (no deletions or overwrites), you could save your position in each file – delyeet Feb 25 '20 at 16:05

1 Answer


If the bottleneck is due to I/O operations, then multithreading should result in a speed increase. Untested.

import os
import threading
import mmap
from itertools import islice

results = {}

def operate(path, log):
    # Same check as in the question, run in one thread per file
    with open(path, 'r') as f:
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        if s.find(b'FileData') != -1 and s.find(b'Error in FileData') == -1:
            lines = list(islice(f, 2))  # first two lines
            results[log] = lines[1]     # line 2

threads = []
for log in sorted(os.listdir(log_folder)):
    t = threading.Thread(target=operate,
                         args=(os.path.join(log_folder, log), log))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait for every file to be processed
El-Chief
  • It might make more sense to use multiprocessing due to the global interpreter lock (a sketch of that approach follows these comments) – delyeet Feb 25 '20 at 16:03
  • Thanks for the reply @el-banto, but it did not really improve speed... I believe I need to change the logic for how many files to process and keep it close to or below 3000 files to keep the runtime 'bearable'. – Rafael Castelo Branco Feb 27 '20 at 09:13
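A minimal sketch of the multiprocessing variant suggested in the first comment, using the standard-library concurrent.futures.ProcessPoolExecutor (untested; the helper name check_log and the 'logs' placeholder path are illustrative, not from the original answer):

import os
from concurrent.futures import ProcessPoolExecutor

log_folder = 'logs'  # placeholder; point this at the real log folder

def check_log(path):
    # Return (file name, line 2) for a qualifying log, else None.
    with open(path, 'rb') as f:
        data = f.read()
    if b'FileData' in data and b'Error in FileData' not in data:
        lines = data.splitlines()
        if len(lines) >= 2:
            return os.path.basename(path), lines[1].decode()
    return None

if __name__ == '__main__':
    paths = [os.path.join(log_folder, log)
             for log in sorted(os.listdir(log_folder))]
    results = {}
    with ProcessPoolExecutor() as pool:
        # Separate worker processes sidestep the GIL; results are
        # returned to the parent instead of mutating shared state.
        for hit in pool.map(check_log, paths):
            if hit is not None:
                results[hit[0]] = hit[1]

Whether processes actually beat threads here depends on whether the job is I/O-bound (threads are fine) or CPU-bound (processes help).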