I am developing a string filter for huge process log files in a distributed system.

These log files are larger than 1 GB and contain millions of lines. They contain a special type of message block that starts with "SMsg{" and ends with "}". My program reads the whole file line by line and appends the line number of every line containing "SMsg{" to a list. Here is my Python method to do that.

def FindNMsgStart(self, logfile):
    self.logfile = logfile

    lf = LogFilter()

    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()

    for num, line in enumerate(infile.readlines()):
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)

    return NMsgBlockStart

This is my lookup function to search any kind of word in the text file.

def Lookup(self, infile, regex, start, end):
    self.infile = infile
    self.regex = regex
    self.start = start
    self.end = end
    result = 0

    for num, line in enumerate(itertools.islice(infile, start, end)):
        if re.search(regex, line):
            result = num + start
            break

    return result

I then take that list and search the whole file for the end of each block. The following is my code for finding the end.

def FindNmlMsgEnd(self, logfile, NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart

    NMsgBlockEnd = list()

    lf = LogFilter()

    length = len(NMsgBlockStart)

    if length > 0:
        for i in range(0, length):
            start = NMsgBlockStart[i]
            infile = lf.OpenFile(logfile, 'Input')
            lines = lf.LineCount(logfile, 'Input')
            end = lf.Lookup(infile, '}', start, lines + 1)
            NMsgBlockEnd.append(end)

        return NMsgBlockEnd
    else:
        print("There is no Normal Message blocks.")

But these methods are not efficient enough to handle huge files. The program runs for a long time without producing a result.

  1. Is there an efficient way to do this?
  2. If yes, how could I do this?

I am writing other filters too, but first I need to find a solution for this basic problem. I am really new to Python. Please help me.

Remi Guan
  • use .find instead of re.search – YOU Sep 19 '15 at 06:02
  • If you need high-performance, Python is the wrong tool for the job. – Nir Alfasi Sep 19 '15 at 06:06
  • Is there a reason to be searching for the start and end of your blocks separately? It seems like it would be vastly more efficient to find both ends in a single scan, rather than scanning for all the starts, then needing to go back and search for the ends. – Blckknght Sep 19 '15 at 06:20
  • Would putting all the lines into a dictionary reduce the cost? – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 06:21
  • If the computation of one result requires access to all the lines in the file, your improvements will be less significant (though some obvious rather significant optimizations have been pointed out); but if you can compute a result from one entry (however you define it) without access to the entire file, not needlessly reading the whole file into memory is going to be a real game-changer. – tripleee Sep 19 '15 at 07:58

1 Answer

I see a couple of issues that are slowing your code down.

The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which is going to read the whole file into memory and return a list of its lines.

You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
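
To illustrate the difference, here is a minimal sketch of the start-line scan that iterates over the file object directly. The name `find_block_starts` and the plain `open` call are assumptions standing in for your `FindNMsgStart` and `LogFilter.OpenFile`, which aren't shown in full; it also uses a substring test rather than `re.search`, since the marker is a fixed string.

```python
def find_block_starts(filename, marker='SMsg{'):
    """Return the 0-based line numbers of lines containing `marker`.

    `find_block_starts` is a hypothetical stand-in for FindNMsgStart.
    """
    starts = []
    with open(filename) as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory at once.
        for num, line in enumerate(f):
            if marker in line:
                starts.append(num)
    return starts
```

The only change that matters for memory use is `enumerate(f)` in place of `enumerate(infile.readlines())`; the rest is the same loop.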

The second issue is a bit more complicated. It involves the general architecture of your search.

You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.

Here's a really crude generator function that does that:

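def find_message_bounds(filename):
    """Yield (start, end) 0-based line numbers for each message block."""
    with open(filename) as f:
        # Both loops share one iterator, so the file is read only once.
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                # Keep reading from the line after the start marker.
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break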

This function yields (start, end) line-number tuples and makes only a single pass over the file.
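
If generators are unfamiliar, the sketch below shows two ways to consume one. To keep it runnable without a log file, it uses a variant that takes any iterable of lines; `find_bounds_in_lines` and the `sample` data are made up for illustration, but the logic mirrors the function above.

```python
def find_bounds_in_lines(lines):
    """Yield (start, end) 0-based line numbers for each SMsg{ ... } block.

    Hypothetical variant of find_message_bounds that accepts any
    iterable of lines (an open file, a list, ...).
    """
    iterator = enumerate(lines)
    for start_no, line in iterator:
        if 'SMsg{' in line:
            # Continue on the same iterator so no line is read twice.
            for end_no, end_line in iterator:
                if '}' in end_line:
                    yield start_no, end_no
                    break


# Made-up sample data standing in for a real log file.
sample = [
    "startup\n",
    "SMsg{\n",
    "  id = 1\n",
    "}\n",
    "noise\n",
    "SMsg{\n",
    "}\n",
]

# A generator produces values lazily, so you can loop over it...
for start, end in find_bounds_in_lines(sample):
    print(start, end)

# ...or collect every result at once with list().
bounds = list(find_bounds_in_lines(sample))
```

For a multi-gigabyte file the `for` loop form is usually preferable, since it handles each block as it is found instead of building the full list first.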

I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.

def FindNmlMsgEnd(self,logfile,NMsgBlockStart):

    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart

    NMsgBlockEnd = list()

    lf = LogFilter() 
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')

    start = NMsgBlockStart[0]
    prev_end = -1
    # Skip the first start: it is already held in `start`.
    for next_start in NMsgBlockStart[1:]:
        end = lf.Lookup(infile, '}', start-prev_end-1, next_start-prev_end-1) + prev_end + 1
        NMsgBlockEnd.append(end)

        start = next_start
        prev_end = end

    last_end = lf.Lookup(infile, '}', start-prev_end-1, total_lines-prev_end-1) + prev_end + 1
    NMsgBlockEnd.append(last_end)

    return NMsgBlockEnd

It's possible I have an off-by-one error in there somewhere; the design of the Lookup function makes it difficult to call repeatedly.

Blckknght
  • Thank you. But I have to read this file again and again because there are some complex message blocks, like JSON arrays of arrays. To find the end of those blocks I have implemented different methods. That's why I implemented these methods separately. – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 06:43
  • I don't have much knowledge about generator objects. How do I print the values from this one? – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 07:11
  • A generator works like an iterator. You can loop over it with `for`, or pass it to a function that expects an iterable object (like `list`, if you want to be able to see all the values at once). – Blckknght Sep 19 '15 at 07:14
  • Thank you. Can I use regular expressions in this function? Will that slow it down? – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 07:56
  • In the generator? Sure, just replace the `if` conditions with the equivalent regex search. I just did a simple substring match because that's all that's needed for your start and end lines as you've presented them. One other issue that's occurred to me. If a start and end might occur on the same line, you'll need to check for that before starting the inner loop. – Blckknght Sep 19 '15 at 08:07
  • That will never happen, as in this log file structure the start and end are definitely on separate lines. Will adding regular expressions increase the running time of the program? Is it good or bad? – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 08:24
  • I have modified your function to match some specific words inside the message blocks. Here is the Python fiddle of that modified method: https://jsfiddle.net/ak9p436p/ But it does not return the end of the message block correctly. I couldn't figure out the error. – Tharindu Ramesh Ketipearachchi Sep 19 '15 at 19:35
  • I think the last `break` is probably inappropriate, as it can cause the inner loop to stop without yielding anything. I think it will always be hit on the line of the last parameter in the file. There are some other brittle parts to the code. If a block doesn't have all the requested parameters, the search will go on beyond the end marker, probably until the missing parameters show up in another block's lines. I'd suggest reversing the nesting of the last two ifs, so you're always looking for the end block marker, but reporting an error or something if you don't have all the parameters yet. – Blckknght Sep 20 '15 at 05:27
  • Can you edit my code for me? I'm really horrible with iterators. If you can edit it, that would be really helpful. Can you put corrected code in this question? http://stackoverflow.com/questions/32672629/how-match-multiple-strings-at-once-in-python – Tharindu Ramesh Ketipearachchi Sep 20 '15 at 06:23
  • I have modified the code several times. But each time it returns the same start line for every end line. – Tharindu Ramesh Ketipearachchi Sep 20 '15 at 06:33
  • I made the changes you suggested, but it still gives the wrong answer: https://jsfiddle.net/pe68xc65/ – Tharindu Ramesh Ketipearachchi Sep 20 '15 at 06:52
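
The comment thread above discusses swapping the substring tests for regular expressions and handling a block that opens and closes on the same line. A rough sketch of both ideas follows; `find_bounds_regex`, `START_RE`, and `END_RE` are hypothetical names, and the patterns are only guesses at what the real log format needs. For fixed strings like these, a plain `in` test is generally cheaper than a regex, so the regex form is mainly worth it when the markers genuinely vary.

```python
import re

# Hypothetical patterns; adjust them to the real log format.
START_RE = re.compile(r'SMsg\{')
END_RE = re.compile(r'\}')


def find_bounds_regex(lines, start_re=START_RE, end_re=END_RE):
    """Yield (start, end) 0-based line numbers using regex matches."""
    iterator = enumerate(lines)
    for start_no, line in iterator:
        match = start_re.search(line)
        if not match:
            continue
        # Same-line case: look for the end marker after the start marker
        # before moving on to the following lines.
        if end_re.search(line, match.end()):
            yield start_no, start_no
            continue
        for end_no, end_line in iterator:
            if end_re.search(end_line):
                yield start_no, end_no
                break
```

As with the substring version, a block that never closes yields nothing, so a caller that needs to detect truncated blocks would have to add that check.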