2

I may be approaching this all wrong, but this is where I'm at. I have very large log files I'm trying to search, up to 30GB in some cases. I'm writing a script to pull info and have been playing with multiprocessing to speed it up a bit. Right now I'm testing running two functions at the same time, one searching from the top and one from the bottom, which seems to work. I'm wondering if it's possible to stop one function once the other finds a result, so that if the top function finds a match they both stop. That way I can build it out as needed.

#!/usr/bin/env python
from file_read_backwards import FileReadBackwards
from multiprocessing import Process

z = "log.log"

def top():
    target = "test"
    found = None
    with open(z) as src:
        for line in src:
            if len(line) == 0:  # happens at end of file, then stop loop
                break
            if target in line:
                found = line
                break
    print(found)


def bottom():
    target = "text"
    found = None
    with FileReadBackwards(z) as src:
        for line in src:
            if len(line) == 0:  # happens at end of file, then stop loop
                break
            if target in line:
                found = line
                break
    print(found)


if __name__ == '__main__':
    p1 = Process(target=top)
    p1.start()
    p2 = Process(target=bottom)
    p2.start()
JSimonsen
  • You'd have to pass a shared flag to both processes. When one finds the result, it sets it; the other sees that it's been set and exits (see the sketch after these comments). –  Nov 16 '17 at 00:43
  • Are you searching for something particularly complicated? Because doing it in python is just slow, even with multiprocessing. If you're after performance, you're better off with an existing tool or less sluggish language. The official Process docs discuss several methods of inter-process communication but it seems like massive overkill for what you're trying to do. – pvg Nov 16 '17 at 00:44
  • Also, I'd be curious to see this `FileReadBackwards`. It seems more like you'd want to split the file into chunks and read each chunk in the forward direction. –  Nov 16 '17 at 00:44
  • @Blurp I'm guessing the concern here is being able to immediately start searching without having to look for a line break on which the file must be split. – pvg Nov 16 '17 at 00:45
  • @pvg I guess, but you could split the file in half, read ahead to the next newline and then continue. Maybe reading files backwards isn't as slow as I assume it would be though. –  Nov 16 '17 at 00:50
  • Hey all. Can't really split the file in half as the files exist on our log server. Can't really use another established tool, even though I'd love to, due to restrictions on my team etc etc. So, only option right now is scripting against it. I was thinking shared flag @Blurp – JSimonsen Nov 16 '17 at 00:53
  • There's no actual splitting. You would seek to the midpoint of the file and go from there using the same logic as `top()`. –  Nov 16 '17 at 00:54
  • You don't have to literally split it. You just have to find the right offset, seek to it, and pass the offset to both processes. As to restrictions, is grep or ag restricted? – pvg Nov 16 '17 at 00:54
  • Ah sorry it's been a long day. I'm not sure why I thought literally split it. Is there an example of that you can provide? Granted my method isn't ideal, but it could help speed things up a tad. Grep on our servers is also painfully slow. I've done testing on all kinds of things, grep, awk, ag, various python scripts. Best speeds I've had so far are with python. Limitation there is if searching from something at the bottom of the file it still takes as long as others. Anything near the top is near instant. – JSimonsen Nov 16 '17 at 00:59
  • What kind of storage are you reading from? I do this a fair bit - text processing on big chunks of text, often with Python - and what you're describing does not match my experience at all in terms of the performance difference. I usually read off SSDs. – pvg Nov 16 '17 at 01:09
  • Here's another relevant question: why are you using processes instead of threads? – Mad Physicist Nov 16 '17 at 02:37
  • What version of python are you using? – mikeLundquist Nov 16 '17 at 02:39
  • Python3 usually. @MadPhysicist Its a solid question. Do you think it would improve performance? – JSimonsen Nov 16 '17 at 17:52
  • @JSimonsen. It reduces copying, so threads are usually much cheaper than processes. – Mad Physicist Nov 16 '17 at 17:58
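
A minimal sketch of the shared-flag idea from the comments, using a multiprocessing.Event. The top/bottom split, the file name, and the search terms mirror the question and are placeholders, not a tested solution:

from multiprocessing import Event, Process

from file_read_backwards import FileReadBackwards

z = "log.log"

def top(stop_event, target="test"):
    with open(z) as src:
        for line in src:
            if stop_event.is_set():   # the other process already found a match
                return
            if target in line:
                print(line.rstrip())
                stop_event.set()      # signal the other process to stop
                return

def bottom(stop_event, target="test"):
    with FileReadBackwards(z) as src:
        for line in src:
            if stop_event.is_set():
                return
            if target in line:
                print(line)
                stop_event.set()
                return

if __name__ == '__main__':
    stop_event = Event()
    procs = [Process(target=top, args=(stop_event,)),
             Process(target=bottom, args=(stop_event,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

The flag is checked once per line, so the losing process exits shortly after the other one finds a match.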

3 Answers

2

Here's a proof-of-concept of the approach I mentioned in the comments:

import os
import random
import sys
from multiprocessing import Process, Value


def search(proc_no, file_name, seek_to, max_size, find, flag):
    stop_at = seek_to + max_size

    with open(file_name) as f:
        if seek_to:
            f.seek(seek_to - 1)
            prev_char = f.read(1)
            if prev_char != '\n':
                # Landed in the middle of a line. Skip back one (or
                # maybe more) lines so this line isn't excluded. Start
                # by seeking back 256 bytes, then 512 if necessary, etc.
                exponent = 8
                pos = seek_to
                while pos >= seek_to:
                    pos = f.seek(max(0, pos - (2 ** exponent)))
                    f.readline()
                    pos = f.tell()
                    exponent += 1

        while True:
            if flag.value:
                break
            line = f.readline()
            if not line:
                break  # EOF
            data = line.strip()
            if data == find:
                flag.value = proc_no
                print(data)
                break
            if f.tell() > stop_at:
                break


if __name__ == '__main__':
    # list.txt contains lines with the numbers 1 to 1000001
    file_name = 'list.txt'
    info = os.stat(file_name)
    file_size = info.st_size

    if len(sys.argv) == 1:
        # Pick a random value from list.txt
        num_lines = 1000001
        choices = list(range(1, num_lines + 1))
        choices.append('XXX')
        find = str(random.choice(choices))
    else:
        find = sys.argv[1]

    num_procs = 4
    chunk_size, remainder = divmod(file_size, num_procs)
    max_size = chunk_size + remainder
    flag = Value('i', 0)
    procs = []

    print(f'Using {num_procs} processes to look for {find} in {file_name}')

    for i in range(num_procs):
        seek_to = i * chunk_size
        proc = Process(target=search, args=(i + 1, file_name, seek_to, max_size, find, flag))
        procs.append(proc)

    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()

    if flag.value:
        print(find, 'found by proc', flag.value)
    else:
        print(find, 'not found')
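
To try the proof of concept, the list.txt test file mentioned in the comment above can be generated with a few lines like these (an assumed helper matching the 1 to 1000001 range; adjust as needed):

# Generate the list.txt test file used above: the numbers 1 to 1000001, one per line.
with open('list.txt', 'w') as f:
    for n in range(1, 1000002):
        f.write(f'{n}\n')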

After reading various posts[1] about reading files with multiprocessing and multithreading, it seems that neither is a great approach due to potential disk thrashing and serialized reads. So here's a different, simpler approach that is way faster (at least for the file with a million lines I was trying it out on):

import mmap
import sys

def search_file(file_name, text, encoding='utf-8'):
    text = text.encode(encoding)
    with open(file_name, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            index = m.find(text)
            if index > -1:
                # Found a match; now find beginning of line that
                # contains match so we can grab the whole line.
                while index > 0:
                    index -= 1
                    if m[index] == 10:
                        index += 1
                        break
                else:
                    index = 0
                m.seek(index)
                line = m.readline()
                return line.decode(encoding)

if __name__ == '__main__':
    file_name, search_string = sys.argv[1:]
    line = search_file(file_name, search_string)
    sys.stdout.write(line if line is not None else f'Not found in {file_name}: {search_string}\n')

I'm curious how this would perform with a 30GB log file.
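
One rough way to check, assuming search_file from the listing above is importable (the module name search_mmap, the file name, and the search string below are hypothetical):

# Time a single mmap-based search; search_mmap, big.log, and the search string are placeholders.
import time

from search_mmap import search_file

start = time.perf_counter()
line = search_file('big.log', 'some string near the end of the file')
elapsed = time.perf_counter() - start
result = line.rstrip() if line is not None else 'not found'
print(result, f'({elapsed:.2f} s)')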

[1] Including this one

0

Simple example using a multiprocessing.Pool and callback function. Terminates remaining pool processes once a result has returned.

You could add an arbitrary number of processes to search from different offsets in the file using this approach.

import math
import time

from multiprocessing import Pool
from random import random


def search(pid, wait):
    """Sleep for wait seconds, return PID
    """
    time.sleep(wait)
    return pid


def done(result):
    """Do something with result and stop other processes
    """
    print("Process: %d done." % result)
    pool.terminate()
    print("Terminate Pool")


if __name__ == '__main__':
    pool = Pool(2)
    pool.apply_async(search, (1, math.ceil(random() * 3)), callback=done)
    pool.apply_async(search, (2, math.ceil(random() * 3)), callback=done)

    # do other stuff ...

    # Wait for result
    pool.close()
    pool.join()  # block our main thread
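
As a hedged sketch of how the same callback-and-terminate pattern might map onto the file search itself: search_from, the file name, and the target string below are hypothetical, and there is no chunk boundary here, so the early exit comes entirely from the callback terminating the pool.

import os
from multiprocessing import Pool


def search_from(file_name, byte_offset, target):
    """Scan file_name for target starting at byte_offset; return the matching line or None."""
    target = target.encode()
    with open(file_name, 'rb') as f:
        f.seek(byte_offset)
        if byte_offset:
            f.readline()  # skip the partial line the offset landed in
        for line in f:
            if target in line:
                return line.decode(errors='replace')
    return None


def done(result):
    """Callback in the parent process: stop all workers once a real match comes back."""
    if result is not None:
        print(result.rstrip())
        pool.terminate()


if __name__ == '__main__':
    file_name, target = 'log.log', 'test'
    offsets = [0, os.path.getsize(file_name) // 2]  # front half and back half
    pool = Pool(len(offsets))
    for offset in offsets:
        pool.apply_async(search_from, (file_name, offset, target), callback=done)
    pool.close()
    pool.join()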
stacksonstacks
0

This is essentially the same as Blurp's answer, but I shortened it and made it a bit more general. As you can see, top would loop forever on its own, but bottom clears the flag and stops it almost immediately. Note that a plain module-level variable is not shared between processes, so the flag has to be a multiprocessing.Value passed to both.

from multiprocessing import Process, Value


def top(val_not_found):
    i = 0
    while val_not_found.value:  # spin until bottom() clears the shared flag
        i += 1


def bottom(val_not_found):
    val_not_found.value = 0  # stops top() almost immediately


if __name__ == '__main__':
    # A plain global is not shared across processes, so use a shared Value.
    val_not_found = Value('i', 1)
    p1 = Process(target=top, args=(val_not_found,))
    p2 = Process(target=bottom, args=(val_not_found,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
mikeLundquist