I'm writing a program to search for a specific line in a very large (unordered) file (so it would be preferred not to load the entire file into memory).
I'm implementing multi threading to speed up the process. I'm trying to give a particular thread a particular part of the file i.e., the first thread would run through the first quarter of the file, the 2nd thread scans (simultaneously) from the endpoint of where the first thread stops and so on.
So to do this I need to find byte location of different parts of the file for simplicity of the question lets say I just want to find the middle of the file. But the problem is each line has a different length so if I just do
fo.seek(0, 2)
end = fo.tell()
mid = end/2
fo.seek(mid, 0)
It could give me the middle of the line. So I need a way to seek to the next or previous newline. Also, note I dont want the exact middle just somewhere around it (since its a very large file).
Heres what I was able to code, I'm not sure whether this loads the file into memory or not. And I would really like to avoid opening 2 instances of the same file (I did so in my program because I didnt want to worry about the offset changing when I read the file).
Any modification (or a new program) which is faster would be appreciated.
fo = open(filename, "rw+")
f2 = open(filename, "rw+")
file_ = dict()
fo.seek(0, 2)
file_['end'] = fo.tell()
file_['mid'] = file_['end'] / 2
fo.seek(file_['mid'], 0)
f2.seek(file_['mid'], 0)
line = f2.readline()
fo.seek(f2.tell(), 0)
file_['mid'] = f2.tell()
fo.seek(file_['mid'], 0)
print fo.readline()