I'm writing a program to search for a specific line in a very large (unordered) file, so I'd prefer not to load the entire file into memory.

I'm implementing multithreading to speed up the process. I'm trying to give each thread a particular part of the file, i.e. the first thread would run through the first quarter of the file, the second thread would scan (simultaneously) from the point where the first thread's section ends, and so on.

To do this I need to find the byte locations of different parts of the file. For simplicity, let's say I just want to find the middle of the file. The problem is that each line has a different length, so if I just do

fo.seek(0, 2)      # jump to the end of the file
end = fo.tell()    # total size in bytes
mid = end / 2
fo.seek(mid, 0)    # jump to the (approximate) middle

it could land me in the middle of a line. So I need a way to seek to the next or previous newline. Also, note I don't want the exact middle, just somewhere around it (since it's a very large file).

Here's what I was able to code. I'm not sure whether this loads the file into memory or not, and I would really like to avoid opening two handles on the same file (I did so in my program because I didn't want to worry about the offset changing when I read the file).

Any modification (or a new program) which is faster would be appreciated.

fo = open(filename, "r")    # read-only access is enough for searching
f2 = open(filename, "r")    # second handle, used only to find the next newline

file_ = dict()

fo.seek(0, 2)               # jump to the end of the file
file_['end'] = fo.tell()    # total size in bytes

file_['mid'] = file_['end'] / 2

f2.seek(file_['mid'], 0)    # jump to the approximate middle
f2.readline()               # consume the rest of the partial line we landed in

file_['mid'] = f2.tell()    # f2 now sits at the start of the next full line
fo.seek(file_['mid'], 0)    # move the search handle there as well

print fo.readline()
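
For reference, the same readline trick works with a single handle, and generalises from "the middle" to any number of chunks. A rough sketch (the helper name chunk_offsets and the binary "rb" mode are my own choices, not from the code above):

def chunk_offsets(filename, parts):
    # Return newline-aligned byte offsets that split the file into `parts` pieces.
    # Worker i would then scan from offsets[i] up to offsets[i + 1].
    with open(filename, "rb") as f:
        f.seek(0, 2)                   # jump to the end to get the file size
        size = f.tell()
        offsets = [0]
        for i in range(1, parts):
            f.seek(i * size // parts)  # rough boundary, probably mid-line
            f.readline()               # skip the rest of the partial line
            offsets.append(f.tell())   # start of the next full line
        offsets.append(size)
    return offsets

For example, chunk_offsets(filename, 4) would give the four quarter boundaries mentioned above, each aligned to the start of a line.
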
Grim Reaper
  • As always, IO operations rarely gain speed improvements from multithreading. And I can't decipher your code, but I guess after `fo.seek(file_['mid']); fo.readline()` you're at a newline, i.e. isn't `fo.tell()` what you're looking for? – alko Dec 06 '13 at 14:19
  • @alko I thought that in python multithreading's sole purpose was to help with IO operations, and multiprocessing was to help with cpu operations... – Steve P. Dec 06 '13 at 14:25
  • I don't see how reading from a file from several threads could possibly be faster. It could be interesting after the asker figures it out to compare to a simple "for line in open('myfile'):" – Boo Dec 06 '13 at 14:37
  • @SteveP. Generally speaking, you're right. But I meant that reading **one** file line by line in parallel in several threads won't give much improvement. It's sometimes worth implementing more sophisticated approaches, e.g. reading in bulk in the main thread (consecutive reads), with the size depending on hardware, and processing the retrieved data in threads. – alko Dec 06 '13 at 14:40
  • @alko The program given does actually work and I do get the approximate middle of the file stored in file_['mid']. I'm asking whether there's a cleaner way to do it. – Grim Reaper Dec 06 '13 at 14:51
  • @GrimReaper do you ever get into `while '\n' not in line:` part? – alko Dec 06 '13 at 14:52
  • @Boo I'm not very sure about that. But the way I see it, if 2 threads search for a line from 2 different directions, it should be faster than 1 thread searching linearly in a large file – Grim Reaper Dec 06 '13 at 14:52
  • 2
    read http://stackoverflow.com/a/3055497/1265154 – alko Dec 06 '13 at 14:54
  • @alko Come to think of it, you're right, that was unnecessary as readline will only stop at a '\n'. Updated. – Grim Reaper Dec 06 '13 at 14:58
  • @alko Okay, but what if I have to send the line to another system for it to be checked i.e., use something like urllib to send requests. Will multi-threading help then ? – Grim Reaper Dec 06 '13 at 15:04
  • 3
    If you're bound by the read speed, `2 threads search for a line from 2 different directions` would each be running at half the speed of `a single thread searching linearly`. If your check is expensive (e.g. you're querying an external service), it's simpler and easier to read the file linearly in one thread and use a pool of workers to do the checks in parallel. – Lie Ryan Dec 06 '13 at 15:21
  • @alko This is true not just of a single file, but in many cases, of reading different files on the same physical disk. – Gort the Robot Dec 06 '13 at 16:55
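
If the per-line check really is the expensive part (the asker mentions sending each line to another system with urllib), the single-reader / worker-pool pattern described in the comment above could look roughly like this. check_line and the searched-for text are placeholders, not anything from the question:

from multiprocessing import Pool

def check_line(line):
    # placeholder for the expensive per-line test (e.g. an HTTP request)
    return line if "needle" in line else None

def find_line(filename, workers=4):
    # one process reads the file sequentially; the pool only runs the checks
    pool = Pool(workers)
    try:
        with open(filename) as f:
            for result in pool.imap_unordered(check_line, f, chunksize=1000):
                if result is not None:
                    return result      # stop as soon as any worker finds it
    finally:
        pool.terminate()
    return None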

1 Answer

How large is very large? grep tears relatively quickly through even 1-10GB files.

If the file is static and you plan to search through it repeatedly, you could split it:

split -l <line_count> <file>

Now you have multiple files, and can pass each to a separate thread/process/whatever.
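
For example, a sketch of fanning those chunk files out to a process pool; the glob pattern assumes split's default xaa, xab, ... output names, and "needle" stands in for whatever you're searching for:

from multiprocessing import Pool
import glob

def search_chunk(path):
    # scan one chunk file line by line and return the first match (or None)
    with open(path) as f:
        for line in f:
            if "needle" in line:
                return line
    return None

if __name__ == "__main__":
    chunks = sorted(glob.glob("x??"))   # the pieces produced by split
    matches = Pool().map(search_chunk, chunks)
    print([m for m in matches if m is not None])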

Is the file sorted? That changes things again, since now you can just binary search with fo.seek() calls.
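
A rough sketch of that idea, bisecting on raw byte offsets and realigning to the start of a line after every seek (this assumes the file is sorted by plain byte-string comparison):

def search_sorted_file(filename, target):
    # Binary search on byte offsets; after every seek, skip to the start of
    # the next full line before comparing. Returns the matching line or None.
    with open(filename, "rb") as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()                 # skip the partial line we landed in
            line = f.readline()
            if line and line.rstrip("\r\n") < target:
                lo = mid + 1                 # target can only start after mid
            else:
                hi = mid
        f.seek(lo)                           # lo now sits just before the candidate line
        if lo > 0:
            f.readline()
        candidate = f.readline().rstrip("\r\n")
        return candidate if candidate == target else None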

How fast is fast enough? Beyond a certain point, you're going to have to build a search index. Up to that point, simple tools like grep, split, etc. work wonders.

Without more information, it's impossible to say what the right tradeoffs are here.

candu