Iterating multiple times through a file is possible (you can rewind the file to the start by calling thefile.seek(0)) but likely to be very costly.
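For instance, a minimal sketch of rewinding with seek(0) (the filename data.txt is just a placeholder for illustration):

    with open('data.txt') as f:
        first_pass = sum(1 for _ in f)   # consumes the iterator to the end
        f.seek(0)                        # rewind to the start of the file
        second_pass = sum(1 for _ in f)  # iterates over all the lines again
    assert first_pass == second_pass

Doing that once or twice is fine; doing it once per line of the other file is what gets prohibitively expensive.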
Let's say, for generality, that you have a function to identify the key given a line, e.g.:

    def getkey(line):
        return line.split()[1]

in your example, where the key is the second of three space-separated words in the line. Now, if the data from the second file fits comfortably in RAM (so, up to a few GB -- think how long it would take to iterate hundreds of times over that!-)...:
    key2line = {}
    with open(secondfile) as f:
        for line in f:
            key2line[getkey(line)] = line

    with open(firstfile) as f:
        order = [line.strip() for line in f]

    with open(outputfile, 'w') as f:
        for key in order:
            f.write(key2line[key])
Now isn't that a pretty clear and effective approach...?
If the second file is too big to fit into memory, but only by a small factor (say 10 times or so more than what you can actually fit), you may still be able to solve the problem at the cost of a lot of jumping around in the file, by using seek and tell.
The first loop would become:
    key2offset = {}
    with open(secondfile) as f:
        while True:
            offset = f.tell()        # position at which the next line starts
            line = f.readline()
            if not line:             # end of file reached
                break
            key2offset[getkey(line)] = offset
and the last loop would become:
    with open(secondfile) as f:
        with open(outputfile, 'w') as f1:
            for key in order:
                f.seek(key2offset[key])
                line = f.readline()
                f1.write(line)
A bit more complex, and much slower -- but still way faster than re-reading a file of tens of GB a bazillion times, over and over!-)
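For completeness, here is the seek/tell variant put together as one self-contained sketch; the function name, the command-line wrapper, and the filenames are placeholders for illustration, not part of the original question:

    import sys

    def getkey(line):
        # the key is the second space-separated word, as in the example above
        return line.split()[1]

    def reorder_by_key(firstfile, secondfile, outputfile):
        # pass 1: index the start offset of every line in the second file
        key2offset = {}
        with open(secondfile) as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                key2offset[getkey(line)] = offset

        # pass 2: read the desired key order from the first file
        with open(firstfile) as f:
            order = [line.strip() for line in f]

        # pass 3: emit the second file's lines in that order
        with open(secondfile) as f, open(outputfile, 'w') as out:
            for key in order:
                f.seek(key2offset[key])
                out.write(f.readline())

    if __name__ == '__main__':
        reorder_by_key(sys.argv[1], sys.argv[2], sys.argv[3])

Note that a key present in the first file but missing from the second will raise a KeyError, just as in the dict-in-RAM version above.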