Iterating multiple times through a file is possible (you can rewind the file to the start by calling thefile.seek(0)) but likely to be very costly.
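For instance, a minimal sketch of rewinding with seek(0) (the filename data.txt is just a placeholder for illustration):

    with open('data.txt') as f:
        first_pass = sum(1 for _ in f)   # consumes the iterator to the end
        f.seek(0)                        # rewind to the start of the file
        second_pass = sum(1 for _ in f)  # iterates over all the lines again
    assert first_pass == second_pass

Doing that once or twice is fine; doing it once per line of the other file is what gets prohibitively expensive.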
Let's say, for generality, that you have a function to identify the key given a line, e.g.:

    def getkey(line):
        return line.split()[1]

in your example, where the key is the second of three space-separated words in the line. Now, if the data from the second file fits comfortably in RAM (so, up to a few GB -- think how long it would take to iterate hundreds of times over that!-)...:
    key2line = {}
    with open(secondfile) as f:
        for line in f:
            key2line[getkey(line)] = line

    with open(firstfile) as f:
        order = [line.strip() for line in f]

    with open(outputfile, 'w') as f:
        for key in order:
            f.write(key2line[key])
Now isn't that a pretty clear and effective approach...?
If the second file is too big to fit into memory, but only by a small factor (say 10 times or so more than what you can actually fit), you may still be able to solve the problem at the cost of a lot of jumping around in the file, by using seek and tell.
The first loop would become:
    key2offset = {}
    with open(secondfile) as f:
        while True:
            offset = f.tell()        # position at which the next line starts
            line = f.readline()
            if not line:             # end of file reached
                break
            key2offset[getkey(line)] = offset
and the last loop would become:
    with open(secondfile) as f:
        with open(outputfile, 'w') as f1:
            for key in order:
                f.seek(key2offset[key])
                line = f.readline()
                f1.write(line)
A bit more complex, and much slower -- but still way faster than re-reading a file of tens of GB a bazillion times, over and over!-)
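For completeness, here is the seek/tell variant put together as one self-contained sketch; the function name, the command-line wrapper, and the filenames are placeholders for illustration, not part of the original question:

    import sys

    def getkey(line):
        # the key is the second space-separated word, as in the example above
        return line.split()[1]

    def reorder_by_key(firstfile, secondfile, outputfile):
        # pass 1: index the start offset of every line in the second file
        key2offset = {}
        with open(secondfile) as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                key2offset[getkey(line)] = offset

        # pass 2: read the desired key order from the first file
        with open(firstfile) as f:
            order = [line.strip() for line in f]

        # pass 3: emit the second file's lines in that order
        with open(secondfile) as f, open(outputfile, 'w') as out:
            for key in order:
                f.seek(key2offset[key])
                out.write(f.readline())

    if __name__ == '__main__':
        reorder_by_key(sys.argv[1], sys.argv[2], sys.argv[3])

Note that a key present in the first file but missing from the second will raise a KeyError, just as in the dict-in-RAM version above.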