
I am trying to merge two large input files into one sorted output file, sorting as I go.

# Above I counted the number of lines in each table

print("Processing Table Lines: table 1 has " + str(count1) + " and table 2 has " + str(count2) )
newLine, compare, line1, line2 = [], 0, [], []

while count1 + count2 > 0:
    if count1 > 0 and compare <= 0: count1, line1 = count1 - 1, ifh1.readline().rstrip().split('\t')
    else: line1 = []
    if count2 > 0 and compare >= 0: count2, line2 = count2 - 1, ifh2.readline().rstrip().split('\t')
    else: line2 = []

    compare = compareTableLines( line1, line2 )
    newLine = mergeLines( line1, line2, compare, tIndexes )

    ofh.write('\t'.join(newLine) + '\n')

What I expect to happen is that, as each line is written to output, the next line is pulled from whichever file was just consumed, if one is available. I also expect the loop to exit once both files are exhausted.

However, I keep getting this error: `ValueError: Mixing iteration and read methods would lose data`

I just don't see how to get around it. Both files are too large to hold in memory, so I want to read as I go.
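For context, that `ValueError` is raised by Python 2 file objects: `for line in f` uses an internal read-ahead buffer, so a later `f.readline()` on the same handle would lose buffered data. One way to sidestep the mix entirely is to pull lines with `next()` instead of `readline()`. A minimal sketch, using an in-memory stream to stand in for the real file handles:

```python
import io

def pull(fh):
    # next(fh, '') advances the iterator and returns '' at EOF
    # instead of raising StopIteration, mimicking readline()'s behavior.
    return next(fh, '').rstrip('\n')

f = io.StringIO("a\tb\nc\td\n")
print(pull(f).split('\t'))  # ['a', 'b']
print(pull(f).split('\t'))  # ['c', 'd']
print(pull(f))              # '' (exhausted)
```

Since `next()` goes through the same iteration protocol as the `for` loop, the two can be mixed freely.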

bnp0005
  • Are the two input files sorted? Can you show us some examples? And what does `compareTableLines()` do? – brunns Apr 25 '19 at 07:56

1 Answer


Here's an example of merging two ordered files, CSV files in this case, using heapq.merge() and itertools.groupby(). Given 2 CSV files:

x.csv:

key1,99
key2,100
key4,234

y.csv:

key1,345
key2,4
key3,45

Running:

import csv, heapq, itertools

keyfun = lambda row: row[0]  # merge and group on the first column

with open("x.csv") as inf1, open("y.csv") as inf2, open("z.csv", "w") as outf:
    in1, in2, out = csv.reader(inf1), csv.reader(inf2), csv.writer(outf)
    # heapq.merge() lazily yields rows from both sorted inputs in key order;
    # groupby() then batches consecutive rows that share the same key.
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        out.writerow([key, sum(int(r[1]) for r in rows)])

we get:

z.csv:

key1,444
key2,104
key3,45
key4,234
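A note on memory, since the question says the files are too large to hold in memory: `heapq.merge()` never materializes its inputs. It keeps only one pending item per stream, so two huge sorted files merge in constant memory. A tiny illustration with plain iterators standing in for the file readers:

```python
import heapq

# heapq.merge() consumes its inputs lazily: it holds one pending item
# per stream, so arbitrarily large sorted streams merge in O(k) memory
# for k inputs.
merged = heapq.merge(iter([1, 3, 5]), iter([2, 4, 6]))
print(list(merged))  # [1, 2, 3, 4, 5, 6]
```

The only requirement is that each input is already sorted by the merge key, which matches the situation in the question.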

brunns