In a script, I'm writing lines to a file, but some of the lines may be duplicates. To handle this, I've created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove the duplicates, and then write the result to the real file.
To remove the duplicates, I wrote a simple for loop that iterates through every line in the intermediate file and skips any line it has already seen:
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
My problem is that the for loop never gets executed. I can verify this by putting a breakpoint on it in my debugger: the loop body is skipped and the function exits. I even read the answer linked in the code comment above and inserted the call cStringIO.OutputType.getvalue(f_temp), but that didn't solve my issue.
I'm lost as to why I can't read and iterate through my file-like object.
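For reference, here is a minimal sketch that reproduces the symptom outside the script. It assumes Python 3's io.StringIO behaves like the cStringIO object here (an assumption, since the original code is Python 2), and the sample lines are made up. After writing, the stream position sits at the end of the buffer, so iterating yields nothing. getvalue() returns the full contents but does not move the position, and its return value has to be used to have any effect. Rewinding with seek(0) is what makes iteration start from the first line.

```python
import io

# Assumption: io.StringIO (Python 3) stands in for the cStringIO object.
buf = io.StringIO()
buf.write("alpha\n")
buf.write("beta\n")
buf.write("alpha\n")

# Right after writing, the position is at the end of the buffer,
# so iterating produces no lines at all.
lines_without_rewind = list(buf)

# getvalue() returns the whole buffer regardless of the position,
# but calling it does not rewind the stream.
contents = buf.getvalue()

# Moving the position back to the start makes iteration work.
buf.seek(0)
lines_after_rewind = list(buf)

print(lines_without_rewind)
print(repr(contents))
print(lines_after_rewind)
```

If the same holds for cStringIO, calling f_temp.seek(0) before the for loop in remove_duplicates would be the thing to try.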