In a script, I'm writing lines to a file, but some of the lines may be duplicates. So I've created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove duplicates, then write to the real file.

So I wrote a simple for loop to iterate through every line in my intermediate file and remove any duplicates.

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

My problem is that the body of the for loop never executes. I can verify this by setting a breakpoint in my debugger; the loop is skipped and the function exits. I even read this answer (https://stackoverflow.com/a/40553378/8117081) and inserted the code cStringIO.OutputType.getvalue(f_temp), but that didn't solve my issue.

I'm lost as to why I can't read and iterate through my file-like object.

Alureon
  • is `f_temp` a file-object? What is the purpose of `cStringIO.OutputType.getvalue(f_temp)`...? – juanpa.arrivillaga Feb 07 '18 at 21:29
  • @juanpa.arrivillaga Yes, it's a file-like object. Apparently, the purpose of `cStringIO.OutputType.getvalue(f_temp)` is to convert the `cStringIO` file-like object into the `Output` type so it can be read. See [this](https://stackoverflow.com/a/40553378/8117081) comment. – Alureon Feb 07 '18 at 21:52

1 Answer

The answer you referenced was a little incomplete. It shows how to get the cStringIO buffer as a string, but you then have to do something with that string; the call in your code discards the return value. You can use it like this:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()     # simpler approach
    contents = contents.rstrip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')     # convert to iterable

    for line in lines:  # Iterate through the list of lines.
        digest = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if digest not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')  # Write the original line, not its hash.
            lines_seen.add(digest)
    f_out.close()

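But it is probably better to use normal IO operations on the f_temp "file handle". After you finish writing, the file position is at the end of the buffer, so iterating from there finds nothing to read. Moving the position back to the start fixes that: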

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        digest = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if digest not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)  # Write the original line, not its hash.
            lines_seen.add(digest)
    f_out.close()
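
To see why the original loop was skipped, it helps to look at the stream position directly. This is a minimal sketch, assuming Python 3's `io.StringIO` as a stand-in for `cStringIO.StringIO` (which exists only in Python 2); the variable names are made up for illustration.

```python
import io

buf = io.StringIO()
buf.write('string 1\n')
buf.write('string 2\n')

# After writing, the position sits at the end of the buffer,
# so iterating from here yields nothing.
pos_after_write = buf.tell()
lines_without_seek = list(buf)

# getvalue() returns the whole buffer but does not move the position.
contents = buf.getvalue()
pos_after_getvalue = buf.tell()

# seek(0) rewinds to the start, after which iteration works.
buf.seek(0)
lines_after_seek = list(buf)

print(lines_without_seek)
print(lines_after_seek)
```

The first `print` shows an empty list and the second shows both lines, which matches the behavior described in the question.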

Here's a test that works with either version:

import cStringIO

def define_outputs(dir_out):
    return open('/tmp/test.txt', 'w')

def compute_md5(line):
    return line

f = cStringIO.StringIO()
f.write('string 1\n')
f.write('string 2\n')
f.write('string 1\n')
f.write('string 2\n')
f.write('string 3\n')

remove_duplicates(f, 'tmp')
with open('/tmp/test.txt', 'r') as f_in:  # separate name so the buffer `f` is not shadowed
    print(str([row for row in f_in]))
# ['string 1\n', 'string 2\n', 'string 3\n']
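
The test above uses a pass-through `compute_md5`. If the real function returns a hash, the set should hold the hash while the original line is what gets written. Here is a sketch of that under stated assumptions: Python 3 with `io.StringIO` and `hashlib`, a hypothetical helper named `dedupe_lines`, and an in-memory output instead of `define_outputs`.

```python
import hashlib
import io

def dedupe_lines(src, dst):
    """Copy lines from src to dst, skipping lines that were already seen."""
    seen = set()
    src.seek(0)  # rewind so iteration starts at the first line
    for line in src:
        # Store only the 16-byte digest, not the line itself.
        digest = hashlib.md5(line.encode('utf-8')).digest()
        if digest not in seen:
            dst.write(line)  # write the original line, not the digest
            seen.add(digest)

src = io.StringIO()
for text in ('string 1\n', 'string 2\n', 'string 1\n', 'string 3\n'):
    src.write(text)

out = io.StringIO()
dedupe_lines(src, out)
print(out.getvalue().splitlines())
# ['string 1', 'string 2', 'string 3']
```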
Matthias Fripp
  • `f_temp.seek(0)` works! Thank you! I've another quick question. Since `f_temp` (or any `cStringIO` object) is a "file-like" object, is it necessary to write `f_temp.close()` once I'm done reading all of its lines? – Alureon Feb 07 '18 at 22:02
  • I certainly would close it after you are done with it. With files or StringIO, the related resources are automatically released by the garbage collector when the last reference goes out of scope, but it's not considered good form to rely on that. Better to close the object explicitly when you're done with it. That is especially important if you'll be quickly creating and closing lots of them. Often this is easiest to achieve by using a `with` clause on the `open` step. – Matthias Fripp Feb 07 '18 at 22:16
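
As a sketch of the explicit-close advice in the comment above: `contextlib.closing` wraps any object that has a `close()` method, so it can be used even with objects that are not context managers themselves. This assumes Python 3's `io.StringIO` as a stand-in for `cStringIO.StringIO`.

```python
import contextlib
import io

with contextlib.closing(io.StringIO()) as buf:
    buf.write('string 1\n')
    buf.write('string 2\n')
    buf.seek(0)          # rewind before reading
    lines = list(buf)

# Leaving the with block calls buf.close(); the lines read inside
# the block remain available, but the buffer itself is now closed.
print(buf.closed, lines)
```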