1

So I have a csv file I want to import and want to skip importing both the duplicate and the original line from the csv file based on a user number in the first column and I'm using the StringIO module. The way I'm doing it currently is below which is incorrect because even though it skips the duplicate line, it'll still import the original line I believe. What would be the best way to skip importing both duplicate and original lines from csv?

    def csv_import(stream):

        ostream = StringIO()
        headers = stream.readline()
        ostream.write(headers)

        seen_user_numbers = {}

        for row in stream:
            list_row = row.split(',')
            user_number = list_row[0]

            if user_number in seen_user_numbers:
                seen_user_numbers.pop(user_number)
                continue

            seen_user_numbers[user_number] = True
            ostream.write(row)

        ostream.seek(0)
        return ostream
   
teightdev
  • 7
  • 4
  • So, any user mentioned twice should be filtered completely? BTW: It's a bit confusing that you talk about duplicate lines but only compare the first column. – Ulrich Eckhardt Jan 18 '22 at 13:49
  • I want skip both duplicate + original based on the values in the first column. If the first column contains duplicate values then I want to skip importing both duplicate + original – teightdev Jan 18 '22 at 13:54

1 Answers1

1

Because you can't be sure that lines will be included until you reach the end of the input file, you will need to store all the non-excluded lines in memory before you can write them out to file.

You could do this with a dictionary:

def csv_import(stream):

    ostream = StringIO()
    headers = stream.readline()
    ostream.write(headers)

    outputLines = dict()  # will use None for lines to exclude

    for row in stream:
        list_row = row.split(',')
        user_number = list_row[0]
        
        if user_number in outputLines:
            outputLines[user_number] = None
        else:
            outputLines[user_number] = row
        
    for row in filter(None,outputLines.values()):
        ostream.write(row)

    ostream.seek(0)
    return ostream
Alain T.
  • 40,517
  • 4
  • 31
  • 51
  • thanks! why did you choose ```dict()``` instead of literal ```{}```...isn't ```{}``` faster? – teightdev Jan 18 '22 at 14:25
  • just a personal preference. I find dict() to be more expressive of the intent. Also, it takes me an extra microsecond of thought to mentally equate {} to dict() and not to an empty set (but that's probably just me) – Alain T. Jan 18 '22 at 14:33