
I have the following code for detecting duplicates in a file and outputting them into 3 separate files: one for non-duplicates, one for duplicates (x2), and one for duplicates (> x2). The first file holds only lines that had no duplicates in the original file. (It does not deduplicate: lines that occur more than once never appear in the first file; only true singles go there.)

import sys
import collections


file_in = sys.argv[1]
file_ot = file_in + ".proc"    # lines whose key occurs once
file_ot2 = file_in + ".proc2"  # lines whose key occurs exactly twice
file_ot3 = file_in + ".proc3"  # lines whose key occurs more than twice


counter = 0  # total number of input lines read

dict_in = collections.defaultdict(list)
with open(file_in, "r") as f:
    for line in f:
        counter += 1
        fixed_line = line.strip()
        line_list = fixed_line.split(";")
        key = line_list[0][:12]  # group on the first 12 characters of the first field
        print(":Key: " + key)
        dict_in[key].append(line)


with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    for values in dict_in.values():
        if len(values) == 1:
            f1.writelines(values)
        elif len(values) == 2:
            f2.writelines(values)
        else:
            f3.writelines(values)



print("Read: " + str(counter) + " lines")

The above code works, but for very large files (~1 GB) it takes about ten minutes to chomp through them on my system. Is there a way to optimize the speed of this code, or any suggestions in that direction? Thank you in advance!
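
(Purely as an illustration of one possible direction, not something from the thread: the keys can be counted in a first pass with `collections.Counter`, and the file then streamed a second time so every line goes straight to the right output file and nothing is buffered in memory. This sketch assumes, as in the sample data below, that the key is simply the first 12 characters of each line.)

import collections
import sys

file_in = sys.argv[1]

# First pass: count how often each 12-character key occurs.
counts = collections.Counter()
with open(file_in) as f:
    for line in f:
        counts[line[:12]] += 1

# Second pass: route each line by its key's count, without holding lines in memory.
with open(file_in + ".proc", "w") as f1, \
        open(file_in + ".proc2", "w") as f2, \
        open(file_in + ".proc3", "w") as f3, \
        open(file_in) as f:
    for line in f:
        n = counts[line[:12]]
        (f1 if n == 1 else f2 if n == 2 else f3).write(line)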

Input data example:

0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;

Expected output:

FILE1:
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;

FILE2:
0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;

FILE3:
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
– onlyf
  • Show your input data and expected output. – Alderven Feb 19 '19 at 12:16
  • Your example might be a bit erroneous, as I wouldn't call `0000AAAAAAAA;X;;X;` and `0000AAAAAAAA;X;X;;` duplicates, since they differ in the end. Or do you always want to compare only the first 12 characters, as I would assume from the part `key = line_list[0][:12]`? – TabeaKischka Feb 19 '19 at 12:45
  • also, if you were on a UNIX system, I'm optimistic that using `sort` and `uniq` would be faster than Python (a Python sketch of that idea follows these comments). – TabeaKischka Feb 19 '19 at 12:47
  • It's not a Unix system; this is run on Windows, and I'm only interested in duplicates found in the first 12 characters. – onlyf Feb 19 '19 at 12:48
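
A rough Python equivalent of the `sort`/`uniq` idea from the comments (only a sketch: it still sorts the whole file in memory, and it groups on the same 12-character key as the question's code):

import itertools
import sys

file_in = sys.argv[1]

# Sort by the 12-character key so lines with equal keys become adjacent,
# then group consecutive equal keys the way `uniq` would.
with open(file_in) as f:
    lines = sorted(f, key=lambda l: l[:12])

with open(file_in + ".proc", "w") as f1, \
        open(file_in + ".proc2", "w") as f2, \
        open(file_in + ".proc3", "w") as f3:
    for _, grp in itertools.groupby(lines, key=lambda l: l[:12]):
        group = list(grp)
        target = f1 if len(group) == 1 else f2 if len(group) == 2 else f3
        target.writelines(group)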

1 Answer


I tested it with a 543 MB file of random text.

import time

myList = []

start = time.time()
with open("myFile.txt") as f:
    for line in f:
        myList.append(line.rstrip("\n"))  # strip the newline, keep every line

with open("dupListaOne.txt", "w") as f1, open("dupListMore.txt", "w") as f2, open("UniqueList.txt", "w") as f3:
    new_list = sorted(set(myList))
    for item in new_list:
        a = myList.count(item)  # occurrences of this line in the whole file
        if a - 1 == 1:
            f1.write("%s %d\n" % (item, a - 1))
        elif a - 1 > 1:
            f2.write("%s %d\n" % (item, a - 1))
        else:
            f3.write("%s %d\n" % (item, a - 1))
end = time.time()
print("Time: ", end - start)

Elapsed time: 123.83 sec, i.e. about 2 minutes.
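
For comparison (a sketch, not part of the original answer): `myList.count()` rescans the entire list once for every unique line, so the same counts can be produced in a single pass with `collections.Counter`, keeping the answer's output files and format.

import collections
import time

start = time.time()

# Count every line once instead of calling myList.count() per unique line.
counts = collections.Counter()
with open("myFile.txt") as f:
    for line in f:
        counts[line.rstrip("\n")] += 1

with open("dupListaOne.txt", "w") as f1, open("dupListMore.txt", "w") as f2, open("UniqueList.txt", "w") as f3:
    for item, a in sorted(counts.items()):
        if a - 1 == 1:
            f1.write("%s %d\n" % (item, a - 1))
        elif a - 1 > 1:
            f2.write("%s %d\n" % (item, a - 1))
        else:
            f3.write("%s %d\n" % (item, a - 1))

end = time.time()
print("Time: ", end - start)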

– Gaming.ingrs